WO2023082931A1 - Method for punctuation recovery in speech recognition, and device and storage medium - Google Patents

Method for punctuation recovery in speech recognition, and device and storage medium

Info

Publication number
WO2023082931A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
audio
samples
punctuation
sample
Application number
PCT/CN2022/125163
Other languages
French (fr)
Chinese (zh)
Inventor
吴礼蔚
朱耀明
程善伯
王明轩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023082931A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to speech recognition, including punctuation recovery in speech recognition.
  • Automatic Speech Recognition (ASR), a technology that converts human speech into text, has a wide range of applications and can serve as an upstream component for multiple tasks, such as voice assistants and speech translation.
  • Existing commercial speech recognition systems often output text without punctuation, which may lead to misunderstandings and affect the performance of downstream tasks such as machine translation and information extraction.
  • Text without punctuation is difficult to read: it has poor readability, unclear sentence boundaries, and ambiguity. Moreover, downstream tasks such as machine translation and information extraction assume punctuated input, so text without punctuation can degrade their performance.
  • a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and a training unit configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • an electronic device including: a memory; and a processor coupled to the memory, the processor configured, based on instructions stored in the memory, to perform the method of any embodiment described in the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, causes the method of any embodiment described in the present disclosure to be implemented.
  • a computer program product comprising instructions which, when executed by a processor, cause the method of any one of the embodiments described in the present disclosure to be implemented.
  • a computer program whose program code, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
  • FIGS. 1A to 1C show flowcharts of a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • Fig. 2 shows a flowchart of a method for recovering punctuation in speech recognition according to some embodiments of the present disclosure.
  • FIG. 3 illustrates an exemplary implementation of model training for speech recognition punctuation recovery according to some embodiments of the present disclosure.
  • FIG. 4A shows a block diagram of a training device for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • FIG. 4B shows a block diagram of a speech recognition punctuation recovery device according to some embodiments of the present disclosure.
  • Figure 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
  • FIG. 6 shows a block diagram of other embodiments of the electronic device of the present disclosure.
  • The term "comprising" and its variants as used in the present disclosure are open-ended terms that include at least the following elements/features but do not exclude other elements/features, i.e., "including but not limited to"; "including" is used synonymously with "comprising".
  • the term “based on” means “based at least in part on”.
  • References throughout this specification to "one embodiment," "some embodiments," or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments.”
  • Appearances of the phrases "in one embodiment," "in some embodiments," or "in an embodiment" in various places throughout the specification are not necessarily all referring to the same embodiment, but they may refer to the same embodiment.
  • To solve the problem of missing punctuation in the output text of speech recognition, a punctuation recovery task is proposed and a corresponding punctuation model is designed.
  • The punctuation recovery task aims to correctly add punctuation marks to the output of the automatic speech recognition system, such as output text sentences; in implementations, punctuation recovery is performed using a corresponding punctuation model.
  • Conventional punctuation models either use text only or require audio corresponding to the text, and are therefore often limited in real-world scenarios.
  • The present disclosure proposes an improved solution that can use both text with audio ("audio text") and text without audio ("non-audio text") for punctuation model training.
  • The scheme of the present disclosure can make full use of non-audio text to construct a large training set for the punctuation model and can use limited audio for further optimization, thereby improving the accuracy of the punctuation model and providing a model whose punctuation recovery of text is enhanced by optional audio.
  • the present disclosure proposes a so-called unified multimodal punctuation framework, which may be called UniPunc, which is capable of utilizing both audio text and non-audio text for punctuation model training.
  • For text with audio, the corresponding text-audio pair can be obtained directly; for text without audio, corresponding audio can be fabricated by constructing virtual content, so that a text-audio pair can likewise be constructed.
  • This enables punctuation model training based on text and audio pairs obtained from both audio text and non-audio text. It is thus possible to utilize readily available large amounts of non-audio text/corpus as a training set, and utilize sound input to reduce ambiguity in punctuation marks, resulting in punctuation models with improved accuracy.
  • The solution disclosed in the present disclosure can further improve the training and application of the punctuation model in multilingual scenarios.
  • Conventional multimodal methods can only handle single-language input, and data in some languages is difficult to obtain and is very rare. For example, it is difficult to obtain a large amount of high-quality text and audio data in many small languages, which cannot meet the needs of model training.
  • Training directly on such scarce data results in a model that performs very poorly.
  • The present disclosure uses a multilingual collaborative training method to address insufficient data for low-resource languages.
  • The present disclosure can simultaneously use audio and text in multiple languages for model training, so that readily available data from widely used languages enhances performance on low-resource languages, yielding an improved multilingual punctuation model.
  • FIG. 1A shows a training method of a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • In step S101, text samples and corresponding audio samples for model training are obtained, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and in step S102, the model for punctuation recovery in speech recognition is trained based on the acquired text samples and corresponding audio samples for model training.
  • The text samples and corresponding audio samples used for model training substantially correspond to a set of paired text and audio samples: for each text sample, a corresponding audio sample is obtained, which is either audio of the text sample's content or a virtual sample standing in for fictitious audio of the text sample.
  • The text samples and corresponding audio samples used for model training are obtained from both audio text and non-audio text, for example from an initial set containing both.
  • Paired text and audio samples can be obtained from audio text by converting/extracting/segmenting in various appropriate ways, which will not be described in detail here.
  • For non-audio text, a text sample is obtained (such a text sample may be referred to as a single text sample) and a virtual sample is used as the corresponding audio sample of that text sample.
  • A virtual sample may be a preset sample representing fictitious audio for a text sample.
  • a virtual sample may take any suitable form. In particular, it may be a vector with preset content, such as a vector with fixed length and fixed content, preferably an all-zero vector. Of course, it can also be in other suitable forms. In some embodiments, for all single text samples used for model training, virtual samples can be constructed for them, and their virtual samples can be the same or different from each other.
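  • As an illustration, a minimal sketch (not from the source; shapes, dimensions, and helper names are assumptions) of constructing such an all-zero virtual audio sample and pairing it with an audio-free text sample might look as follows:

```python
from typing import Optional

import numpy as np

# The fabricated ("virtual") audio sample for audio-free text is a preset
# vector of fixed length and fixed content, here the all-zero variant
# mentioned above. D and VIRTUAL_LEN are hypothetical hyperparameters.
D = 512          # assumed acoustic feature dimension
VIRTUAL_LEN = 1  # assumed fixed length of the fabricated audio sequence

def make_virtual_audio_sample() -> np.ndarray:
    """Return a preset (VIRTUAL_LEN, D) all-zero 'audio' sample."""
    return np.zeros((VIRTUAL_LEN, D), dtype=np.float32)

def pair_sample(text: str, audio: Optional[np.ndarray]):
    """Pair a text sample with real audio features, or a virtual sample if none exist."""
    has_audio = audio is not None
    return text, audio if has_audio else make_virtual_audio_sample(), has_audio
```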
  • the initial set containing both audio text and non-audio text may be obtained by any suitable means.
  • They can be obtained from a conventional database, such as an existing training database in a storage device; or using an appropriate type of acquisition device, for example a microphone to capture audio data and a video device or an appropriate text input device to enter text data; or from user-entered text and/or human-annotated audio.
  • audio text and non-audio text may be stored in any suitable manner for obtaining text samples and audio samples.
  • These texts are provided with indication information of whether the associated text has audio, such as an indicator in binary form, so that the model training device can determine from the indication information whether to generate a virtual audio sample for a text sample, for example generating a virtual sample only if the indicator indicates that the text sample has no audio.
  • Training a punctuation model for speech recognition punctuation recovery may include generating a multimodal mixed representation based on the acquired text samples and audio samples for model training, as in step S1021 in FIG. 1B, and performing model training based on the generated multimodal mixed representation, as in step S1022 in FIG. 1B.
  • the obtained multimodal mixed representation can contain text and audio information obtained from the training sample set effectively created by utilizing audio text and non-audio text, and such multimodal mixed representation can well support model training.
  • For each pair of text and audio samples, a corresponding multimodal mixed representation is obtained, thereby obtaining multimodal mixed representations for all text samples and audio samples.
  • a multimodal hybrid representation may be obtained by performing processing based on text samples and audio samples separately, and combining the processing results based on text samples and the processing results based on audio samples.
  • a hybrid representation is generated based on the sum of processing results of text samples and audio samples. It should be noted that other appropriate processing can also be performed on the processing results of text samples and audio samples to generate a mixed representation, such as weighted sum or other appropriate mathematical operations, etc., which will not be described in detail here.
  • The processing performed on text samples and audio samples may be attention-based, so that the multimodal mixed representation is obtained by performing attention-based processing on the acquired text samples and audio samples separately and combining the two sets of processing results.
  • the attention-based processing applied to the text samples and audio samples may be any suitable processing, preferably they are different from each other.
  • text samples are processed using a self-attention mechanism.
  • audio samples are processed using a cross-attention mechanism.
  • the self-attention mechanism and the cross-attention mechanism can adopt various appropriate architectures/algorithms/models, etc., which will not be described in detail here.
  • a multimodal hybrid representation can be generated by combining the self-attention processing results of text samples and the cross-attention processing results of audio samples.
  • text information related to text samples is also considered when performing attention-based processing on audio samples used for model training. That is to say, in the processing of the audio sample, the text information related to the text sample corresponding to the audio sample will also be used as an input parameter for processing.
  • the text information related to the text sample may be a feature converted/extracted from the text sample or a processing result obtained by performing attention-based processing on the text sample, such as self-attention processing.
  • Training the model for speech recognition punctuation recovery based on the text samples and corresponding audio samples used for model training may further include converting the text samples and audio samples, and performing punctuation model training based on the conversion results.
  • text samples and audio samples may be converted separately to obtain intermediate feature values/sequences, and then a multimodal hybrid representation may be generated based on the intermediate feature values/sequences converted from text samples and audio samples for use in Punctuation model training.
  • Attention-based processing can be performed on the intermediate feature values/sequences converted from text samples and audio samples to generate a multimodal mixed representation, and punctuation model training is performed based on that mixed representation, as shown in FIG. 1C.
  • the attention-based processing performed on intermediate feature values/sequences can be performed as previously described and will not be described in detail here.
  • the intermediate feature values/sequences may be in any suitable form and may be obtained through corresponding operations.
  • A text sample can be converted into a sequence of lexical embeddings as its intermediate feature value/sequence.
  • text samples may be subjected to lexical encoding processing to obtain lexical embedding sequences.
  • The lexical encoding processing may operate on sequences of word embeddings.
  • The lexical embedding sequence, or its result after self-attention processing, can also serve as the text information related to the text sample that is applied to the attention-based processing of the audio sample, for example as an input to the cross-attention processing of the audio sample.
  • audio samples may be processed to obtain acoustic embedding content as their intermediate feature values/sequences.
  • the processing of audio samples may be implemented in various suitable ways.
  • the processing of the audio samples may include encoding and/or downsampling of the audio samples.
  • Corresponding processing may be performed on the audio sample depending on whether the text sample has audio. For example, encoding and/or downsampling may be performed on audio samples derived from audio text to generate acoustic embedded content, while virtual samples corresponding to non-audio text may be used directly as acoustic embedded content.
  • Indication information about whether a text sample has audio can be input into the model training device together with the text sample and the audio sample, so that the device can, according to the indication information, process the audio sample corresponding to the text sample accordingly for model training.
  • Attention-based processing, such as cross-attention processing, may be performed on the acoustic embedded content.
  • training a model for punctuation recovery in speech recognition may include learning/predicting punctuation marks based on the acquired text samples and audio samples for model training, thereby performing punctuation model training.
  • punctuation marks can be learned/predicted based on the multimodal mixed representation described above, so as to perform punctuation model training.
  • a classifier may be employed for punctuation learning/prediction.
  • the classifier can be a classifier of various appropriate forms, such as a linear classifier (Linear+Softmax), and of course can also be any other appropriate form, which will not be described in detail here.
  • The scheme of the present disclosure can train on both single text data and paired text-audio data, solving the problem of insufficient paired data: for single text data without corresponding audio content, virtual samples of fictitious audio serve as its audio data, so that large amounts of plain text data can be used for training while audio data is effectively used to enhance the training effect.
  • the text and audio samples used to train the punctuation model may include multilingual text and audio samples.
  • multilingual text and audio samples used for model training can be obtained as described above.
  • For each text sample, a corresponding audio sample can be obtained, which may be either a sample of the text content in audio form or a virtual sample.
  • equalization may be performed on the multilingual text and audio samples used for model training, so as to further optimize the samples for multilingual model training.
  • Text samples and/or audio samples of languages that account for a small proportion of the data, such as hard-to-obtain, minority, or low-resource language samples, can undergo expansion processing to balance the per-language proportions of text and/or audio used for model training, in particular by increasing the proportion of text samples and/or audio samples from under-represented languages.
  • data expansion can be performed in various appropriate ways, for example, operations such as repetition, interpolation, and the like can be performed on text samples and/or audio samples of languages with a low proportion.
  • multilingual text samples and/or audio samples may also be processed for equalization by performing temperature sampling or similar sampling algorithms.
  • Due to the scarcity of data for low-resource languages, temperature sampling can be used, according to the amount of data in each language, to increase the proportion of low-resource language data and reduce the proportion of high-resource language data. This makes the per-language proportions in the overall multilingual data as balanced as possible, which can further optimize the model training effect; for example, the trained model improves punctuation recovery across languages.
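  • As a hedged illustration: temperature sampling is commonly implemented by drawing language l with probability proportional to (n_l / N)^(1/T) for T > 1, which flattens the distribution toward uniform. The disclosure does not spell out the exact exponent, so the following sketch is an assumption:

```python
# Temperature sampling over multilingual data: each language l with n_l
# examples is sampled with probability proportional to (n_l / N) ** (1 / T).
# T > 1 flattens the distribution, raising the share of low-resource languages.
def temperature_sampling_weights(counts: dict, T: float = 5.0) -> dict:
    total = sum(counts.values())
    scaled = {lang: (n / total) ** (1.0 / T) for lang, n in counts.items()}
    z = sum(scaled.values())
    return {lang: w / z for lang, w in scaled.items()}

# Example: English dominates the raw corpus; T = 5 lifts the low-resource share.
print(temperature_sampling_weights({"en": 1_000_000, "fr": 100_000, "sw": 10_000}))
```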
  • the scheme of the present disclosure receives text input and optionally audio input to perform the task of punctuation recovery to add punctuation to text.
  • the text input and optional audio input here may be the aforementioned text samples and audio samples used for training, or the aforementioned multilingual text samples and audio samples.
  • The disclosed scheme mainly includes three stages: the first stage processes the text input and the audio input separately; the second stage performs attention-based processing on the processed text and audio inputs so as to integrate the audio information into the text representation and obtain a multimodal mixed representation; and the third stage performs punctuation prediction based on the multimodal mixed representation for model training.
  • The disclosed training process uses classic stochastic gradient descent with the Adam optimizer to minimize a cross-entropy loss.
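  • A minimal sketch of such a training step, assuming a hypothetical `model` (not defined by the source) that maps text/audio batches to per-token logits:

```python
import torch
import torch.nn as nn

# `model` is a hypothetical module mapping (text_batch, audio_batch) to
# per-token logits of shape (batch, seq_len, n_classes).
criterion = nn.CrossEntropyLoss()

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               text_batch, audio_batch, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(text_batch, audio_batch)           # (batch, seq_len, n_classes)
    loss = criterion(logits.transpose(1, 2), labels)  # CE expects (batch, n_classes, seq_len)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam, as stated above
```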
  • Punctuation recovery is often modeled as a sequence labeling task.
  • x is an unpunctuated sentence of length T and a is the corresponding speech audio.
  • The output of the model should be the predicted punctuation sequence y given x and a. Owing to the nature of the sequence labeling task, the punctuation sequence y has the same length as the unpunctuated sentence x, i.e., |y| = |x| = T.
  • The punctuation marks may be any suitable punctuation marks available for the text, for example of four types: comma (,), period (.), question mark (?), and no punctuation.
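  • Purely for illustration, the sequence labeling formulation with these four classes can be pictured as follows (the example sentence and labels are invented, not from the source):

```python
# Each token is labeled with the punctuation mark that follows it (or NONE),
# so the label sequence has the same length as the input sequence.
LABELS = ["NONE", "COMMA", "PERIOD", "QUESTION"]

x = ["how", "are", "you", "i", "am", "fine"]                # unpunctuated, length T
y = ["NONE", "NONE", "QUESTION", "NONE", "NONE", "PERIOD"]  # |y| == |x| == T
assert len(x) == len(y)
```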
  • Text input can be encoded in the first stage.
  • The encoding process here can be performed in various suitable ways; in particular, unpunctuated text sentences can be split into sequences of subwords and converted into sequences of lexical embeddings.
  • A lexical encoder can be built using a pre-trained natural language processing (NLP) model as a backbone, and the lexical encoder's model can be fine-tuned on task-specific data.
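  • A sketch of such a lexical encoder, assuming a HuggingFace-style pretrained backbone; the disclosure names no specific model, so multilingual BERT here is purely an illustrative choice:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT is an illustrative backbone, not named by the source.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
backbone = AutoModel.from_pretrained("bert-base-multilingual-cased")

def lexical_embeddings(sentence: str) -> torch.Tensor:
    """Split an unpunctuated sentence into subwords and return H_l of shape (seq_len, hidden)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # frozen here; in training the backbone would be fine-tuned
        out = backbone(**inputs)
    return out.last_hidden_state.squeeze(0)  # the lexical embedding sequence H_l
```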
  • the encoding process of the text may be performed by a text encoder, which may be included in the solutions according to the embodiments of the present disclosure.
  • The audio input can be processed; in particular, it can be converted into an acoustic embedding H_a or a virtual embedding H_i depending on the type of the audio input.
  • audio processing may be performed by various suitable acoustic processing components, which may be included in solutions according to embodiments of the present disclosure.
  • Audio input that is an audio representation of the content of the text input, such as audio paired with its transcription, can be converted into acoustic features by a pre-trained acoustic model/acoustic feature extractor.
  • The acoustic feature extractor can first be pre-trained by self-supervised learning on unlabeled audio datasets, and can be fine-tuned for the downstream punctuation recovery task.
  • A downsampling network can then be applied to shorten the length of the extracted acoustic features, yielding the acoustic embedding content.
  • The goal is to make the length of the acoustic embedding close or equal to that of the sentence embedding, so that the model can better align cross-modal information.
  • a multi-layer convolutional network can be chosen as the core component of the downsampling network.
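  • A minimal sketch of such a downsampling network as a stack of strided 1-D convolutions; the kernel sizes, strides, and dimensions are assumptions, not values from the source:

```python
import torch
import torch.nn as nn

# Two stride-2 convolutions shorten the acoustic sequence roughly 4x, so its
# length approaches that of the subword sequence.
class Downsampler(nn.Module):
    def __init__(self, in_dim: int = 768, out_dim: int = 768, n_layers: int = 2):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Conv1d(dim, out_dim, kernel_size=5, stride=2, padding=2), nn.GELU()]
            dim = out_dim
        self.net = nn.Sequential(*layers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) -> (batch, ~time/4, dim)
        return self.net(feats.transpose(1, 2)).transpose(1, 2)
```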
  • The virtual embedding H_i can be a fixed-length array of learnable parameters that is expected to learn a representation for missing audio. That is, the virtual audio sample corresponding to non-audio text, as mentioned above, can be a predetermined vector sequence, such as an all-zero sequence, from which a virtual embedding is obtained, either used directly or shortened so that the virtual embedding's length is close or equal to that of the sentence embedding.
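  • A sketch of the virtual embedding as a fixed-length array of learnable parameters; the length, dimension, and zero initialization are assumptions:

```python
import torch
import torch.nn as nn

# The virtual embedding H_i: learnable parameters standing in for missing
# audio, broadcast over the batch.
class VirtualEmbedding(nn.Module):
    def __init__(self, length: int = 4, dim: int = 768):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(length, dim))  # zero-initialized, then learned

    def forward(self, batch_size: int) -> torch.Tensor:
        return self.weight.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, length, dim)
```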
  • the acoustic and lexical features can be combined to generate a multimodal hybrid representation.
  • This operation can be performed by a coordination bootstrapper, which can jointly train on non-audio text and audio text to overcome the modality-missing problem.
  • The coordination bootstrapper jointly exploits acoustic and lexical features and applies attention-based operations to learn a hybrid representation of the two modalities.
  • Self-attention is first applied to the lexical embedding sequence H_l to capture the long-range dependencies S_l in the unpunctuated sentence, and cross-attention is applied between the lexical embedding sequence H_l and the acoustic embedding sequence H_a to form a cross-modal representation S_a: S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), where H_l provides the queries and H_a the keys and values.
  • The hybrid representation H_h is obtained by adding the attention-processed representations to the residual connection: H_h = H_l + S_l + S_a.
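  • A hedged PyTorch sketch of one such layer implementing S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), and H_h = H_l + S_l + S_a; the head count and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CoordinationBootstrapper(nn.Module):
    """One layer: S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), H_h = H_l + S_l + S_a."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_l: torch.Tensor, h_a: torch.Tensor) -> torch.Tensor:
        s_l, _ = self.self_attn(h_l, h_l, h_l)   # long-range dependencies in the text
        s_a, _ = self.cross_attn(h_l, h_a, h_a)  # lexical queries attend to acoustic keys/values
        return h_l + s_l + s_a                   # hybrid representation H_h (residual sum)
```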
  • Coordination bootstrappers can be stacked in multiple layers to further increase model capacity.
  • The output classifier layer consists of a linear projection and a softmax activation function; H_h is input to the classifier layer to predict the punctuation sequence.
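  • For illustration, such a classifier layer might be sketched as follows; the hidden size and class count are assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(768, 4), nn.Softmax(dim=-1))  # linear projection + softmax

h_h = torch.randn(1, 10, 768)   # a dummy hybrid representation H_h (batch, seq_len, hidden)
y_prob = classifier(h_h)        # (1, 10, 4) per-token class probabilities
y_pred = y_prob.argmax(dim=-1)  # predicted punctuation sequence
```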
  • the model can receive mixed data in the same training batch, and the trained model is able to punctuate both audio text and non-audio text.
  • The aforementioned text encoder, acoustic processing component, and coordination bootstrapper can all be included as submodules in the model training device according to embodiments of the present disclosure, i.e., the UniPunc framework.
  • The UniPunc framework according to the present disclosure provides a general framework for addressing missing modalities in multimodal punctuation recovery tasks.
  • the solution according to the present disclosure can also be effectively applied to some current punctuation models, and can be used as a useful supplement thereto.
  • The acoustic processing component and coordination bootstrapper according to the present disclosure can be applied/added to current punctuation models so that they are improved to handle modality-missing samples.
  • model training is also applicable to multilingual scenarios.
  • the model can be trained using data and audio in different languages at the same time.
  • Data in a plurality of different languages, including text data and/or audio data, is used as input to the processing at each stage of the disclosed solution; for example, audio in each language is fed to the acoustic component, while texts in different languages can share the lexical encoder, enabling a performance-enhanced model suitable for multilingual scenarios.
  • Equalization processing can be performed on the multilingual input, and the above-mentioned processing, for example the three stages described above, can then be performed on the equalized text and audio of each language. It will not be described in detail here.
  • Punctuation models according to some embodiments of the present disclosure may be used in various suitable punctuation recovery applications.
  • The punctuation model according to some embodiments of the present disclosure has good universality and can be applied to any speech recognition system, for example by further processing the system's text output to optimize its punctuation recovery.
  • punctuation can be performed on unpunctuated text, or validation can be performed on punctuated text for further correction.
  • FIG. 2 shows a flowchart of a method for punctuation recovery according to some embodiments of the present disclosure.
  • The text to which punctuation is to be added is obtained, such as the text output of speech recognition, and the punctuation model trained according to the model training method of the present disclosure is applied to the obtained text output to recover the punctuation in it.
  • punctuation marks can be appropriately added to the text, thereby realizing accurate punctuation recovery or punctuation verification/correction.
  • Input text may be entered into the model along with associated audio so that the input text can be appropriately punctuated based on both. If the input to the model is text only, meaning the text has no audio, a virtual sample can be generated and fed into the model along with the text, and punctuation is added to the input text on that basis.
  • The audio information of the text input into the punctuation model can be obtained, for example, according to an indicator of whether the text input has audio. Where the indicator indicates that the text has audio, the punctuation model can obtain the text's audio, for example input together with the text or fetched from a predetermined storage location; where the indicator indicates that the text has no audio, such as a single text, a virtual audio sample can be generated and input into the punctuation model. The punctuation model then predicts punctuation marks, from which punctuation-recovered text is obtained.
  • This process can be performed in a similar manner as the model training process described above, such as the above three stages of processing, which will not be described in detail here.
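  • Tying the stages together, an end-to-end inference sketch might look as follows; all component names are hypothetical stand-ins for the modules described above, not names from the source:

```python
import torch

# Hypothetical components: lexical encoder, acoustic processing, virtual
# embedding, coordination bootstrapper, and classifier, as described above.
def punctuate(text, audio, lexical_encoder, acoustic_encoder,
              virtual_embedding, bootstrapper, classifier) -> torch.Tensor:
    h_l = lexical_encoder(text)               # stage 1: lexical embedding sequence H_l
    if audio is not None:
        h_a = acoustic_encoder(audio)         # real audio -> acoustic embedding H_a
    else:
        h_a = virtual_embedding(h_l.size(0))  # no audio -> virtual embedding H_i
    h_h = bootstrapper(h_l, h_a)              # stage 2: multimodal mixed representation
    return classifier(h_h).argmax(dim=-1)     # stage 3: predicted punctuation labels
```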
  • FIG. 4A shows a punctuation model training device 400 according to some embodiments of the present disclosure.
  • the device 400 includes a sample acquisition unit 401 configured to acquire text samples and corresponding audio samples for model training, wherein for text samples obtained from text without audio, the corresponding audio samples are virtual samples; and a training unit 402, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • the training unit 402 is further configured to perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multi-modal mixed representation, whereby the training unit 402 can be based on The generated multimodal hybrid representations are used for model training.
  • The above-mentioned multimodal mixed representation can be generated by a mixed representation generation unit 4021. Although not shown, in an exemplary implementation model training can be performed by a component for punctuation prediction, such as a classifier or a similar model training component, which in this case may be included in the training unit.
  • The mixed representation generation unit 4021 may further include a text conversion unit 4022 configured to convert text samples into a lexical embedding sequence; an audio conversion unit 4023 configured to convert audio samples into an acoustic embedding sequence; and a joint processing unit 4024 configured to perform attention-based operations on the lexical embedding sequence and the acoustic embedding sequence respectively, and to combine their processing results to generate the multimodal mixed representation.
  • The text conversion unit and audio conversion unit may alternatively be located outside the mixed representation generation unit, or even outside the training unit; that is, text conversion and audio conversion may be performed as preprocessing of samples, so that the inputs actually used for model training are the converted samples.
  • The text conversion unit 4022 and the audio conversion unit 4023 may at least correspond to the text encoder and the acoustic processing component described above, respectively, and the joint processing unit 4024 may correspond to the coordination bootstrapper described above. The mixed representation generation unit 4021 may at least include or correspond to the coordination bootstrapper, and may also include the lexical encoder and the acoustic processing component. The training unit 402 may at least include or correspond to modules/devices for all processing stages shown in FIG. 3, such as the lexical encoder, acoustic processing component, coordination bootstrapper, and classifier described above.
  • FIG. 4B shows a punctuation recovery device 410 according to some embodiments of the present disclosure.
  • Apparatus 410 includes an acquisition unit 411 configured to acquire the text output of speech recognition, and a punctuation recovery unit 412 configured to apply the punctuation model trained according to the training method of an embodiment of the present disclosure to the acquired text output to recover the punctuation in it.
  • The punctuation recovery unit here can perform operations/processing similar to the above-mentioned training unit; for example, it can include the above-mentioned text encoder, acoustic processing component, coordination bootstrapper, classifier, and so on.
  • each of the above units may be implemented as an independent physical entity, or may also be implemented by a single entity (for example, a processor (CPU or DSP, etc.), an integrated circuit, etc.).
  • the above-mentioned units are shown with dotted lines in the drawings to indicate that these units may not actually exist, and the operations/functions realized by them may be realized by the processing circuit itself.
  • The device may further include a memory that can store various information generated in operation by the device and its units, programs and data for operations, data to be transmitted by a communication unit, and the like.
  • the memory can be volatile memory and/or non-volatile memory.
  • memory may include, but is not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read only memory (ROM), flash memory.
  • the device may also include a communication unit, which may be used to communicate with other devices.
  • the communication unit may be implemented in an appropriate manner known in the art, for example including communication components such as antenna arrays and/or radio frequency links, various types of interfaces, communication units and the like. It will not be described in detail here.
  • the device may also include other components not shown, such as a radio frequency link, a baseband processing unit, a network interface, a processor, a controller, and the like. It will not be described in detail here.
  • This disclosure proposes the training and application of an improved punctuation model for punctuation recovery, which can use both single text data and paired text-audio data for training, thereby solving the problem of insufficient paired text and audio data. Predefined vectors are used instead of audio as input when there is no audio, and available audio data can assist training, further improving the training effect and the accuracy of the resulting model.
  • The present disclosure can also utilize multilingual data for collaborative training, in particular by using data from high-resource languages to enhance the model's performance on low-resource languages, thereby obtaining a further optimized punctuation model.
  • plain text data, text and audio aligned data, and data in different languages can be trained together, which improves the performance of the model on plain text data, text and audio aligned data, and minority language data.
  • such a model can be suitable for various application tasks, especially for speech recognition tasks, and achieve better punctuation recovery effects.
  • the UniPunc according to the scheme of the present disclosure is tested to compare the performance with the multimodal model in the related art.
  • The tests show that UniPunc according to embodiments of the present disclosure outperforms related-art multimodal models on an English audio set. Moreover, its performance on a mixed English set is better than on the English audio set alone, which shows that UniPunc still outperforms even when sentences without audio are included; that is, the performance of the punctuation model according to the present disclosure can be further improved by adopting non-audio sentences.
  • the UniPunc according to the present disclosure is tested to compare performance with the single-modal model in the related art.
  • The tests show that UniPunc according to embodiments of the present disclosure can effectively obtain multimodal mixed representations and effectively represent the acoustic features in speech, which significantly improves the punctuation recovery of text.
  • UniPunc has better punctuation performance than other baselines on multilingual punctuation, which shows that UniPunc has better robustness and generalization ability.
  • the UniPunc of the present disclosure can be well adapted to any existing punctuation recovery method.
  • By integrating the effective modules of the disclosed UniPunc framework, especially the acoustic auxiliary module and/or the coordination bootstrapper, into existing punctuation recovery schemes, the performance of those schemes can be further optimized.
  • The UniPunc scheme of the present disclosure, especially modules such as the acoustic auxiliary module and/or the coordination bootstrapper, is generally applicable to the modality-missing problem in punctuation recovery and enables previously single-modal models to handle multimodal corpora, further improving overall performance.
  • FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
  • The electronic device 5 can be various types of devices, such as, but not limited to, mobile terminals like mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the electronic device 5 may include a display panel for displaying data and/or execution results utilized in the solutions according to the present disclosure.
  • the display panel can be in various shapes, such as a rectangular panel, an oval panel, or a polygonal panel, and the like.
  • the display panel can be not only a flat panel, but also a curved panel, or even a spherical panel.
  • the electronic device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51 .
  • The components of the electronic device 5 shown in FIG. 5 are exemplary rather than limiting, and the electronic device 5 may have other components according to actual application requirements.
  • Processor 52 may control other components in electronic device 5 to perform desired functions.
  • memory 51 is used to store one or more computer readable instructions.
  • processor 52 is used to execute computer-readable instructions, the computer-readable instructions are executed by the processor 52 to implement the method according to any of the above-mentioned embodiments.
  • For the specific implementation and related explanation of each step of the method, reference may be made to the above-mentioned embodiments; repeated descriptions are omitted here.
  • processor 52 and the memory 51 may communicate with each other directly or indirectly.
  • processor 52 and memory 51 may communicate via a network.
  • a network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network.
  • the processor 52 and the memory 51 may also communicate with each other through the system bus, which is not limited in the present disclosure.
  • The processor 52 can be embodied as various suitable processors or processing devices, such as a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
  • the central processing unit (CPU) may be an X86 or ARM architecture or the like.
  • memory 51 may include any combination of various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the memory 51 may include, for example, a system memory, and the system memory stores, for example, an operating system, an application program, a boot loader (Boot Loader), a database, and other programs. Various application programs, various data, and the like can also be stored in the storage medium.
  • FIG. 6 is a block diagram illustrating an example structure of a computer system employable in some embodiments of the present disclosure.
  • a central processing unit (CPU) 601 executes various processes according to programs stored in a read only memory (ROM) 602 or loaded from a storage section 608 to a random access memory (RAM) 603 .
  • the central processing unit is only exemplary, and it may also be other types of processors, such as the various processors mentioned above.
  • ROM 602, RAM 603 and storage portion 608 may be various forms of computer-readable storage media, as described below. It should be noted that although ROM 602, RAM 603 and storage device 608 are shown separately in FIG. 6, one or more of them may be combined or located in the same or different memory or storage modules.
  • the CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604.
  • the input/output interface 605 is also connected to the bus 604 .
  • The following components are connected to the input/output interface 605: an input section 606, such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; an output section 607, including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage section 608, including a hard disk, a magnetic tape, etc.; and a communication section 609, including a network interface card such as a LAN card, a modem, and the like.
  • The communication section 609 allows communication processing to be performed via a network such as the Internet. Although FIG. 6 shows each device or module in the electronic device 600 communicating through the bus 604, they may also communicate through a network or other means, where the network may include a wireless network, a wired network, and/or any combination thereof.
  • a driver 610 is also connected to the input/output interface 605 as needed.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
  • the programs constituting the software can be installed from a network such as the Internet or a storage medium such as the removable medium 611 .
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing a method according to some embodiments of the present disclosure.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • the computer program is executed by the CPU 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • A computer-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
  • a computer-readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein.
  • Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • a computer program including: instructions, which when executed by a processor cause the processor to execute the method of any one of the above embodiments.
  • instructions may be embodied as computer program code.
  • The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • modules, components or units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of a module, component or unit does not constitute a limitation on the module, component or unit itself under certain circumstances.
  • Exemplary hardware logic components include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
  • a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples for model training.
  • the text samples and corresponding audio samples used for model training may be obtained from both audio text and non-audio text, where pairs of text samples and audio samples are obtained from audio text , and from the audio-free text, a text sample is obtained and a dummy sample is obtained as an audio sample corresponding to the text sample.
  • A virtual sample is a preset sample representing fictitious audio for a text sample.
  • training a model for punctuation recovery in speech recognition may include: generating a mixed representation based on the acquired text samples and audio samples used for model training, and performing model training based on the mixed representation.
  • the hybrid representation may be a multimodal hybrid representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training.
  • attention-based processing may be performed on audio samples for model training based on textual information associated with the text samples.
  • the text information related to the text sample may be a vocabulary feature obtained by converting the text sample or a processing result obtained by performing attention-based processing on the text sample.
  • Text samples are converted into a sequence of lexical embeddings, audio samples are converted into a sequence of acoustic embeddings, and attention-based operations are performed on the lexical and acoustic embedding sequences respectively.
  • The acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; where the audio sample is a virtual sample, the virtual sample is used as the acoustic embedding sequence.
  • The attention-based processing performed on text samples used for model training may be self-attention-based processing, and the attention-based processing performed on audio samples used for model training may be cross-attention-based processing.
  • the acquired text samples and corresponding audio samples for model training may include multilingual text samples and audio samples.
  • multilingual text samples and audio samples may be equalized to increase the proportion of low-resource language samples.
  • a method for recovering punctuation in speech recognition, including the following steps: obtaining the text output of speech recognition, and applying, to the obtained text output, the punctuation model trained by the model training method described in any embodiment of the present disclosure to recover the punctuation in the text output.
  • a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein for text samples obtained from non-audio text, the corresponding audio samples are virtual samples; and a training unit configured to train a model for speech recognition punctuation recovery based on the acquired text samples and audio samples for model training.
  • the training unit may be further configured to: perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multimodal hybrid representation.
  • the apparatus may further comprise: a text conversion unit configured to convert text samples into a vocabulary embedding sequence, an audio conversion unit configured to convert audio samples into an acoustic embedding sequence, and the training unit are configured to perform attention-based operations on sequences of lexical embeddings and sequences of acoustic embeddings, respectively.
  • a device for recovering punctuation in speech recognition, including: an acquisition unit configured to acquire the text output of speech recognition, and a punctuation recovery unit configured to apply, to the acquired text output, the punctuation model trained according to the model training method described in any embodiment of the present disclosure to recover the punctuation in the text output.
  • an electronic device including: a memory; and a processor coupled to the memory, the memory storing instructions, the instructions, when executed by the processor, Making the electronic device execute the method of any embodiment described in the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of any embodiment described in the present disclosure.
  • a computer program including: instructions, which when executed by a processor cause the processor to perform the method of any embodiment described in the present disclosure.
  • a computer program product comprising instructions which, when executed by a processor, implement the method of any one of the embodiments described in the present disclosure.
  • a computer program, wherein program code included in the computer program, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.

Abstract

A method for punctuation recovery in speech recognition, an electronic device (5), and a storage medium. A method for training a model for punctuation recovery in speech recognition comprises the following steps: acquiring a text sample for model training and a corresponding audio sample, wherein, for a text sample that is obtained from audio-free text, the corresponding audio sample is a virtual sample (S101); and training a model for punctuation recovery in speech recognition on the basis of the acquired text sample for model training and the corresponding audio sample (S102).

Description

Method, Device and Storage Medium for Punctuation Recovery in Speech Recognition
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese Application No. 202111335102.5, filed on November 11, 2021, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to speech recognition, including punctuation recovery in speech recognition.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. It has a wide range of applications and can serve as an upstream component for multiple tasks, such as voice assistants and speech translation. Existing commercial speech recognition systems often output text without punctuation, which may lead to misunderstandings and degrade the performance of downstream tasks such as machine translation and information extraction. Specifically, on the one hand, text without punctuation is difficult to read: it has poor readability, unclear sentence breaks, and ambiguity. On the other hand, downstream tasks such as machine translation and information extraction assume that the input is punctuated, so text without punctuation can lead to performance degradation in those tasks.
Summary
This Summary is provided to introduce, in a simplified form, concepts that are described in detail later in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
According to some embodiments of the present disclosure, there is provided a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
According to some other embodiments of the present disclosure, there is provided a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and a training unit configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, there is provided an electronic device, including: a memory; and a processor coupled to the memory, the processor being configured to execute the method of any embodiment described in the present disclosure based on instructions stored in the memory.
According to some embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, causing the method of any embodiment described in the present disclosure to be implemented.
According to some embodiments of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, cause the method of any embodiment described in the present disclosure to be implemented.
According to some embodiments of the present disclosure, there is provided a computer program, wherein program code included in the computer program, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
Other features, aspects, and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief Description of the Drawings
Preferred embodiments of the present disclosure are described below with reference to the accompanying drawings. The drawings described herein are provided for a further understanding of the present disclosure; they are incorporated in and form a part of this specification together with the following detailed description and serve to explain the present disclosure. It should be understood that the drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure. In the drawings:
FIGS. 1A to 1C show flowcharts of a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
FIG. 2 shows a flowchart of a method for recovering punctuation in speech recognition according to some embodiments of the present disclosure.
FIG. 3 shows an exemplary implementation of model training for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
FIG. 4A shows a block diagram of a training device for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure, and FIG. 4B shows a block diagram of a punctuation recovery device according to some embodiments of the present disclosure.
FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
FIG. 6 shows a block diagram of other embodiments of an electronic device of the present disclosure.
It should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not necessarily drawn to scale. The same or similar reference numerals are used in the drawings to denote the same or similar components. Therefore, once an item is defined in one drawing, it may not be discussed further in subsequent drawings.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The following descriptions of the embodiments are merely illustrative and in no way limit the present disclosure or its application or use. It should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, method implementations may include additional steps and/or omit the illustrated steps. The scope of the present disclosure is not limited in this regard. Unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments should be construed as merely exemplary and not limiting the scope of the present disclosure.
The term "including" and its variants as used in the present disclosure are open terms meaning that at least the following elements/features are included, without excluding other elements/features, i.e., "including but not limited to". In addition, the term "comprising" and its variants as used in the present disclosure are likewise open terms meaning that at least the following elements/features are comprised, without excluding other elements/features, i.e., "comprising but not limited to". Thus, "including" is synonymous with "comprising". The term "based on" means "based at least in part on".
Reference throughout this specification to "one embodiment", "some embodiments", or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. For example, the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; and the term "some embodiments" means "at least some embodiments". Moreover, appearances of the phrases "in one embodiment", "in some embodiments", or "in an embodiment" in various places throughout the specification do not necessarily all refer to the same embodiment, although they may.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or their interdependence. Unless otherwise specified, the concepts "first", "second", and the like are not intended to imply that the objects so described must be in a given order in time, space, ranking, or any other way.
It should be noted that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the data, messages, or information exchanged in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of such data, messages, or information.
In order to solve the problem of unpunctuated text output by speech recognition, a punctuation recovery task has been proposed and corresponding punctuation models have been designed. The punctuation recovery task aims to correctly add punctuation marks to the output of an automatic speech recognition system, such as an output text sentence, and in implementation performs punctuation recovery by employing a related punctuation model. Conventional punctuation models either use only text or require the audio corresponding to the text, and they are often limited by real-world scenarios.
On the one hand, conventional punctuation models perform punctuation recovery based only on textual information. However, text information alone can be ambiguous, which easily causes problems. A sentence may take on different meanings when different punctuation marks are added. For example, the meaning of "I don't want anymore kids" is obviously different from that of "I don't want anymore, kids", which shows the importance of the comma; inappropriate insertion can lead to obvious ambiguity or errors. In addition, since the speaker's tone is unknown, the model may have difficulty determining whether a sentence should end with a period or a question mark.
On the other hand, considering the rich information in speech audio, such as pauses and intonation, the sound signal can help reduce the ambiguity faced by punctuation models. Some multimodal punctuation models have therefore been proposed, which extract acoustic features from speech audio and fuse acoustic and lexical features by addition/concatenation to facilitate the addition of punctuation marks. Here, the modalities may relate to text and audio, respectively. However, conventional multimodal methods face the problem of missing modalities in practical applications. First, the corresponding audio is sometimes inaccessible due to storage limitations or privacy policies, and previous multimodal approaches cannot perform punctuation recovery on such audio-free text sentences. Second, manually annotated text-audio pairs are very costly and difficult to obtain, which makes the training sets of these multimodal models scarcer and makes it difficult to obtain a good multimodal model.
In view of this, the present disclosure proposes an improved solution that can use both audio text and audio-free text for punctuation model training. In particular, noting that audio-free text is easy to obtain, the solution of the present disclosure can make full use of audio-free text to construct a large training set for punctuation model training and provide further optimized training with the help of limited audio, thereby improving the accuracy of the punctuation model and providing a punctuation model capable of using optional audio to enhance the punctuation recovery effect on text.
In particular, the present disclosure proposes a unified multimodal punctuation framework, which may be called UniPunc, capable of using both audio text and audio-free text for punctuation model training. Specifically, for audio text, a corresponding text-audio pair can be obtained, while for audio-free text, the corresponding audio can be fabricated by constructing virtual content, so that a corresponding text-audio pair is likewise constructed. In this way, punctuation model training can be performed based on text-audio pairs obtained from both audio text and audio-free text. A large amount of readily available audio-free text/corpora can thus be used as the training set, and sound input can be used to reduce ambiguity in punctuation marks, so that the resulting punctuation model has improved accuracy.
In addition, the solution of the present disclosure can further improve the training and application of punctuation models in multilingual scenarios. Conventional multimodal methods can only handle monolingual input, while data in some languages is difficult to obtain and very scarce. For example, for many low-resource languages it is difficult to obtain large amounts of high-quality text and audio data, which cannot meet the needs of model training, and training directly on such data leads to very poor model performance. In view of this, the present disclosure uses a multilingual co-training method to solve the problem of insufficient data for low-resource languages. In particular, the present disclosure can use audio and text in multiple languages simultaneously for model training, so that readily available data in common languages can be used to strengthen the performance on low-resource languages, thereby obtaining an improved multilingual punctuation model.
Embodiments of the present disclosure will be described in detail below with reference to the drawings, but the present disclosure is not limited to these specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. In addition, in one or more embodiments, particular features, structures, or characteristics may be combined in any suitable manner that will be apparent to those of ordinary skill in the art from the present disclosure.
FIG. 1A shows a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure. In method 100, in step S101, text samples and corresponding audio samples for model training are acquired, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and in step S102, a model for punctuation recovery in speech recognition is trained based on the acquired text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, the text samples and corresponding audio samples used for model training substantially correspond to a set of paired text and audio samples, wherein for each text sample a corresponding audio sample is acquired, which is either the audio content corresponding to the content of the text sample or a virtual audio sample fabricating audio for the text sample.
According to some embodiments of the present disclosure, the text samples and corresponding audio samples for model training are acquired from both audio text and audio-free text, for example derived from an initial set containing both. Paired text samples and audio samples can be acquired from audio text, for example by conversion/extraction/segmentation in various appropriate ways, which will not be described in detail here. Furthermore, from audio-free text, a text sample is acquired (such a text sample may be referred to as a single text sample) and a virtual sample is acquired as the audio sample corresponding to that text sample. In this way, on the one hand, a large amount of plain text can be exploited in the absence of audio; on the other hand, acoustic features can be effectively exploited when audio is present. A training set can therefore be constructed from large quantities of readily available audio-free text/single text samples; in particular, for a large number of single text samples, text-audio sample pairs can still be constructed by using virtual samples, which helps optimize the model training effect.
In some embodiments, the virtual sample may be a preset sample for fabricating the audio of a text sample. The virtual sample may take various appropriate forms. In particular, it may be a vector with preset content, for example a vector of fixed length and fixed content, preferably an all-zero vector. Of course, it may also take other appropriate forms. In some embodiments, virtual samples may be constructed for all single text samples used for model training, and these virtual samples may be the same as or different from each other.
In some embodiments, the initial set containing both audio text and audio-free text may be obtained in any appropriate way. For example, it may come from a conventional database, such as an existing training database in a storage device; or it may be acquired with an appropriate type of acquisition device, for example a microphone for audio data, or a video device or an appropriate text input device for text data; or it may consist of user-input text and/or manually annotated audio. In particular, audio text and audio-free text may be stored in any appropriate manner for acquiring text samples and audio samples. For example, in some embodiments, each text is associated with indication information of whether it has audio, such as an indicator in binary form, so that the model training device can determine from this indication information whether to generate a virtual sample for the text sample, for example generating a virtual sample only when the indication information indicates that the text sample has no audio.
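As a purely illustrative, non-limiting sketch of such sample acquisition, the following Python fragment assumes a record layout with a binary "has_audio" indicator and an all-zero virtual sample; the field names, dimensions, and the load_features stub are hypothetical and not part of the disclosure.

```python
import numpy as np

VIRTUAL_LEN = 4    # assumed fixed length of the virtual sample
VIRTUAL_DIM = 80   # assumed acoustic feature dimension

def load_features(path):
    """Stand-in for a real acoustic front end; returns dummy features."""
    return np.random.randn(100, VIRTUAL_DIM).astype(np.float32)

# Hypothetical record layout: a binary "has_audio" indicator is stored
# alongside each text, as described above.
corpus = [
    {"text": "how are you doing today", "has_audio": True,  "audio": "clip_001.wav"},
    {"text": "i dont want anymore kids", "has_audio": False, "audio": None},
]

def get_audio_sample(record):
    """Return real audio features, or an all-zero virtual sample when
    the indicator says the text has no audio."""
    if record["has_audio"]:
        return load_features(record["audio"])
    return np.zeros((VIRTUAL_LEN, VIRTUAL_DIM), dtype=np.float32)

pairs = [(r["text"], get_audio_sample(r)) for r in corpus]
```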
According to some embodiments of the present disclosure, training the punctuation model for punctuation recovery in speech recognition may include generating a multimodal hybrid representation based on the acquired text samples and audio samples for model training, as in step S1021 in FIG. 1B, and performing model training based on the generated multimodal hybrid representation, as in step S1022 in FIG. 1B. In this way, the acquired multimodal hybrid representation can contain text and audio information obtained from a training sample set effectively created using both audio text and audio-free text, and such a multimodal hybrid representation can support model training well. In some embodiments, for each text sample and audio sample pair used for model training, a corresponding multimodal hybrid representation is obtained, whereby multimodal hybrid representations of all text samples and audio samples can be obtained.
According to some embodiments of the present disclosure, the multimodal hybrid representation may be obtained by performing processing based on the text samples and the audio samples separately and combining the text-based processing result with the audio-based processing result. In some embodiments, the hybrid representation is generated based on the sum of the processing results of the text samples and the audio samples. It should be noted that the hybrid representation may also be generated by other appropriate processing of the processing results of the text samples and the audio samples, such as a weighted sum or other appropriate mathematical operations, which will not be described in detail here.
According to some embodiments of the present disclosure, the processing performed based on the text samples and the audio samples may be attention-based processing; thus, the multimodal hybrid representation may be a multimodal hybrid representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training. In particular, attention-based processing is performed based on the text samples and the audio samples separately, and the processing results of the text samples and the audio samples are combined to obtain the multimodal hybrid representation.
According to some embodiments of the present disclosure, the attention-based processing applied to the text samples and the audio samples may be any appropriate processing, and preferably the two differ from each other. In some embodiments, a self-attention mechanism is used to process the text samples. In some embodiments, a cross-attention mechanism is used to process the audio samples. The self-attention mechanism and the cross-attention mechanism may adopt various appropriate architectures/algorithms/models, which will not be described in detail here. The multimodal hybrid representation may thus be generated by combining the self-attention processing result of the text samples with the cross-attention processing result of the audio samples.
According to some embodiments of the present disclosure, text information related to the text sample is also taken into account when performing attention-based processing on the audio samples used for model training. That is, in the processing of an audio sample, the text information related to the text sample corresponding to that audio sample is also used as an input parameter of the processing. In some embodiments, the text information related to the text sample may be features converted/extracted from the text sample, or a processing result obtained by performing attention-based processing, such as self-attention processing, on the text sample.
According to some embodiments of the present disclosure, training the model for punctuation recovery in speech recognition based on the text samples and corresponding audio samples for model training further includes performing punctuation model training based on intermediate feature values/sequences converted from the text samples and the audio samples. In some embodiments, the text samples and the audio samples may each be converted to obtain intermediate feature values/sequences, and a multimodal hybrid representation is then generated based on the intermediate feature values/sequences converted from the text samples and the audio samples, for use in punctuation model training. Preferably, attention-based processing may be performed on the intermediate feature values/sequences converted from the text samples and the audio samples to generate the multimodal hybrid representation, and punctuation model training is performed based on the multimodal hybrid representation, as shown in FIG. 1C. The attention-based processing performed on the intermediate feature values/sequences may be carried out as described above and will not be described in detail here.
In some embodiments, the intermediate feature values/sequences may take any appropriate form and may be acquired through corresponding operations.
According to some embodiments of the present disclosure, a lexical embedding sequence may be obtained from a text sample by conversion, as its intermediate feature value/sequence. In some embodiments, lexical encoding may be performed on the text sample to acquire the lexical embedding sequence. It should be noted that the encoding of text samples may be handled in various appropriate ways, for example by a pre-trained lexical encoder or by various encoding methods known in the art, which will not be described in detail here. In some embodiments, self-attention processing may be performed on the lexical embedding sequence. In some embodiments, the lexical embedding sequence, or the result of self-attention processing of the lexical embedding sequence, may also be applied, as the text information related to the text sample, to the attention-based processing of the audio sample, for example as an input parameter of the cross-attention processing of the audio sample.
According to some embodiments of the present disclosure, an audio sample may be processed to acquire acoustic embedding content as its intermediate feature value/sequence. In some embodiments, the processing of the audio sample may be implemented in various appropriate ways. In some embodiments, the processing of the audio sample may include encoding and/or downsampling the audio sample. In some embodiments, the corresponding audio sample may be processed according to whether the text sample has audio. For example, for an audio sample derived from audio text, encoding and/or downsampling may be performed on it to generate the acoustic embedding content, while a virtual sample corresponding to audio-free text may be used directly as the acoustic embedding content. In implementation, indication information about whether the text sample has audio, such as an indicator in binary form, may be input into the model training device together with the text sample and the audio sample, so that the model training device can process the audio sample corresponding to the text sample accordingly for model training. In some embodiments, attention-based processing, such as cross-attention processing, may be performed on the acoustic embedding content.
According to some embodiments of the present disclosure, training the model for punctuation recovery in speech recognition may include performing punctuation learning/prediction based on the acquired text samples and audio samples for model training, thereby carrying out punctuation model training. In some embodiments, punctuation learning/prediction may be based on the multimodal hybrid representation described above, so as to perform punctuation model training. In some embodiments, a classifier may be employed for punctuation learning/prediction. The classifier may take various appropriate forms, such as a linear classifier (Linear+Softmax), or of course any other appropriate form, which will not be described in detail here.
In this way, the solution of the present disclosure can use both single text data and paired text-audio data for training to solve the problem of insufficient paired text-audio data. In particular, for single text data without corresponding audio content, virtual samples are used to fabricate audio as its corresponding audio data for model training, so that a large amount of plain text data can be used for training, while audio data can also be effectively used to enhance the model training effect.
According to some embodiments of the present disclosure, the text and audio samples used to train the punctuation model may include multilingual text and audio samples. In some embodiments, multilingual text and audio samples for model training may be acquired as described above; in particular, for a text sample in each language, its corresponding audio sample may be acquired, which is either a sample of the text sample's content in audio form or a virtual sample. An improved punctuation model can thus be trained, which can perform punctuation recovery more accurately in multilingual scenarios.
As an example, in multilingual settings there is often the problem of insufficient data for low-resource languages. For many low-resource languages it is difficult to obtain large amounts of high-quality text data, and training directly on a small amount of low-resource data leads to very poor model performance. Multilingual simultaneous training makes it possible to use readily available data in common languages to strengthen the performance on low-resource languages. In this way, a sample library for multilingual model training can be effectively constructed; by training simultaneously with multilingual samples, a more accurate punctuation model suitable for multiple languages can be obtained.
According to some embodiments of the present disclosure, the multilingual text and audio samples used for model training may be equalized, so as to further optimize the samples for multilingual model training. In some embodiments, text samples and/or audio samples of languages with a low proportion among the text and audio samples, for example samples of languages that are not easy to obtain, minority languages, or low-resource languages, may be expanded so that the proportions of text and/or audio of each language used for model training are balanced; in particular, the proportion of text samples and/or audio samples of low-proportion languages is increased. In some embodiments, data expansion may be carried out in various appropriate ways; for example, operations such as repetition and interpolation may be performed on text samples and/or audio samples of low-proportion languages. In some embodiments, multilingual text samples and/or audio samples may also be processed by performing temperature sampling or a similar sampling algorithm to achieve equalization. As an example, given the scarcity of low-resource language data, the temperature sampling method is used to increase the proportion of low-resource language data and reduce the proportion of high-resource language data according to the data volume of each language. This makes the proportions of the various languages in the total multilingual data as balanced as possible and can further optimize the model training effect; for example, the trained model shows improved punctuation recovery for texts in all the languages.
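As a purely illustrative sketch of the temperature-sampling equalization mentioned above, the following computes per-language sampling probabilities proportional to (n_i/N)^(1/T); the language counts and the temperature value are hypothetical, not taken from the disclosure.

```python
def temperature_sampling_weights(counts, temperature=5.0):
    """Per-language sampling probabilities p_i proportional to (n_i/N)**(1/T).
    A higher temperature flattens the distribution, raising the share of
    low-resource languages relative to their raw proportions."""
    total = sum(counts.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Illustrative data volumes per language (not from the disclosure).
counts = {"en": 1_000_000, "de": 200_000, "mn": 5_000}
print(temperature_sampling_weights(counts))
# The low-resource language "mn" receives a far larger sampling share
# than its raw ~0.4% proportion of the data.
```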
An exemplary implementation of punctuation model training according to some embodiments of the present disclosure will be described in detail below with reference to FIG. 3. The solution of the present disclosure receives text input and optional audio input to perform the punctuation recovery task, i.e., to add punctuation to text. The text input and optional audio input here may be the aforementioned text samples and audio samples for training, and may also be the aforementioned multilingual text samples and audio samples.
The solution of the present disclosure mainly comprises three stages of operation: the first stage processes the text input and the audio input separately; the second stage performs attention-based processing on the processed text input and audio input, respectively, so that the audio information is fused into the text representation to obtain a multimodal hybrid representation; and the third stage performs punctuation prediction based on the multimodal hybrid representation, thereby carrying out model training. The training process of the present disclosure uses the classic SGD algorithm and the Adam optimizer to optimize a cross-entropy loss function.
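A minimal, non-limiting sketch of such a training loop is given below, assuming PyTorch; the model object, the batch layout, and the learning rate are hypothetical stand-ins for an actual implementation.

```python
import torch
import torch.nn as nn

def train(model, loader, num_epochs=1, lr=1e-4):
    """Punctuation model training with the Adam optimizer and a cross-entropy
    loss, per the three-stage description above. `model` maps a
    (tokens, audio) batch to per-token punctuation logits."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(num_epochs):
        for tokens, audio, labels in loader:   # labels: one punctuation tag per token
            logits = model(tokens, audio)      # (batch, seq_len, num_classes)
            loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```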
Punctuation recovery is usually modeled as a sequence labeling task. Typically, a multimodal punctuation corpus is a set of sentence-audio-punctuation triples, denoted as S = {x, a, y}. Here x is an unpunctuated sentence of length T, and a is the corresponding speech audio. The output of the model should be the predicted punctuation sequence y, given x and a. Owing to the nature of the sequence labeling task, the punctuation sequence y has the same length as the unpunctuated sentence x, i.e., |x| = |y|. The punctuation marks may be any appropriate punctuation marks available for text; for example, they may be of four types, namely comma (,), period (.), question mark (?), and no punctuation.
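To make the formulation concrete, one hypothetical sentence-audio-punctuation triple under the four-type scheme might look as follows (the tag names are illustrative):

```python
# One sentence-audio-punctuation triple S = {x, a, y}, with |x| == |y|.
x = ["i", "dont", "want", "anymore", "kids"]     # unpunctuated tokens
a = "utterance_042.wav"                          # corresponding speech audio
y = ["NONE", "NONE", "NONE", "COMMA", "PERIOD"]  # one punctuation tag per token
# Reading the tags back yields: "I don't want anymore, kids."
```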
In the first stage, the text input may be encoded. The encoding here may be performed in various appropriate ways; in particular, the unpunctuated text sentence may be split into a subword sequence and converted into a lexical embedding sequence H_l. As an example, a pre-trained natural language processing (NLP) model may be used as the backbone model to build the lexical encoder, and the lexical encoder model may be fine-tuned on task-specific data. In some embodiments, the encoding of text may be performed by a text encoder, which may be included in the solutions according to embodiments of the present disclosure.
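As one purely illustrative way to realize this stage, the sketch below uses the Hugging Face transformers library with a multilingual BERT checkpoint as the pre-trained backbone; the disclosure does not mandate this particular model or library.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def lexical_embeddings(sentence: str) -> torch.Tensor:
    """Split an unpunctuated sentence into subwords and return the
    lexical embedding sequence H_l, of shape (seq_len, hidden_dim)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)

H_l = lexical_embeddings("how are you doing today")
```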
In the first stage, the audio input may also be processed; in particular, the audio input may be converted into an acoustic embedding H_a or a virtual embedding H_i, depending on the type of the audio input. In some embodiments, the audio processing may be performed by various appropriate acoustic processing components, which may be included in the solutions according to embodiments of the present disclosure.
In particular, audio input that is an audio representation of the content of the text input, such as an audio-annotated transcription, may be converted into the acoustic embedding H_a. Specifically, the audio input may be converted into acoustic features by a pre-trained acoustic model/acoustic feature extractor. Typically, the acoustic feature extractor may first be pre-trained by self-supervised training on unlabeled audio datasets, and may be fine-tuned for the downstream punctuation recovery task. A downsampling network may then be applied to shorten the length of the extracted acoustic features, yielding the acoustic embedding content H_a. The goal is to make the length of the acoustic embedding close or equal to that of the sentence embedding, so that the model can better align the cross-modal information. As an example, a multi-layer convolutional network may be chosen as the core component of the downsampling network.
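A minimal sketch of such a convolutional downsampling network follows, assuming PyTorch; the kernel sizes, strides, and feature dimensions are illustrative choices rather than values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class Downsampler(nn.Module):
    """Two strided 1-D convolutions that shorten a sequence of acoustic
    features by roughly 4x, bringing its length closer to the sentence length."""
    def __init__(self, in_dim=768, out_dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, in_dim) -> H_a: (batch, frames // 4, out_dim)
        return self.conv(feats.transpose(1, 2)).transpose(1, 2)

H_a = Downsampler()(torch.randn(1, 200, 768))  # -> shape (1, 50, 768)
```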
For audio-free text, a virtual embedding H_i is used to fabricate the possibly missing acoustic features; that is, if the audio is absent (a = ∅), then H_a = H_i. As an example, the virtual embedding H_i may be set as a fixed-length array of learnable parameters, which is expected to learn a representation of the missing audio. That is to say, the virtual audio sample corresponding to audio-free text as described above may be a predetermined vector sequence, for example an all-zero sequence, and the virtual embedding may be derived from it, for example by using it directly as the virtual embedding, or by shortening its length so that the length of the virtual embedding is close or equal to that of the sentence embedding, the shortened sequence then serving as the virtual embedding.
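A non-limiting sketch of a fixed-length learnable virtual embedding is shown below, assuming PyTorch; the length and dimension are illustrative.

```python
import torch
import torch.nn as nn

class VirtualEmbedding(nn.Module):
    """Fixed-length learnable array H_i that stands in for missing audio;
    it is trained jointly with the rest of the model (all-zero init)."""
    def __init__(self, length=4, dim=768):
        super().__init__()
        self.H_i = nn.Parameter(torch.zeros(length, dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        # The same learned virtual embedding is shared by every audio-free sample.
        return self.H_i.unsqueeze(0).expand(batch_size, -1, -1)
```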
Then, based on the lexical embedding sequence and the acoustic or virtual embedding, the acoustic and lexical features may be combined to generate the multimodal hybrid representation. This operation may be performed by a coordinate bootstrapper, which can jointly train on audio-free and audio text to overcome the missing-modality problem. In particular, the coordinate bootstrapper jointly exploits acoustic and lexical features and applies attention-based operations to learn a hybrid representation of the two modalities.
Specifically, a self-attention operation is first performed on the lexical embedding sequence H_l to capture long-range dependencies S_l in the unpunctuated sentence, and a cross-attention operation is applied between the lexical embedding sequence H_l and the acoustic embedding sequence H_a to form a cross-modal representation S_a:
S_l = Att(H_l, H_l, H_l)    (1)
S_a = Att(H_a, H_a, H_l)    (2)
Here, Att(K, V, Q) = softmax(QK^T / √d_k) · V is an attention operation, where d_k is the dimension size of the model; the query is taken from the last argument, so that the outputs of equations (1) and (2) both have the sentence length. Note that for modality-missing samples, the acoustic embedding H_a is replaced with the virtual embedding H_i; in this case, if a = ∅, then S_a = Att(H_i, H_i, H_l).
Then, the hybrid representation H_h is obtained by adding the attention-processed representations and the residual connection:
H_h = S_l + S_a + H_l    (3)
In implementation, the coordinate bootstrapper can be stacked into multiple layers to further increase model capacity.
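As a purely illustrative sketch of one coordinate bootstrapper layer implementing equations (1) to (3), the following assumes PyTorch and single-head attention for brevity; a multi-head variant and stacked layers would follow the same pattern.

```python
import math
import torch
import torch.nn as nn

def att(K, V, Q):
    """Att(K, V, Q) = softmax(Q K^T / sqrt(d_k)) V; the output always
    has the query's (sentence) length."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ V

class CoordinateBootstrapper(nn.Module):
    """One layer combining lexical and acoustic features per eqs. (1)-(3)."""
    def forward(self, H_l, H_a):
        S_l = att(H_l, H_l, H_l)   # (1) self-attention over the sentence
        S_a = att(H_a, H_a, H_l)   # (2) cross-attention: text queries audio
        return S_l + S_a + H_l     # (3) hybrid representation H_h

H_l = torch.randn(1, 12, 768)      # lexical embeddings (T = 12)
H_a = torch.randn(1, 50, 768)      # acoustic (or virtual) embeddings
H_h = CoordinateBootstrapper()(H_l, H_a)   # -> shape (1, 12, 768)
```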
Finally, prediction is made from the hybrid representation through an output classifier layer. The output classifier layer consists of a linear projection and a softmax activation function, and H_h is input into the classifier layer to predict the punctuation sequence ŷ.
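A sketch of such a Linear+Softmax output layer follows, again assuming PyTorch; the four classes correspond to the punctuation types named earlier. In training, the cross-entropy loss would typically be computed on the pre-softmax logits.

```python
import torch
import torch.nn as nn

class PunctuationClassifier(nn.Module):
    """Linear projection + softmax over the four punctuation classes
    (no punctuation, comma, period, question mark) for every token."""
    def __init__(self, dim=768, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, H_h: torch.Tensor) -> torch.Tensor:
        return self.proj(H_h).softmax(dim=-1)   # (batch, T, num_classes)

probs = PunctuationClassifier()(torch.randn(1, 12, 768))
y_hat = probs.argmax(dim=-1)   # predicted punctuation tag per token
```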
In this way, the representations of samples with audio and samples without audio share the same embedding space. The model can therefore receive mixed data in the same training batch, and the trained model is able to punctuate both audio text and audio-free text.
It should be noted that, according to some embodiments of the present disclosure, the aforementioned text encoder, acoustic processing component, and coordinate bootstrapper may all be included as sub-modules/frameworks in the model training device according to embodiments of the present disclosure, i.e., the UniPunc framework. In this way, the UniPunc framework according to the present disclosure provides a universal framework for addressing missing modalities in the multimodal punctuation recovery task. In addition, the solution according to the present disclosure can be effectively applied to some current punctuation models and serve as a beneficial supplement to them. As an example, the acoustic processing component and the coordinate bootstrapper according to the present disclosure may be applied/added to current punctuation models as a modification, so that they are improved to handle modality-missing samples.
In addition, the above example of model training is likewise applicable to multilingual scenarios; in particular, the model can be trained using data and audio in different languages at the same time. Specifically, data in multiple different languages, including text data and/or audio data, is used as input to perform the processing of the stages in the solution according to the present disclosure described above; for example, audio in each language is input into the audio component, and texts in different languages can uniformly use the lexical encoder, so that a performance-enhanced model suitable for multilingual scenarios can be obtained.
Moreover, although not shown, before performing the above processing on text and audio, equalization may be performed on the multilingual input, and the above processing, for example the three stages of processing described above, may then be performed on the equalized text and audio of each language. This will not be described in detail here.
The punctuation model according to some embodiments of the present disclosure may be used in various appropriate punctuation recovery applications. The punctuation model according to some embodiments of the present disclosure can have good universality and can be applied to any speech recognition system; for example, the text output by a speech recognition system can be further processed to optimize its punctuation recovery. For example, punctuation marks can be added to text to which no punctuation has been added, or text to which punctuation has already been added can be verified for further correction.
FIG. 2 shows a flowchart of a punctuation recovery method according to some embodiments of the present disclosure. In method 200, in step S201, text to which punctuation is to be added, for example a speech recognition text output, is acquired, and in step S202, a punctuation model trained by the model training method according to the present disclosure is applied to the acquired text output to recover the punctuation in the text output. Punctuation marks can thereby be appropriately added to the text, realizing accurate punctuation recovery or punctuation verification/correction.
In some embodiments, the input text may be input into the model together with the related audio, so that punctuation can be appropriately added to the input text based on both. If only text is input into the model, meaning that the text has no audio, a virtual sample can be generated and input into the model together with the text, and punctuation is added to the input text on that basis. In other embodiments, audio information for the text input into the punctuation model may be acquired, for example according to an indicator of whether the text input has audio. Where the indicator indicates that the text has audio, the punctuation model may acquire the text audio, for example with the text audio input into the model together with the text or obtained from a predetermined storage location; where the indicator indicates that the text has no audio, for example in the case of single text, a virtual audio sample may be generated and input into the punctuation model. The punctuation model will then perform inference to predict the punctuation marks, whereby punctuation-recovered text can be obtained. This process may be performed in a similar manner to the model training process described above, for example the three stages of processing described above, which will not be described in detail here.
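An inference-time usage sketch consistent with this description might look as follows; the model object, the tokenization, and make_virtual_sample are hypothetical stand-ins for the trained components described above.

```python
PUNCT = {0: "", 1: ",", 2: ".", 3: "?"}  # no punctuation, comma, period, question mark

def make_virtual_sample(length=4, dim=80):
    """All-zero virtual audio sample for text that has no audio."""
    return [[0.0] * dim for _ in range(length)]

def restore_punctuation(model, text, audio=None):
    """Apply a trained punctuation model to a speech recognition text output;
    a virtual sample stands in when no audio is available."""
    tokens = text.split()                               # simplistic tokenization
    audio_input = audio if audio is not None else make_virtual_sample()
    tags = model(tokens, audio_input)                   # one class index per token
    return " ".join(tok + PUNCT[int(t)] for tok, t in zip(tokens, tags))

# Example (with a hypothetical trained `model`):
# restore_punctuation(model, "i dont want anymore kids")
# -> "i dont want anymore, kids."
```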
FIG. 4A shows a punctuation model training device 400 according to some embodiments of the present disclosure. The device 400 includes a sample acquisition unit 401, configured to acquire text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and a training unit 402, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
In some embodiments, the training unit 402 is further configured to perform attention-based processing on the acquired text samples and audio samples to generate a multimodal mixed representation, and the training unit 402 can then carry out model training based on the generated multimodal mixed representation. The multimodal mixed representation may be generated by a mixed representation generation unit 4021. Although not shown, in an exemplary implementation model training may be performed by a component for punctuation prediction, such as a classifier or a similar training component; in that case, such a component may be included in the training unit.
In some embodiments, the mixed representation generation unit 4021 may further include: a text conversion unit 4022, configured to convert text samples into a lexical embedding sequence; an audio conversion unit 4023, configured to convert audio samples into an acoustic embedding sequence; and a joint processing unit 4024, configured to perform attention-based operations on the lexical embedding sequence and the acoustic embedding sequence respectively, and to combine the processing results of the two sequences to generate the multimodal mixed representation. It should be noted that, although not shown, the text conversion unit and the audio conversion unit need not be included in the mixed representation generation unit, or even in the training unit; that is, text conversion and audio conversion may be applied to the samples before model training, in which case the samples actually input for model training are the converted samples.
Each of the above units can be implemented in various appropriate ways. In an exemplary implementation, the text conversion unit 4022 and the audio conversion unit 4023 may correspond at least to the text encoder and the acoustic processing component described above, respectively, and the joint processing unit 4024 may correspond to the coordination guider described above. The mixed representation generation unit 4021 may include or correspond to at least the coordination guider, and may further include the lexical encoder and the acoustic processing component described above. The training unit 402 may include or correspond to at least the modules/devices of all the processing stages shown in FIG. 3, for example the lexical encoder and acoustic processing component, the coordination guider, and the classifier described above.
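As a rough illustration of how units 4022-4024 might be realized, the sketch below wires a lexical encoder (self-attention over text), an acoustic input, and a cross-attention "coordination guider" into one hybrid representation in PyTorch. All module choices, dimensions, and the additive combination are assumptions made for illustration, not the implementation fixed by the present disclosure:

```python
import torch.nn as nn

class MixedRepresentation(nn.Module):
    """Sketch of units 4022-4024: lexical self-attention, acoustic input,
    and cross-attention coordination, combined into one hybrid representation."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.lexical_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)      # self-attention over text
        self.cross_attention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)      # text queries attend to audio

    def forward(self, lexical_emb, acoustic_emb):
        # lexical_emb: (B, T_text, d); acoustic_emb: (B, T_audio, d) or a virtual sample
        text_repr = self.lexical_encoder(lexical_emb)
        audio_repr, _ = self.cross_attention(
            query=text_repr, key=acoustic_emb, value=acoustic_emb)
        return text_repr + audio_repr                # simple additive combination
```

A punctuation classifier (for example, a per-token linear layer) would then be applied over the combined representation, corresponding to the prediction stage described above.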
FIG. 4B shows a punctuation recovery device 410 according to some embodiments of the present disclosure. The device 410 includes an acquisition unit 411, configured to acquire the text output of speech recognition, and a punctuation recovery unit 412, configured to apply a punctuation model, trained by the training method according to embodiments of the present disclosure, to the acquired text output so as to recover punctuation in the text output. The punctuation recovery unit may perform operations/processing similar to those of the training unit described above; for example, it may include the text encoder and acoustic processing component, the coordination guider, the classifier, and so on.
It should be noted that the above units are merely logical modules divided according to the specific functions they implement, and are not intended to limit the specific implementation; they may be implemented, for example, in software, in hardware, or in a combination of the two. In an actual implementation, each of the above units may be implemented as an independent physical entity, or may be implemented by a single entity (for example, a processor (CPU, DSP, etc.) or an integrated circuit). In addition, units shown with dashed lines in the drawings indicate that such units need not actually exist; the operations/functions they implement may be realized by the processing circuitry itself.
In addition, although not shown, the device may further include a memory, which may store various information generated in operation by the device and the units it contains, programs and data for operation, data to be transmitted by a communication unit, and so on. The memory may be volatile memory and/or non-volatile memory; for example, it may include, but is not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read-only memory (ROM), and flash memory. The memory may, of course, also be located outside the device. Optionally, although not shown, the device may also include a communication unit for communicating with other devices. In one example, the communication unit may be implemented in any appropriate manner known in the art, for example including communication components such as antenna arrays and/or radio-frequency links, various types of interfaces, communication units, and so on. The device may further include other components not shown, such as a radio-frequency link, a baseband processing unit, a network interface, a processor, and a controller. These are not described in detail here.
The present disclosure proposes improved training and application of a punctuation model for punctuation recovery. The model can be trained on single-modality text data and paired text-audio data at the same time, thereby addressing the shortage of paired text and audio data. In particular, by using a predefined vector in place of audio as input when no audio is available, a large amount of plain-text data can be used to build the training set, with audio data providing an auxiliary signal; this further improves the training effect and the accuracy of the trained model. In addition, the present disclosure can train cooperatively on multilingual data; in particular, languages with abundant data can be used to strengthen model performance on languages with little data, yielding a further optimized punctuation model. Plain-text data, text-audio aligned data, and data in different languages can thus be trained together, improving the model's performance on plain-text data, on text-audio aligned data, and on low-resource languages. Moreover, such a model is suitable for various application tasks, especially speech recognition, and achieves better punctuation recovery.
The effectiveness of the disclosed solution is further demonstrated below with examples.
For the training and test datasets, experiments are conducted mainly on two real-world corpora: MuST-C and Multilingual TEDx (mTEDx), whose audio is sourced from TED talks. Two datasets were constructed from these corpora: 1) English-Audio: this set contains the English audio and sentences in MuST-C, with audio for every sample. 2) English-Mixed: this set contains all English sentences from both corpora, both with and without audio. Note that English-Audio is a subset of English-Mixed.
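A minimal sketch of assembling such sets follows, under the simplifying assumption that one iterable yields (sentence, audio) pairs and another yields audio-free sentences; both interfaces are hypothetical, and `None` marks sentences whose audio will later be replaced by the virtual sample:

```python
def build_datasets(pairs_with_audio, sentences_without_audio):
    """pairs_with_audio: iterable of (sentence, audio) pairs, e.g. from MuST-C.
    sentences_without_audio: iterable of plain sentences (assumed interfaces)."""
    english_audio = list(pairs_with_audio)
    english_mixed = english_audio + [(s, None) for s in sentences_without_audio]
    return english_audio, english_mixed  # English-Audio is a subset of English-Mixed
```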
On the above datasets, UniPunc, the solution according to the present disclosure, was tested against multimodal models in the related art. The tests show that UniPunc according to embodiments of the present disclosure outperforms the related-art multimodal models on the English-Audio set, and that its performance on the English-Mixed set exceeds its performance on the English-Audio set. This demonstrates that even without audio-free sentences, the disclosed UniPunc outperforms existing multimodal models, and that incorporating audio-free sentences further improves the performance of the disclosed punctuation model.
On the English-Mixed dataset, UniPunc according to the present disclosure was tested against single-modality models in the related art. The tests show that UniPunc effectively obtains a multimodal mixed representation that captures the acoustic features of speech, which markedly improves punctuation recovery for text.
In addition, tests on the multilingual data of mTEDx show that the punctuation recovered by UniPunc according to embodiments of the present disclosure is closer to human punctuation, better distinguishing the pauses marked by commas and periods and the intonation of questions. Furthermore, UniPunc achieves better punctuation performance than the other baselines on multilingual punctuation, indicating better robustness and generalization.
Moreover, UniPunc of the present disclosure can be readily applied to any existing punctuation recovery method. In particular, by introducing the effective modules of the UniPunc framework, especially the acoustic auxiliary module and/or the coordination guider, into an existing punctuation recovery scheme, the performance of that scheme can be further optimized. Experiments show that the UniPunc scheme, and in particular modules such as the acoustic auxiliary module and/or the coordination guider, is generally applicable to the problem of missing modalities in punctuation recovery and enables previous single-modality models to handle multimodal corpora, further improving overall performance.
Some embodiments of the present disclosure also provide an electronic device operable to implement the operations/functions of the aforementioned model training device and/or punctuation recovery device. FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure. For example, in some embodiments, the electronic device 5 can be any of various types of devices, including but not limited to mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. For example, the electronic device 5 may include a display panel for displaying data and/or execution results used in the solutions according to the present disclosure. The display panel may have various shapes, such as rectangular, oval, or polygonal, and may be not only flat but also curved, or even spherical.
As shown in FIG. 5, the electronic device 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51. It should be noted that the components of the electronic device 5 shown in FIG. 5 are exemplary rather than limiting; the electronic device 5 may have other components according to actual application requirements. The processor 52 may control other components in the electronic device 5 to perform desired functions.
In some embodiments, the memory 51 is used to store one or more computer-readable instructions. When executed by the processor 52, the computer-readable instructions implement the method according to any of the above embodiments. For the specific implementation of each step of the method and the related explanations, reference may be made to the above embodiments; repeated details are not described again here.
For example, the processor 52 and the memory 51 may communicate with each other directly or indirectly, for example over a network, which may include a wireless network, a wired network, and/or any combination of the two. The processor 52 and the memory 51 may also communicate with each other through a system bus; the present disclosure places no limitation on this.
For example, the processor 52 may be embodied as any suitable processor or processing device, such as a central processing unit (CPU), a graphics processing unit (GPU), or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The CPU may have, for example, an X86 or ARM architecture. For example, the memory 51 may include any combination of various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The memory 51 may include, for example, system memory storing an operating system, application programs, a boot loader, a database, and other programs. Various application programs, various data, and the like may also be stored in the storage medium.
In addition, according to some embodiments of the present disclosure, when the various operations/processing of the present disclosure are implemented in software and/or firmware, the programs constituting that software may be installed from a storage medium or a network onto a computer system having a dedicated hardware structure, such as the computer system 600 shown in FIG. 6; when the various programs are installed, the computer system can perform various functions, including those described above. FIG. 6 is a block diagram showing an example structure of a computer system employable in some embodiments of the present disclosure.
In FIG. 6, a central processing unit (CPU) 601 executes various processing according to programs stored in a read-only memory (ROM) 602 or loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, data required when the CPU 601 executes the various processing. The central processing unit is merely exemplary; it may also be another type of processor, such as the various processors described above. The ROM 602, RAM 603, and storage section 608 may be various forms of computer-readable storage media, as described below. It should be noted that although the ROM 602, RAM 603, and storage section 608 are shown separately in FIG. 6, one or more of them may be combined or located in the same or different memories or storage modules.
The CPU 601, ROM 602, and RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606, such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, or gyroscope; an output section 607, including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage section 608, including a hard disk, magnetic tape, and the like; and a communication section 609, including a network interface card such as a LAN card or a modem. The communication section 609 allows communication processing to be performed via a network such as the Internet. It is easy to understand that although FIG. 6 shows the devices or modules in the computer system 600 communicating over the bus 604, they may also communicate over a network or by other means, where the network may include a wireless network, a wired network, and/or any combination of the two.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read from it can be installed into the storage section 608 as needed.
When the above series of processing is implemented in software, the programs constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 611.
According to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing methods according to some embodiments of the present disclosure. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 609, installed from the storage section 608, or installed from the ROM 602. When the computer program is executed by the CPU 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that, in the context of the present disclosure, a computer-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in combination with an instruction execution system, apparatus, or device. By contrast, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to electric wires, optical cables, RF (radio frequency), or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
In some embodiments, a computer program is also provided, comprising instructions that, when executed by a processor, cause the processor to perform the method of any of the above embodiments. For example, the instructions may be embodied as computer program code.
In embodiments of the present disclosure, computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules, components, or units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module, component, or unit does not, in some cases, constitute a limitation on the module, component, or unit itself.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
According to some embodiments of the present disclosure, a method for training a model for punctuation recovery in speech recognition is proposed, comprising the following steps: acquiring text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples.
In some embodiments, the text samples and corresponding audio samples used for model training may be acquired from both text with audio and text without audio: from text with audio, paired text samples and audio samples are acquired, and from text without audio, a text sample is acquired and a virtual sample is acquired as the audio sample corresponding to that text sample.
In some embodiments, the virtual sample is a predefined sample serving as fictitious audio for the text sample.
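For illustration, a minimal sketch of this acquisition step follows, assuming the virtual sample is a fixed predefined vector (zeros here; the dimensionality is an arbitrary illustrative choice):

```python
import numpy as np

# Hypothetical predefined stand-in for the audio of an audio-free text sample.
VIRTUAL_SAMPLE = np.zeros((1, 512), dtype=np.float32)

def acquire_samples(texts_with_audio, texts_without_audio):
    """Yield (text, audio) training pairs from both kinds of source text."""
    for text, audio in texts_with_audio:      # paired text and audio
        yield text, audio
    for text in texts_without_audio:          # audio-free text: use the virtual sample
        yield text, VIRTUAL_SAMPLE
```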
In some embodiments, training the model for punctuation recovery in speech recognition may include: generating a mixed representation based on the acquired text samples and audio samples, and training the model based on that mixed representation.
In some embodiments, the mixed representation may be a multimodal mixed representation obtained by performing attention-based processing on the acquired text samples and audio samples.
In some embodiments, attention-based processing may be performed on the audio samples used for model training based on text information related to the text samples.
In some embodiments, the text information related to a text sample may be a lexical feature obtained by converting the text sample, or a processing result obtained by performing attention-based processing on the text sample.
In some embodiments, text samples are converted into lexical embedding sequences, audio samples are converted into acoustic embedding sequences, and attention-based operations are performed on the lexical embedding sequences and the acoustic embedding sequences respectively.
In some embodiments, when an audio sample is the audio form of the text content of the corresponding text sample, the acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; when the audio sample is a virtual sample, the virtual sample itself is used as the acoustic embedding sequence.
In some embodiments, the attention-based processing performed on the text samples used for model training may be self-attention-based processing, and the attention-based processing performed on the audio samples used for model training may be cross-attention-based processing.
In some embodiments, the acquired text samples and corresponding audio samples for model training may include multilingual text samples and audio samples.
In some embodiments, the multilingual text samples and audio samples may be equalized to increase the proportion of low-resource language samples.
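One common way to realize such equalization is temperature-based resampling over the per-language corpus sizes; the following sketch is an illustrative assumption rather than the specific method of the present disclosure:

```python
def equalized_sample_counts(lang_sizes, temperature=0.5):
    """lang_sizes: dict mapping language -> number of samples.
    Returns target sample counts whose distribution is flattened so that
    low-resource languages receive a larger share (temperature < 1)."""
    total = sum(lang_sizes.values())
    weights = {lang: (n / total) ** temperature for lang, n in lang_sizes.items()}
    norm = sum(weights.values())
    return {lang: round(total * w / norm) for lang, w in weights.items()}

# Example: a 100k-sentence language and a 1k-sentence language; the smaller
# language's share rises from about 1% to about 9% of the sampled data.
# counts = equalized_sample_counts({"en": 100_000, "pt": 1_000})
```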
According to some embodiments of the present disclosure, a method for punctuation recovery in speech recognition is proposed, comprising the following steps: acquiring the text output of speech recognition, and applying a punctuation model, trained by the model training method of any embodiment described in the present disclosure, to the acquired text output to recover punctuation in the text output.
According to other embodiments of the present disclosure, a training device for a model for punctuation recovery in speech recognition is proposed, comprising: a sample acquisition unit, configured to acquire text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and a training unit, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples.
In some embodiments, the training unit may be further configured to perform attention-based processing on the acquired text samples and audio samples to generate a multimodal mixed representation.
In some embodiments, the device may further include: a text conversion unit, configured to convert text samples into lexical embedding sequences; and an audio conversion unit, configured to convert audio samples into acoustic embedding sequences, with the training unit configured to perform attention-based operations on the lexical embedding sequences and the acoustic embedding sequences respectively.
According to other embodiments of the present disclosure, a device for punctuation recovery in speech recognition is proposed, comprising: an acquisition unit, configured to acquire the text output of speech recognition, and a punctuation recovery unit, configured to apply a punctuation model, trained by the model training method of any embodiment described in the present disclosure, to the acquired text output to recover punctuation in the text output.
According to still other embodiments of the present disclosure, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method of any embodiment described in the present disclosure.
According to still other embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method of any embodiment described in the present disclosure is implemented.
According to still other embodiments of the present disclosure, a computer program is provided, comprising instructions that, when executed by a processor, cause the processor to perform the method of any embodiment described in the present disclosure.
According to some embodiments of the present disclosure, a computer program product is provided, comprising instructions that, when executed by a processor, implement the method of any embodiment described in the present disclosure.
According to some embodiments of the present disclosure, a computer program is provided whose program code, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
The above description is merely of some embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved herein is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the disclosed concept, for example technical solutions formed by substituting the above features with technical features of similar functions disclosed in (but not limited to) the present disclosure.
Many specific details are set forth in the description provided herein. However, it is understood that embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, separately or in any suitable sub-combination.
Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (20)

  1. A method for training a model for punctuation recovery in speech recognition, comprising the following steps:
    acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and
    training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
  2. The method according to claim 1, wherein the text samples and corresponding audio samples used for model training are acquired from both text with audio and text without audio,
    wherein paired text samples and audio samples are acquired from text with audio, and, from text without audio, a text sample is acquired and a virtual sample is acquired as the audio sample corresponding to that text sample.
  3. The method according to claim 1 or 2, wherein the virtual sample is a predefined sample serving as fictitious audio for the text sample.
  4. The method according to claim 1, wherein training the model for punctuation recovery in speech recognition comprises:
    generating a multimodal mixed representation based on the acquired text samples and audio samples, and
    training the model based on the multimodal mixed representation.
  5. The method according to claim 4, wherein the multimodal mixed representation is obtained by performing attention-based processing based on the acquired text samples and audio samples.
  6. The method according to claim 5, wherein attention-based processing is performed on the audio samples used for model training based on text information related to the text samples.
  7. The method according to claim 6, wherein the text information related to a text sample is a lexical feature obtained by converting the text sample, or a processing result obtained by performing attention-based processing on the text sample.
  8. The method according to claim 1, wherein:
    text samples are converted into lexical embedding sequences,
    audio samples are converted into acoustic embedding sequences,
    attention-based operations are performed on the lexical embedding sequences and the acoustic embedding sequences respectively; and
    the operated lexical embedding sequences and acoustic embedding sequences are combined to generate a multimodal mixed representation.
  9. The method according to claim 8, wherein:
    when an audio sample is the audio form of the text content of the corresponding text sample, the acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; and
    when the audio sample is a virtual sample, the virtual sample is used as the acoustic embedding sequence.
  10. The method according to any one of claims 5-9, wherein the attention-based processing performed on the text samples used for model training is self-attention-based processing, and the attention-based processing performed on the audio samples used for model training is cross-attention-based processing.
  11. The method according to any one of claims 1-10, wherein the acquired text samples and corresponding audio samples for model training include multilingual text samples and corresponding audio samples.
  12. The method according to claim 11, wherein the multilingual text samples and corresponding audio samples are equalized to increase the proportion of low-resource language samples.
  13. A method for punctuation recovery in speech recognition, comprising the following steps:
    acquiring the text output of speech recognition, and
    applying a punctuation model trained by the method according to any one of claims 1-12 to the acquired text output to recover punctuation in the text output.
  14. A training device for a model for punctuation recovery in speech recognition, comprising:
    a sample acquisition unit, configured to acquire text samples and corresponding audio samples for model training, wherein for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and
    a training unit, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
  15. The device according to claim 14, wherein the training unit further comprises a mixed representation generation unit configured to:
    generate a multimodal mixed representation by performing attention-based processing based on the acquired text samples and audio samples.
  16. The device according to claim 15, wherein the training unit further comprises:
    a text conversion unit, configured to convert text samples into lexical embedding sequences,
    an audio conversion unit, configured to convert audio samples into acoustic embedding sequences, and
    a joint processing unit, configured to perform attention-based operations on the lexical embedding sequences and the acoustic embedding sequences respectively, and to combine the operated lexical embedding sequences and acoustic embedding sequences to generate a multimodal mixed representation.
  17. A device for punctuation recovery in speech recognition, comprising:
    an acquisition unit, configured to acquire the text output of speech recognition, and
    a punctuation recovery unit, configured to apply a punctuation model trained by the method according to any one of claims 1-12 to the acquired text output to recover punctuation in the text output.
  18. An electronic device, comprising:
    a memory; and
    a processor coupled to the memory, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1-12.
  19. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method according to any one of claims 1-12.
  20. A computer program product comprising instructions that, when executed by a processor, cause the method according to any one of claims 1-12 to be implemented.
PCT/CN2022/125163 2021-11-11 2022-10-13 Method for punctuation recovery in speech recognition, and device and storage medium WO2023082931A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111335102.5 2021-11-11
CN202111335102.5A CN114120975A (en) 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery

Publications (1)

Publication Number Publication Date
WO2023082931A1 (en)

Family

ID=80378660

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125163 WO2023082931A1 (en) 2021-11-11 2022-10-13 Method for punctuation recovery in speech recognition, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114120975A (en)
WO (1) WO2023082931A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120975A (en) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 Method, apparatus and storage medium for speech recognition punctuation recovery
CN116229994B (en) * 2023-05-08 2023-07-21 北京爱数智慧科技有限公司 Construction method and device of label prediction model of Arabic language

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113450756A (en) * 2020-03-13 2021-09-28 Tcl科技集团股份有限公司 Training method of voice synthesis model and voice synthesis method
CN113488061A (en) * 2021-08-05 2021-10-08 国网江苏省电力有限公司 Distribution network dispatcher identity verification method and system based on improved Synth2Aug
WO2021215262A1 (en) * 2020-04-20 2021-10-28 株式会社Nttドコモ Punctuation mark delete model training device, punctuation mark delete model, and determination device
CN114120975A (en) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 Method, apparatus and storage medium for speech recognition punctuation recovery

Also Published As

Publication number Publication date
CN114120975A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
WO2023082931A1 (en) Method for punctuation recovery in speech recognition, and device and storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US10592607B2 (en) Iterative alternating neural attention for machine reading
WO2022143058A1 (en) Voice recognition method and apparatus, storage medium, and electronic device
CN111489735B (en) Voice recognition model training method and device
WO2022037419A1 (en) Audio content recognition method and apparatus, and device and computer-readable medium
WO2022228041A1 (en) Translation model training method, apparatus, and device, and storage medium
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
US11132996B2 (en) Method and apparatus for outputting information
CN111563390B (en) Text generation method and device and electronic equipment
WO2023082916A1 (en) Training method, speech translation method, device and computer-readable medium
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
WO2022100481A1 (en) Text information translation method and apparatus, electronic device, and storage medium
WO2022237665A1 (en) Speech synthesis method and apparatus, electronic device, and storage medium
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
JP2023550211A (en) Method and apparatus for generating text
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN111933119B (en) Method, apparatus, electronic device, and medium for generating voice recognition network
CN111161724B (en) Method, system, equipment and medium for Chinese audio-visual combined speech recognition
CN115983294B (en) Translation model training method, translation method and translation equipment
WO2023138361A1 (en) Image processing method and apparatus, and readable storage medium and electronic device
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
WO2023011260A1 (en) Translation processing method and apparatus, device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891728

Country of ref document: EP

Kind code of ref document: A1