WO2023082931A1 - Method for punctuation recovery in speech recognition, and device and storage medium - Google Patents

Method for punctuation recovery in speech recognition, and device and storage medium

Info

Publication number
WO2023082931A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
audio
samples
punctuation
sample
Application number
PCT/CN2022/125163
Other languages
French (fr)
Chinese (zh)
Inventor
吴礼蔚
朱耀明
程善伯
王明轩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023082931A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to speech recognition, including punctuation recovery in speech recognition.
  • Automatic Speech Recognition (ASR), a technology that converts human speech into text, has a wide range of applications and can serve as an upstream component for multiple tasks, such as voice assistants and speech translation.
  • Existing commercial speech recognition systems often output text without punctuation, which may lead to misunderstandings and affect the performance of downstream tasks such as machine translation and information extraction.
  • Text without punctuation is difficult to read: it has poor readability, unclear sentence boundaries, and ambiguity. Moreover, downstream tasks such as machine translation and information extraction assume punctuated input, so text without punctuation can degrade their performance.
  • a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and a training unit configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • an electronic device including: a memory; and a processor coupled to the memory, the processor configured, based on instructions stored in the memory, to perform the method of any embodiment described in the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, causes the method of any embodiment described in the present disclosure to be implemented.
  • a computer program product comprising instructions which, when executed by a processor, cause the method of any one of the embodiments described in the present disclosure to be implemented.
  • a computer program whose program code, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
  • FIGS. 1A to 1C show flowcharts of a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • Fig. 2 shows a flowchart of a method for recovering punctuation in speech recognition according to some embodiments of the present disclosure.
  • FIG. 3 illustrates an exemplary implementation of model training for speech recognition punctuation recovery according to some embodiments of the present disclosure.
  • FIG. 4A shows a block diagram of a training device for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • FIG. 4B shows a block diagram of a speech recognition punctuation recovery device according to some embodiments of the present disclosure.
  • Figure 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
  • FIG. 6 shows a block diagram of other embodiments of the electronic device of the present disclosure.
  • The term "comprising" and its variants as used in the present disclosure are open-ended terms that include at least the following elements/features but do not exclude other elements/features, i.e., "including but not limited to"; "including" is used synonymously with "comprising".
  • the term “based on” means “based at least in part on”.
  • References throughout this specification to "one embodiment," "some embodiments," or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments.”
  • Appearances of the phrases "in one embodiment," "in some embodiments," or "in an embodiment" in various places throughout the specification are not necessarily all referring to the same embodiment, but they may refer to the same embodiment.
  • To solve the problem of missing punctuation in the output text of speech recognition, a punctuation recovery task is proposed and a corresponding punctuation model is designed.
  • The punctuation recovery task aims to correctly add punctuation marks to the output of the automatic speech recognition system, such as output text sentences; in implementations, punctuation recovery is performed using a corresponding punctuation model.
  • Conventional punctuation models either use text only or require audio corresponding to the text, and are therefore often limited in real-world scenarios.
  • The present disclosure proposes an improved solution that can use both text with audio ("audio text") and text without audio ("non-audio text") for punctuation model training.
  • The scheme of the present disclosure can make full use of non-audio text to construct a large training set for the punctuation model and can use limited audio for further optimization, thereby improving the accuracy of the punctuation model and providing a model whose punctuation recovery of text is enhanced by optional audio.
  • the present disclosure proposes a so-called unified multimodal punctuation framework, which may be called UniPunc, which is capable of utilizing both audio text and non-audio text for punctuation model training.
  • For text with audio, the corresponding text-audio pair can be obtained directly; for text without audio, corresponding audio can be fabricated by constructing virtual content, so that a text-audio pair can likewise be constructed.
  • This enables punctuation model training based on text and audio pairs obtained from both audio text and non-audio text. It is thus possible to utilize readily available large amounts of non-audio text/corpus as a training set, and utilize sound input to reduce ambiguity in punctuation marks, resulting in punctuation models with improved accuracy.
  • The solution disclosed in the present disclosure can further improve the training and application of the punctuation model in multilingual scenarios.
  • Conventional multimodal methods can only handle single-language input, and data in some languages is difficult to obtain and is very rare. For example, it is difficult to obtain a large amount of high-quality text and audio data in many small languages, which cannot meet the needs of model training.
  • Training directly on such scarce data results in a model that performs very poorly.
  • The present disclosure uses a multilingual collaborative training method to address insufficient data for low-resource languages.
  • The present disclosure can simultaneously use audio and text in multiple languages for model training, so that readily available data from widely used languages enhances performance on low-resource languages, yielding an improved multilingual punctuation model.
  • FIG. 1A shows a training method of a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
  • In step S101, text samples and corresponding audio samples for model training are obtained, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and in step S102, the model for punctuation recovery in speech recognition is trained based on the acquired text samples and corresponding audio samples for model training.
  • The text samples and corresponding audio samples used for model training substantially correspond to a set of paired text and audio samples: for each text sample, a corresponding audio sample is obtained, which is either audio of the text sample's content or a virtual sample standing in for fictitious audio of the text sample.
  • The text samples and corresponding audio samples used for model training are obtained from both audio text and non-audio text, for example from an initial set containing both.
  • Paired text and audio samples can be obtained from audio text by converting/extracting/segmenting in various appropriate ways, which will not be described in detail here.
  • For non-audio text, a text sample is obtained (such a text sample may be referred to as a single text sample) and a virtual sample is used as the corresponding audio sample of that text sample.
  • A virtual sample may be a preset sample representing fictitious audio for a text sample.
  • a virtual sample may take any suitable form. In particular, it may be a vector with preset content, such as a vector with fixed length and fixed content, preferably an all-zero vector. Of course, it can also be in other suitable forms. In some embodiments, for all single text samples used for model training, virtual samples can be constructed for them, and their virtual samples can be the same or different from each other.
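  • As an illustration, a minimal sketch (not from the source; shapes, dimensions, and helper names are assumptions) of constructing such an all-zero virtual audio sample and pairing it with an audio-free text sample might look as follows:

```python
from typing import Optional

import numpy as np

# The fabricated ("virtual") audio sample for audio-free text is a preset
# vector of fixed length and fixed content, here the all-zero variant
# mentioned above. D and VIRTUAL_LEN are hypothetical hyperparameters.
D = 512          # assumed acoustic feature dimension
VIRTUAL_LEN = 1  # assumed fixed length of the fabricated audio sequence

def make_virtual_audio_sample() -> np.ndarray:
    """Return a preset (VIRTUAL_LEN, D) all-zero 'audio' sample."""
    return np.zeros((VIRTUAL_LEN, D), dtype=np.float32)

def pair_sample(text: str, audio: Optional[np.ndarray]):
    """Pair a text sample with real audio features, or a virtual sample if none exist."""
    has_audio = audio is not None
    return text, audio if has_audio else make_virtual_audio_sample(), has_audio
```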
  • the initial set containing both audio text and non-audio text may be obtained by any suitable means.
  • They can be obtained from a conventional database, such as an existing training database in a storage device; or using an appropriate type of acquisition device, for example a microphone to capture audio data and a video device or an appropriate text input device to enter text data; or from user-entered text and/or human-annotated audio.
  • audio text and non-audio text may be stored in any suitable manner for obtaining text samples and audio samples.
  • These texts are provided with indication information of whether the associated text has audio, such as an indicator in binary form, so that the model training device can determine from the indication information whether to generate a virtual audio sample for a text sample, for example generating a virtual sample only if the indicator indicates that the text sample has no audio.
  • Training a punctuation model for speech recognition punctuation recovery may include generating a multimodal mixed representation based on the acquired text samples and audio samples for model training, as in step S1021 in FIG. 1B, and performing model training based on the generated multimodal mixed representation, as in step S1022 in FIG. 1B.
  • the obtained multimodal mixed representation can contain text and audio information obtained from the training sample set effectively created by utilizing audio text and non-audio text, and such multimodal mixed representation can well support model training.
  • For each pair of text and audio samples, a corresponding multimodal mixed representation is obtained, thereby obtaining multimodal mixed representations for all text samples and audio samples.
  • a multimodal hybrid representation may be obtained by performing processing based on text samples and audio samples separately, and combining the processing results based on text samples and the processing results based on audio samples.
  • a hybrid representation is generated based on the sum of processing results of text samples and audio samples. It should be noted that other appropriate processing can also be performed on the processing results of text samples and audio samples to generate a mixed representation, such as weighted sum or other appropriate mathematical operations, etc., which will not be described in detail here.
  • The processing performed on text samples and audio samples may be attention-based, so that the multimodal mixed representation is obtained by performing attention-based processing on the acquired text samples and audio samples separately and combining the two sets of processing results.
  • the attention-based processing applied to the text samples and audio samples may be any suitable processing, preferably they are different from each other.
  • text samples are processed using a self-attention mechanism.
  • audio samples are processed using a cross-attention mechanism.
  • the self-attention mechanism and the cross-attention mechanism can adopt various appropriate architectures/algorithms/models, etc., which will not be described in detail here.
  • a multimodal hybrid representation can be generated by combining the self-attention processing results of text samples and the cross-attention processing results of audio samples.
  • text information related to text samples is also considered when performing attention-based processing on audio samples used for model training. That is to say, in the processing of the audio sample, the text information related to the text sample corresponding to the audio sample will also be used as an input parameter for processing.
  • the text information related to the text sample may be a feature converted/extracted from the text sample or a processing result obtained by performing attention-based processing on the text sample, such as self-attention processing.
  • Training the model for speech recognition punctuation recovery based on the text samples and corresponding audio samples used for model training may further include converting the text samples and audio samples, and performing punctuation model training based on the conversion results.
  • text samples and audio samples may be converted separately to obtain intermediate feature values/sequences, and then a multimodal hybrid representation may be generated based on the intermediate feature values/sequences converted from text samples and audio samples for use in Punctuation model training.
  • Attention-based processing can be performed on the intermediate feature values/sequences converted from text samples and audio samples to generate a multimodal mixed representation, and punctuation model training is performed based on that mixed representation, as shown in FIG. 1C.
  • the attention-based processing performed on intermediate feature values/sequences can be performed as previously described and will not be described in detail here.
  • the intermediate feature values/sequences may be in any suitable form and may be obtained through corresponding operations.
  • A text sample can be converted into a sequence of lexical embeddings as its intermediate feature value/sequence.
  • text samples may be subjected to lexical encoding processing to obtain lexical embedding sequences.
  • The lexical encoding processing may operate on sequences of word embeddings.
  • The lexical embedding sequence, or its result after self-attention processing, can also serve as the text information related to the text sample that is applied to the attention-based processing of the audio sample, for example as an input to the cross-attention processing of the audio sample.
  • audio samples may be processed to obtain acoustic embedding content as their intermediate feature values/sequences.
  • the processing of audio samples may be implemented in various suitable ways.
  • the processing of the audio samples may include encoding and/or downsampling of the audio samples.
  • Corresponding processing may be performed on the audio sample depending on whether the text sample has audio. For example, encoding and/or downsampling may be performed on audio samples derived from audio text to generate acoustic embedded content, while virtual samples corresponding to non-audio text may be used directly as acoustic embedded content.
  • Indication information about whether a text sample has audio can be input into the model training device together with the text sample and the audio sample, so that the device can, according to the indication information, process the audio sample corresponding to the text sample accordingly for model training.
  • Attention-based processing, such as cross-attention processing, may be performed on the acoustic embedded content.
  • training a model for punctuation recovery in speech recognition may include learning/predicting punctuation marks based on the acquired text samples and audio samples for model training, thereby performing punctuation model training.
  • punctuation marks can be learned/predicted based on the multimodal mixed representation described above, so as to perform punctuation model training.
  • a classifier may be employed for punctuation learning/prediction.
  • the classifier can be a classifier of various appropriate forms, such as a linear classifier (Linear+Softmax), and of course can also be any other appropriate form, which will not be described in detail here.
  • The scheme of the present disclosure can train on both single text data and paired text-audio data, solving the problem of insufficient paired data: for single text data without corresponding audio content, virtual samples of fictitious audio serve as its audio data, so that large amounts of plain text data can be used for training while audio data is effectively used to enhance the training effect.
  • the text and audio samples used to train the punctuation model may include multilingual text and audio samples.
  • multilingual text and audio samples used for model training can be obtained as described above.
  • For each text sample, a corresponding audio sample can be obtained, which may be either a sample of the text content in audio form or a virtual sample.
  • equalization may be performed on the multilingual text and audio samples used for model training, so as to further optimize the samples for multilingual model training.
  • Text samples and/or audio samples of languages that account for a small proportion of the data, such as hard-to-obtain, minority, or low-resource language samples, can undergo expansion processing to balance the per-language proportions of text and/or audio used for model training, in particular by increasing the proportion of text samples and/or audio samples from under-represented languages.
  • data expansion can be performed in various appropriate ways, for example, operations such as repetition, interpolation, and the like can be performed on text samples and/or audio samples of languages with a low proportion.
  • multilingual text samples and/or audio samples may also be processed for equalization by performing temperature sampling or similar sampling algorithms.
  • Due to the scarcity of data for low-resource languages, temperature sampling can be used, according to the amount of data in each language, to increase the proportion of low-resource language data and reduce the proportion of high-resource language data. This makes the per-language proportions in the overall multilingual data as balanced as possible, which can further optimize the model training effect; for example, the trained model improves punctuation recovery across languages.
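  • As a hedged illustration: temperature sampling is commonly implemented by drawing language l with probability proportional to (n_l / N)^(1/T) for T > 1, which flattens the distribution toward uniform. The disclosure does not spell out the exact exponent, so the following sketch is an assumption:

```python
# Temperature sampling over multilingual data: each language l with n_l
# examples is sampled with probability proportional to (n_l / N) ** (1 / T).
# T > 1 flattens the distribution, raising the share of low-resource languages.
def temperature_sampling_weights(counts: dict, T: float = 5.0) -> dict:
    total = sum(counts.values())
    scaled = {lang: (n / total) ** (1.0 / T) for lang, n in counts.items()}
    z = sum(scaled.values())
    return {lang: w / z for lang, w in scaled.items()}

# Example: English dominates the raw corpus; T = 5 lifts the low-resource share.
print(temperature_sampling_weights({"en": 1_000_000, "fr": 100_000, "sw": 10_000}))
```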
  • the scheme of the present disclosure receives text input and optionally audio input to perform the task of punctuation recovery to add punctuation to text.
  • the text input and optional audio input here may be the aforementioned text samples and audio samples used for training, or the aforementioned multilingual text samples and audio samples.
  • The disclosed scheme mainly includes three stages: the first stage processes the text input and the audio input separately; the second stage performs attention-based processing on the processed text and audio inputs so as to integrate the audio information into the text representation and obtain a multimodal mixed representation; and the third stage performs punctuation prediction based on the multimodal mixed representation for model training.
  • The disclosed training process uses classic stochastic gradient descent with the Adam optimizer to minimize a cross-entropy loss.
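  • A minimal sketch of such a training step, assuming a hypothetical `model` (not defined by the source) that maps text/audio batches to per-token logits:

```python
import torch
import torch.nn as nn

# `model` is a hypothetical module mapping (text_batch, audio_batch) to
# per-token logits of shape (batch, seq_len, n_classes).
criterion = nn.CrossEntropyLoss()

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               text_batch, audio_batch, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(text_batch, audio_batch)           # (batch, seq_len, n_classes)
    loss = criterion(logits.transpose(1, 2), labels)  # CE expects (batch, n_classes, seq_len)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam, as stated above
```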
  • Punctuation recovery is often modeled as a sequence labeling task.
  • x is an unpunctuated sentence of length T and a is the corresponding speech audio.
  • The output of the model should be the predicted punctuation sequence y given x and a. Owing to the nature of the sequence labeling task, the punctuation sequence y has the same length as the unpunctuated sentence x, i.e., |y| = |x| = T.
  • The punctuation marks may be any suitable punctuation marks available for the text, for example of four types: comma (,), period (.), question mark (?), and no punctuation.
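  • Purely for illustration, the sequence labeling formulation with these four classes can be pictured as follows (the example sentence and labels are invented, not from the source):

```python
# Each token is labeled with the punctuation mark that follows it (or NONE),
# so the label sequence has the same length as the input sequence.
LABELS = ["NONE", "COMMA", "PERIOD", "QUESTION"]

x = ["how", "are", "you", "i", "am", "fine"]                # unpunctuated, length T
y = ["NONE", "NONE", "QUESTION", "NONE", "NONE", "PERIOD"]  # |y| == |x| == T
assert len(x) == len(y)
```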
  • Text input can be encoded in the first stage.
  • The encoding process here can be performed in various suitable ways; in particular, unpunctuated text sentences can be split into sequences of subwords and converted into sequences of lexical embeddings.
  • A lexical encoder can be built using a pre-trained natural language processing (NLP) model as a backbone, and the lexical encoder's model can be fine-tuned on task-specific data.
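  • A sketch of such a lexical encoder, assuming a HuggingFace-style pretrained backbone; the disclosure names no specific model, so multilingual BERT here is purely an illustrative choice:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT is an illustrative backbone, not named by the source.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
backbone = AutoModel.from_pretrained("bert-base-multilingual-cased")

def lexical_embeddings(sentence: str) -> torch.Tensor:
    """Split an unpunctuated sentence into subwords and return H_l of shape (seq_len, hidden)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # frozen here; in training the backbone would be fine-tuned
        out = backbone(**inputs)
    return out.last_hidden_state.squeeze(0)  # the lexical embedding sequence H_l
```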
  • the encoding process of the text may be performed by a text encoder, which may be included in the solutions according to the embodiments of the present disclosure.
  • The audio input can be processed; in particular, it can be converted into an acoustic embedding H_a or a virtual embedding H_i depending on the type of the audio input.
  • audio processing may be performed by various suitable acoustic processing components, which may be included in solutions according to embodiments of the present disclosure.
  • Audio input that is an audio representation of the content of the text input, such as audio paired with its transcription, can be converted into acoustic features by a pre-trained acoustic model/acoustic feature extractor.
  • The acoustic feature extractor can first be pre-trained by self-supervised learning on unlabeled audio datasets, and can be fine-tuned for the downstream punctuation recovery task.
  • A downsampling network can then be applied to shorten the length of the extracted acoustic features, yielding the acoustic embedding content.
  • The goal is to make the length of the acoustic embedding close or equal to that of the sentence embedding, so that the model can better align cross-modal information.
  • a multi-layer convolutional network can be chosen as the core component of the downsampling network.
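  • A minimal sketch of such a downsampling network as a stack of strided 1-D convolutions; the kernel sizes, strides, and dimensions are assumptions, not values from the source:

```python
import torch
import torch.nn as nn

# Two stride-2 convolutions shorten the acoustic sequence roughly 4x, so its
# length approaches that of the subword sequence.
class Downsampler(nn.Module):
    def __init__(self, in_dim: int = 768, out_dim: int = 768, n_layers: int = 2):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Conv1d(dim, out_dim, kernel_size=5, stride=2, padding=2), nn.GELU()]
            dim = out_dim
        self.net = nn.Sequential(*layers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) -> (batch, ~time/4, dim)
        return self.net(feats.transpose(1, 2)).transpose(1, 2)
```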
  • The virtual embedding H_i can be a fixed-length array of learnable parameters that is expected to learn a representation for missing audio. That is, the virtual audio sample corresponding to non-audio text, as mentioned above, can be a predetermined vector sequence, such as an all-zero sequence, from which a virtual embedding is obtained, either used directly or shortened so that the virtual embedding's length is close or equal to that of the sentence embedding.
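  • A sketch of the virtual embedding as a fixed-length array of learnable parameters; the length, dimension, and zero initialization are assumptions:

```python
import torch
import torch.nn as nn

# The virtual embedding H_i: learnable parameters standing in for missing
# audio, broadcast over the batch.
class VirtualEmbedding(nn.Module):
    def __init__(self, length: int = 4, dim: int = 768):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(length, dim))  # zero-initialized, then learned

    def forward(self, batch_size: int) -> torch.Tensor:
        return self.weight.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, length, dim)
```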
  • the acoustic and lexical features can be combined to generate a multimodal hybrid representation.
  • This operation can be performed by a coordination bootstrapper, which can jointly train on non-audio text and audio text to overcome the modality-missing problem.
  • The coordination bootstrapper jointly exploits acoustic and lexical features and applies attention-based operations to learn a hybrid representation of the two modalities.
  • Self-attention is first applied to the lexical embedding sequence H_l to capture the long-range dependencies S_l in the unpunctuated sentence, and cross-attention is applied between the lexical embedding sequence H_l and the acoustic embedding sequence H_a to form a cross-modal representation S_a: S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), where H_l provides the queries and H_a the keys and values.
  • The hybrid representation H_h is obtained by adding the attention-processed representations to the residual connection: H_h = H_l + S_l + S_a.
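  • A hedged PyTorch sketch of one such layer implementing S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), and H_h = H_l + S_l + S_a; the head count and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CoordinationBootstrapper(nn.Module):
    """One layer: S_l = SelfAttn(H_l), S_a = CrossAttn(H_l, H_a), H_h = H_l + S_l + S_a."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_l: torch.Tensor, h_a: torch.Tensor) -> torch.Tensor:
        s_l, _ = self.self_attn(h_l, h_l, h_l)   # long-range dependencies in the text
        s_a, _ = self.cross_attn(h_l, h_a, h_a)  # lexical queries attend to acoustic keys/values
        return h_l + s_l + s_a                   # hybrid representation H_h (residual sum)
```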
  • Coordination bootstrappers can be stacked in multiple layers to further increase model capacity.
  • The output classifier layer consists of a linear projection and a softmax activation function; H_h is input to the classifier layer to predict the punctuation sequence.
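  • For illustration, such a classifier layer might be sketched as follows; the hidden size and class count are assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(768, 4), nn.Softmax(dim=-1))  # linear projection + softmax

h_h = torch.randn(1, 10, 768)   # a dummy hybrid representation H_h (batch, seq_len, hidden)
y_prob = classifier(h_h)        # (1, 10, 4) per-token class probabilities
y_pred = y_prob.argmax(dim=-1)  # predicted punctuation sequence
```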
  • the model can receive mixed data in the same training batch, and the trained model is able to punctuate both audio text and non-audio text.
  • The aforementioned text encoder, acoustic processing component, and coordination bootstrapper can all be included as submodules in the model training device according to embodiments of the present disclosure, i.e., the UniPunc framework.
  • The UniPunc framework according to the present disclosure provides a general framework for addressing missing modalities in multimodal punctuation recovery tasks.
  • the solution according to the present disclosure can also be effectively applied to some current punctuation models, and can be used as a useful supplement thereto.
  • The acoustic processing component and coordination bootstrapper according to the present disclosure can be applied/added to current punctuation models so that they are improved to handle modality-missing samples.
  • model training is also applicable to multilingual scenarios.
  • the model can be trained using data and audio in different languages at the same time.
  • Data in a plurality of different languages, including text data and/or audio data, is used as input to the processing at each stage of the disclosed solution; for example, audio in each language is fed to the acoustic component, while texts in different languages can share the lexical encoder, enabling a performance-enhanced model suitable for multilingual scenarios.
  • Equalization processing can be performed on the multilingual input, and the above-mentioned processing, for example the three stages described above, can then be performed on the equalized text and audio of each language. It will not be described in detail here.
  • Punctuation models according to some embodiments of the present disclosure may be used in various suitable punctuation recovery applications.
  • The punctuation model according to some embodiments of the present disclosure has good universality and can be applied to any speech recognition system, for example by further processing the system's text output to optimize its punctuation recovery.
  • punctuation can be performed on unpunctuated text, or validation can be performed on punctuated text for further correction.
  • FIG. 2 shows a flowchart of a method for punctuation recovery according to some embodiments of the present disclosure.
  • The text to which punctuation is to be added is obtained, such as the text output of speech recognition, and the punctuation model trained according to the model training method of the present disclosure is applied to the obtained text output to recover the punctuation in it.
  • punctuation marks can be appropriately added to the text, thereby realizing accurate punctuation recovery or punctuation verification/correction.
  • Input text may be entered into the model along with associated audio so that the input text can be appropriately punctuated based on both. If the input to the model is text only, meaning the text has no audio, a virtual sample can be generated and fed into the model along with the text, and punctuation is added to the input text on that basis.
  • The audio information of the text input into the punctuation model can be obtained, for example, according to an indicator of whether the text input has audio. Where the indicator indicates that the text has audio, the punctuation model can obtain the text's audio, for example input together with the text or fetched from a predetermined storage location; where the indicator indicates that the text has no audio, such as a single text, a virtual audio sample can be generated and input into the punctuation model. The punctuation model then predicts punctuation marks, from which punctuation-recovered text is obtained.
  • This process can be performed in a similar manner as the model training process described above, such as the above three stages of processing, which will not be described in detail here.
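  • Tying the stages together, an end-to-end inference sketch might look as follows; all component names are hypothetical stand-ins for the modules described above, not names from the source:

```python
import torch

# Hypothetical components: lexical encoder, acoustic processing, virtual
# embedding, coordination bootstrapper, and classifier, as described above.
def punctuate(text, audio, lexical_encoder, acoustic_encoder,
              virtual_embedding, bootstrapper, classifier) -> torch.Tensor:
    h_l = lexical_encoder(text)               # stage 1: lexical embedding sequence H_l
    if audio is not None:
        h_a = acoustic_encoder(audio)         # real audio -> acoustic embedding H_a
    else:
        h_a = virtual_embedding(h_l.size(0))  # no audio -> virtual embedding H_i
    h_h = bootstrapper(h_l, h_a)              # stage 2: multimodal mixed representation
    return classifier(h_h).argmax(dim=-1)     # stage 3: predicted punctuation labels
```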
  • FIG. 4A shows a punctuation model training device 400 according to some embodiments of the present disclosure.
  • the device 400 includes a sample acquisition unit 401 configured to acquire text samples and corresponding audio samples for model training, wherein for text samples obtained from text without audio, the corresponding audio samples are virtual samples; and a training unit 402, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
  • the training unit 402 is further configured to perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multi-modal mixed representation, whereby the training unit 402 can be based on The generated multimodal hybrid representations are used for model training.
  • The above-mentioned multimodal mixed representation can be generated by a mixed representation generation unit 4021. Although not shown, in an exemplary implementation model training can be performed by a component for punctuation prediction, such as a classifier or a similar model training component, which in this case may be included in the training unit.
  • The mixed representation generation unit 4021 may further include a text conversion unit 4022 configured to convert text samples into a lexical embedding sequence; an audio conversion unit 4023 configured to convert audio samples into an acoustic embedding sequence; and a joint processing unit 4024 configured to perform attention-based operations on the lexical embedding sequence and the acoustic embedding sequence respectively, and to combine their processing results to generate the multimodal mixed representation.
  • The text conversion unit and audio conversion unit may alternatively be located outside the mixed representation generation unit, or even outside the training unit; that is, text conversion and audio conversion may be performed as preprocessing of samples, so that the inputs actually used for model training are the converted samples.
  • The text conversion unit 4022 and the audio conversion unit 4023 may at least correspond to the text encoder and the acoustic processing component described above, respectively, and the joint processing unit 4024 may correspond to the coordination bootstrapper described above. The mixed representation generation unit 4021 may at least include or correspond to the coordination bootstrapper, and may also include the lexical encoder and the acoustic processing component. The training unit 402 may at least include or correspond to modules/devices for all processing stages shown in FIG. 3, such as the lexical encoder, acoustic processing component, coordination bootstrapper, and classifier described above.
  • FIG. 4B shows a punctuation recovery device 410 according to some embodiments of the present disclosure.
  • Apparatus 410 includes an acquisition unit 411 configured to acquire the text output of speech recognition, and a punctuation recovery unit 412 configured to apply the punctuation model trained according to the training method of an embodiment of the present disclosure to the acquired text output to recover the punctuation in it.
  • The punctuation recovery unit here can perform operations/processing similar to the above-mentioned training unit; for example, it can include the above-mentioned text encoder, acoustic processing component, coordination bootstrapper, classifier, and so on.
  • each of the above units may be implemented as an independent physical entity, or may also be implemented by a single entity (for example, a processor (CPU or DSP, etc.), an integrated circuit, etc.).
  • the above-mentioned units are shown with dotted lines in the drawings to indicate that these units may not actually exist, and the operations/functions realized by them may be realized by the processing circuit itself.
  • The device may further include a memory that can store various information generated in operation by the device and its units, programs and data for operations, data to be transmitted by a communication unit, and the like.
  • the memory can be volatile memory and/or non-volatile memory.
  • memory may include, but is not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read only memory (ROM), flash memory.
  • the device may also include a communication unit, which may be used to communicate with other devices.
  • the communication unit may be implemented in an appropriate manner known in the art, for example including communication components such as antenna arrays and/or radio frequency links, various types of interfaces, communication units and the like. It will not be described in detail here.
  • the device may also include other components not shown, such as a radio frequency link, a baseband processing unit, a network interface, a processor, a controller, and the like. It will not be described in detail here.
  • This disclosure proposes the training and application of an improved punctuation model for punctuation recovery, which can use both single text data and paired text-audio data for training, thereby solving the problem of insufficient paired text and audio data. Predefined vectors are used instead of audio as input when there is no audio, and available audio data can assist training, further improving the training effect and the accuracy of the resulting model.
  • The present disclosure can also utilize multilingual data for collaborative training, in particular by using data from high-resource languages to enhance the model's performance on low-resource languages, thereby obtaining a further optimized punctuation model.
  • plain text data, text and audio aligned data, and data in different languages can be trained together, which improves the performance of the model on plain text data, text and audio aligned data, and minority language data.
  • such a model can be suitable for various application tasks, especially for speech recognition tasks, and achieve better punctuation recovery effects.
  • the UniPunc according to the scheme of the present disclosure is tested to compare the performance with the multimodal model in the related art.
  • The tests show that UniPunc according to embodiments of the present disclosure outperforms related-art multimodal models on an English audio set. Moreover, its performance on a mixed English set is better than on the English audio set alone, which shows that UniPunc still outperforms even when sentences without audio are included; that is, the performance of the punctuation model according to the present disclosure can be further improved by adopting non-audio sentences.
  • the UniPunc according to the present disclosure is tested to compare performance with the single-modal model in the related art.
  • The tests show that UniPunc according to embodiments of the present disclosure can effectively obtain multimodal mixed representations and effectively represent the acoustic features in speech, which significantly improves the punctuation recovery of text.
  • UniPunc has better punctuation performance than other baselines on multilingual punctuation, which shows that UniPunc has better robustness and generalization ability.
  • the UniPunc of the present disclosure can be well adapted to any existing punctuation recovery method.
  • By integrating the effective modules of the disclosed UniPunc framework, especially the acoustic auxiliary module and/or the coordination bootstrapper, into existing punctuation recovery schemes, the performance of those schemes can be further optimized.
  • The UniPunc scheme of the present disclosure, especially modules such as the acoustic auxiliary module and/or the coordination bootstrapper, is generally applicable to the modality-missing problem in punctuation recovery and enables previously single-modal models to handle multimodal corpora, further improving overall performance.
  • FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
  • The electronic device 5 can be various types of devices, such as, but not limited to, mobile terminals like mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • the electronic device 5 may include a display panel for displaying data and/or execution results utilized in the solutions according to the present disclosure.
  • the display panel can be in various shapes, such as a rectangular panel, an oval panel, or a polygonal panel, and the like.
  • the display panel can be not only a flat panel, but also a curved panel, or even a spherical panel.
  • the electronic device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51 .
  • The components of the electronic device 5 shown in FIG. 5 are exemplary rather than limiting, and the electronic device 5 may have other components according to actual application requirements.
  • Processor 52 may control other components in electronic device 5 to perform desired functions.
  • memory 51 is used to store one or more computer readable instructions.
  • processor 52 is used to execute computer-readable instructions, the computer-readable instructions are executed by the processor 52 to implement the method according to any of the above-mentioned embodiments.
  • For the specific implementation and related explanation of each step of the method, reference may be made to the above-mentioned embodiments; repeated descriptions are omitted here.
  • processor 52 and the memory 51 may communicate with each other directly or indirectly.
  • processor 52 and memory 51 may communicate via a network.
  • a network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network.
  • the processor 52 and the memory 51 may also communicate with each other through the system bus, which is not limited in the present disclosure.
  • The processor 52 can be embodied as various suitable processors or processing devices, such as a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
  • the central processing unit (CPU) may be an X86 or ARM architecture or the like.
  • memory 51 may include any combination of various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the memory 51 may include, for example, a system memory, and the system memory stores, for example, an operating system, an application program, a boot loader (Boot Loader), a database, and other programs. Various application programs, various data, and the like can also be stored in the storage medium.
  • FIG. 6 is a block diagram illustrating an example structure of a computer system employable in some embodiments of the present disclosure.
  • a central processing unit (CPU) 601 executes various processes according to programs stored in a read only memory (ROM) 602 or loaded from a storage section 608 to a random access memory (RAM) 603 .
  • the central processing unit is only exemplary, and it may also be other types of processors, such as the various processors mentioned above.
  • ROM 602, RAM 603 and storage portion 608 may be various forms of computer-readable storage media, as described below. It should be noted that although ROM 602, RAM 603 and storage device 608 are shown separately in FIG. 6, one or more of them may be combined or located in the same or different memory or storage modules.
  • the CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604.
  • the input/output interface 605 is also connected to the bus 604 .
  • The following components are connected to the input/output interface 605: an input section 606, such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; an output section 607, including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage section 608, including a hard disk, a magnetic tape, etc.; and a communication section 609, including a network interface card such as a LAN card, a modem, and the like.
  • The communication section 609 allows communication processing to be performed via a network such as the Internet. Although FIG. 6 shows each device or module in the electronic device 600 communicating through the bus 604, they may also communicate through a network or other means, where the network may include a wireless network, a wired network, and/or any combination thereof.
  • a driver 610 is also connected to the input/output interface 605 as needed.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as necessary.
  • the programs constituting the software can be installed from a network such as the Internet or a storage medium such as the removable medium 611 .
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing a method according to some embodiments of the present disclosure.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • the computer program is executed by the CPU 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • A computer-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
  • a computer-readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein.
  • Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • a computer program including: instructions, which when executed by a processor cause the processor to execute the method of any one of the above embodiments.
  • instructions may be embodied as computer program code.
  • The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • modules, components or units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of a module, component or unit does not constitute a limitation on the module, component or unit itself under certain circumstances.
  • Exemplary hardware logic components include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
  • a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from text without audio, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples for model training.
  • the text samples and corresponding audio samples used for model training may be obtained from both audio text and non-audio text, where pairs of text samples and audio samples are obtained from audio text , and from the audio-free text, a text sample is obtained and a dummy sample is obtained as an audio sample corresponding to the text sample.
  • A virtual sample is a preset sample representing fictitious audio for a text sample.
  • training a model for punctuation recovery in speech recognition may include: generating a mixed representation based on the acquired text samples and audio samples used for model training, and performing model training based on the mixed representation.
  • the hybrid representation may be a multimodal hybrid representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training.
  • attention-based processing may be performed on audio samples for model training based on textual information associated with the text samples.
  • the text information related to the text sample may be a vocabulary feature obtained by converting the text sample or a processing result obtained by performing attention-based processing on the text sample.
  • Text samples are converted into a sequence of lexical embeddings, audio samples are converted into a sequence of acoustic embeddings, and attention-based operations are performed on the lexical and acoustic embedding sequences respectively.
  • The acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; where the audio sample is a virtual sample, the virtual sample is used as the acoustic embedding sequence.
  • The attention-based processing performed on text samples used for model training may be self-attention-based processing, and the attention-based processing performed on audio samples used for model training may be cross-attention-based processing.
  • the acquired text samples and corresponding audio samples for model training may include multilingual text samples and audio samples.
  • multilingual text samples and audio samples may be equalized to increase the proportion of low-resource language samples.
  • a method for recovering punctuation in speech recognition, including the following steps: obtaining the text output of speech recognition, and applying, to the obtained text output, the punctuation model trained by the model training method described in any embodiment of the present disclosure to recover the punctuation in the text output.
  • a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein for text samples obtained from non-audio text, the corresponding audio samples are virtual samples; and a training unit configured to train a model for speech recognition punctuation recovery based on the acquired text samples and audio samples for model training.
  • the training unit may be further configured to: perform attention-based processing based on the acquired text samples and audio samples for model training to generate a multimodal hybrid representation.
  • the apparatus may further comprise: a text conversion unit configured to convert text samples into a vocabulary embedding sequence, an audio conversion unit configured to convert audio samples into an acoustic embedding sequence, and the training unit are configured to perform attention-based operations on sequences of lexical embeddings and sequences of acoustic embeddings, respectively.
  • a device for recovering punctuation in speech recognition, including: an acquisition unit configured to acquire the text output of speech recognition, and a punctuation recovery unit configured to apply, to the acquired text output, the punctuation model trained according to the model training method described in any embodiment of the present disclosure to recover the punctuation in the text output.
  • an electronic device including: a memory; and a processor coupled to the memory, the memory storing instructions, the instructions, when executed by the processor, Making the electronic device execute the method of any embodiment described in the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of any embodiment described in the present disclosure.
  • a computer program including: instructions, which when executed by a processor cause the processor to perform the method of any embodiment described in the present disclosure.
  • a computer program product comprising instructions which, when executed by a processor, implement the method of any one of the embodiments described in the present disclosure.
  • a computer program, wherein program code included in the computer program, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.

Abstract

A method for punctuation recovery in speech recognition, an electronic device (5), and a storage medium. A method for training a model for punctuation recovery in speech recognition comprises the following steps: acquiring a text sample for model training and a corresponding audio sample, wherein, for a text sample that is obtained from audio-free text, the corresponding audio sample is a virtual sample (S101); and training a model for punctuation recovery in speech recognition on the basis of the acquired text sample for model training and the corresponding audio sample (S102).

Description

Method, Device and Storage Medium for Punctuation Recovery in Speech Recognition
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese Application No. 202111335102.5, filed on November 11, 2021, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to speech recognition, including punctuation recovery in speech recognition.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. It has a wide range of applications and can serve as an upstream component for multiple tasks, such as voice assistants and speech translation. Existing commercial speech recognition systems often output text without punctuation, which may lead to misunderstandings and degrade the performance of downstream tasks such as machine translation and information extraction. Specifically, on the one hand, text without punctuation is difficult to read: it has poor readability, unclear sentence breaks, and ambiguity. On the other hand, downstream tasks such as machine translation and information extraction assume that the input is punctuated, so text without punctuation can lead to performance degradation in those tasks.
Summary
This Summary is provided to introduce, in a simplified form, concepts that are described in detail later in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
According to some embodiments of the present disclosure, there is provided a method for training a model for punctuation recovery in speech recognition, comprising the following steps: acquiring text samples and corresponding audio samples for model training, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
According to some other embodiments of the present disclosure, there is provided a training device for a model for punctuation recovery in speech recognition, including: a sample acquisition unit configured to acquire text samples and corresponding audio samples for model training, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and a training unit configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, there is provided an electronic device, including: a memory; and a processor coupled to the memory, the processor being configured to execute the method of any embodiment described in the present disclosure based on instructions stored in the memory.
According to some embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, causing the method of any embodiment described in the present disclosure to be implemented.
According to some embodiments of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, cause the method of any embodiment described in the present disclosure to be implemented.
According to some embodiments of the present disclosure, there is provided a computer program, wherein program code included in the computer program, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
Other features, aspects, and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief Description of the Drawings
Preferred embodiments of the present disclosure are described below with reference to the accompanying drawings. The drawings described herein are provided for a further understanding of the present disclosure; they are incorporated in and form a part of this specification together with the following detailed description and serve to explain the present disclosure. It should be understood that the drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure. In the drawings:
FIGS. 1A to 1C show flowcharts of a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
FIG. 2 shows a flowchart of a method for recovering punctuation in speech recognition according to some embodiments of the present disclosure.
FIG. 3 shows an exemplary implementation of model training for punctuation recovery in speech recognition according to some embodiments of the present disclosure.
FIG. 4A shows a block diagram of a training device for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure, and FIG. 4B shows a block diagram of a punctuation recovery device according to some embodiments of the present disclosure.
FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure.
FIG. 6 shows a block diagram of other embodiments of an electronic device of the present disclosure.
It should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not necessarily drawn to scale. The same or similar reference numerals are used in the drawings to denote the same or similar components. Therefore, once an item is defined in one drawing, it may not be discussed further in subsequent drawings.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The following descriptions of the embodiments are merely illustrative and in no way limit the present disclosure or its application or use. It should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, method implementations may include additional steps and/or omit the illustrated steps. The scope of the present disclosure is not limited in this regard. Unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments should be construed as merely exemplary and not limiting the scope of the present disclosure.
The term "including" and its variants as used in the present disclosure are open terms meaning that at least the following elements/features are included, without excluding other elements/features, i.e., "including but not limited to". In addition, the term "comprising" and its variants as used in the present disclosure are likewise open terms meaning that at least the following elements/features are comprised, without excluding other elements/features, i.e., "comprising but not limited to". Thus, "including" is synonymous with "comprising". The term "based on" means "based at least in part on".
Reference throughout this specification to "one embodiment", "some embodiments", or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. For example, the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; and the term "some embodiments" means "at least some embodiments". Moreover, appearances of the phrases "in one embodiment", "in some embodiments", or "in an embodiment" in various places throughout the specification do not necessarily all refer to the same embodiment, although they may.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or their interdependence. Unless otherwise specified, the concepts "first", "second", and the like are not intended to imply that the objects so described must be in a given order in time, space, ranking, or any other way.
It should be noted that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the data, messages, or information exchanged in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of such data, messages, or information.
In order to solve the problem of unpunctuated text output by speech recognition, a punctuation recovery task has been proposed and corresponding punctuation models have been designed. The punctuation recovery task aims to correctly add punctuation marks to the output of an automatic speech recognition system, such as an output text sentence, and in implementation performs punctuation recovery by employing a related punctuation model. Conventional punctuation models either use only text or require the audio corresponding to the text, and they are often limited by real-world scenarios.
On the one hand, conventional punctuation models perform punctuation recovery based only on textual information. However, text information alone can be ambiguous, which easily causes problems. A sentence may take on different meanings when different punctuation marks are added. For example, the meaning of "I don't want anymore kids" is obviously different from that of "I don't want anymore, kids", which shows the importance of the comma; inappropriate insertion can lead to obvious ambiguity or errors. In addition, since the speaker's tone is unknown, the model may have difficulty determining whether a sentence should end with a period or a question mark.
On the other hand, considering the rich information in speech audio, such as pauses and intonation, the sound signal can help reduce the ambiguity faced by punctuation models. Some multimodal punctuation models have therefore been proposed, which extract acoustic features from speech audio and fuse acoustic and lexical features by addition/concatenation to facilitate the addition of punctuation marks. Here, the modalities may relate to text and audio, respectively. However, conventional multimodal methods face the problem of missing modalities in practical applications. First, the corresponding audio is sometimes inaccessible due to storage limitations or privacy policies, and previous multimodal approaches cannot perform punctuation recovery on such audio-free text sentences. Second, manually annotated text-audio pairs are very costly and difficult to obtain, which makes the training sets of these multimodal models scarcer and makes it difficult to obtain a good multimodal model.
In view of this, the present disclosure proposes an improved solution that can use both audio text and audio-free text for punctuation model training. In particular, noting that audio-free text is easy to obtain, the solution of the present disclosure can make full use of audio-free text to construct a large training set for punctuation model training and provide further optimized training with the help of limited audio, thereby improving the accuracy of the punctuation model and providing a punctuation model capable of using optional audio to enhance the punctuation recovery effect on text.
In particular, the present disclosure proposes a unified multimodal punctuation framework, which may be called UniPunc, capable of using both audio text and audio-free text for punctuation model training. Specifically, for audio text, a corresponding text-audio pair can be obtained, while for audio-free text, the corresponding audio can be fabricated by constructing virtual content, so that a corresponding text-audio pair is likewise constructed. In this way, punctuation model training can be performed based on text-audio pairs obtained from both audio text and audio-free text. A large amount of readily available audio-free text/corpora can thus be used as the training set, and sound input can be used to reduce ambiguity in punctuation marks, so that the resulting punctuation model has improved accuracy.
In addition, the solution of the present disclosure can further improve the training and application of punctuation models in multilingual scenarios. Conventional multimodal methods can only handle monolingual input, while data in some languages is difficult to obtain and very scarce. For example, for many low-resource languages it is difficult to obtain large amounts of high-quality text and audio data, which cannot meet the needs of model training, and training directly on such data leads to very poor model performance. In view of this, the present disclosure uses a multilingual co-training method to solve the problem of insufficient data for low-resource languages. In particular, the present disclosure can use audio and text in multiple languages simultaneously for model training, so that readily available data in common languages can be used to strengthen the performance on low-resource languages, thereby obtaining an improved multilingual punctuation model.
Embodiments of the present disclosure will be described in detail below with reference to the drawings, but the present disclosure is not limited to these specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. In addition, in one or more embodiments, particular features, structures, or characteristics may be combined in any suitable manner that will be apparent to those of ordinary skill in the art from the present disclosure.
FIG. 1A shows a training method for a model for punctuation recovery in speech recognition according to some embodiments of the present disclosure. In method 100, in step S101, text samples and corresponding audio samples for model training are acquired, wherein, for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and in step S102, a model for punctuation recovery in speech recognition is trained based on the acquired text samples and corresponding audio samples for model training.
According to some embodiments of the present disclosure, the text samples and corresponding audio samples used for model training substantially correspond to a set of paired text and audio samples, wherein for each text sample a corresponding audio sample is acquired, which is either the audio content corresponding to the content of the text sample or a virtual audio sample fabricating audio for the text sample.
According to some embodiments of the present disclosure, the text samples and corresponding audio samples for model training are acquired from both audio text and audio-free text, for example derived from an initial set containing both. Paired text samples and audio samples can be acquired from audio text, for example by conversion/extraction/segmentation in various appropriate ways, which will not be described in detail here. Furthermore, from audio-free text, a text sample is acquired (such a text sample may be referred to as a single text sample) and a virtual sample is acquired as the audio sample corresponding to that text sample. In this way, on the one hand, a large amount of plain text can be exploited in the absence of audio; on the other hand, acoustic features can be effectively exploited when audio is present. A training set can therefore be constructed from large quantities of readily available audio-free text/single text samples; in particular, for a large number of single text samples, text-audio sample pairs can still be constructed by using virtual samples, which helps optimize the model training effect.
In some embodiments, the virtual sample may be a preset sample for fabricating the audio of a text sample. The virtual sample may take various appropriate forms. In particular, it may be a vector with preset content, for example a vector of fixed length and fixed content, preferably an all-zero vector. Of course, it may also take other appropriate forms. In some embodiments, virtual samples may be constructed for all single text samples used for model training, and these virtual samples may be the same as or different from each other.
In some embodiments, the initial set containing both audio text and audio-free text may be obtained in any appropriate way. For example, it may come from a conventional database, such as an existing training database in a storage device; or it may be acquired with an appropriate type of acquisition device, for example a microphone for audio data, or a video device or an appropriate text input device for text data; or it may consist of user-input text and/or manually annotated audio. In particular, audio text and audio-free text may be stored in any appropriate manner for acquiring text samples and audio samples. For example, in some embodiments, each text is associated with indication information of whether it has audio, such as an indicator in binary form, so that the model training device can determine from this indication information whether to generate a virtual sample for the text sample, for example generating a virtual sample only when the indication information indicates that the text sample has no audio.
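As a purely illustrative, non-limiting sketch of such sample acquisition, the following Python fragment assumes a record layout with a binary "has_audio" indicator and an all-zero virtual sample; the field names, dimensions, and the load_features stub are hypothetical and not part of the disclosure.

```python
import numpy as np

VIRTUAL_LEN = 4    # assumed fixed length of the virtual sample
VIRTUAL_DIM = 80   # assumed acoustic feature dimension

def load_features(path):
    """Stand-in for a real acoustic front end; returns dummy features."""
    return np.random.randn(100, VIRTUAL_DIM).astype(np.float32)

# Hypothetical record layout: a binary "has_audio" indicator is stored
# alongside each text, as described above.
corpus = [
    {"text": "how are you doing today", "has_audio": True,  "audio": "clip_001.wav"},
    {"text": "i dont want anymore kids", "has_audio": False, "audio": None},
]

def get_audio_sample(record):
    """Return real audio features, or an all-zero virtual sample when
    the indicator says the text has no audio."""
    if record["has_audio"]:
        return load_features(record["audio"])
    return np.zeros((VIRTUAL_LEN, VIRTUAL_DIM), dtype=np.float32)

pairs = [(r["text"], get_audio_sample(r)) for r in corpus]
```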
According to some embodiments of the present disclosure, training the punctuation model for punctuation recovery in speech recognition may include generating a multimodal hybrid representation based on the acquired text samples and audio samples for model training, as in step S1021 in FIG. 1B, and performing model training based on the generated multimodal hybrid representation, as in step S1022 in FIG. 1B. In this way, the acquired multimodal hybrid representation can contain text and audio information obtained from a training sample set effectively created using both audio text and audio-free text, and such a multimodal hybrid representation can support model training well. In some embodiments, for each text sample and audio sample pair used for model training, a corresponding multimodal hybrid representation is obtained, whereby multimodal hybrid representations of all text samples and audio samples can be obtained.
According to some embodiments of the present disclosure, the multimodal hybrid representation may be obtained by performing processing based on the text samples and the audio samples separately and combining the text-based processing result with the audio-based processing result. In some embodiments, the hybrid representation is generated based on the sum of the processing results of the text samples and the audio samples. It should be noted that the hybrid representation may also be generated by other appropriate processing of the processing results of the text samples and the audio samples, such as a weighted sum or other appropriate mathematical operations, which will not be described in detail here.
According to some embodiments of the present disclosure, the processing performed based on the text samples and the audio samples may be attention-based processing; thus, the multimodal hybrid representation may be a multimodal hybrid representation obtained by performing attention-based processing based on the acquired text samples and audio samples for model training. In particular, attention-based processing is performed based on the text samples and the audio samples separately, and the processing results of the text samples and the audio samples are combined to obtain the multimodal hybrid representation.
According to some embodiments of the present disclosure, the attention-based processing applied to the text samples and the audio samples may be any appropriate processing, and preferably the two differ from each other. In some embodiments, a self-attention mechanism is used to process the text samples. In some embodiments, a cross-attention mechanism is used to process the audio samples. The self-attention mechanism and the cross-attention mechanism may adopt various appropriate architectures/algorithms/models, which will not be described in detail here. The multimodal hybrid representation may thus be generated by combining the self-attention processing result of the text samples with the cross-attention processing result of the audio samples.
According to some embodiments of the present disclosure, text information related to the text sample is also taken into account when performing attention-based processing on the audio samples used for model training. That is, in the processing of an audio sample, the text information related to the text sample corresponding to that audio sample is also used as an input parameter of the processing. In some embodiments, the text information related to the text sample may be features converted/extracted from the text sample, or a processing result obtained by performing attention-based processing, such as self-attention processing, on the text sample.
According to some embodiments of the present disclosure, training the model for punctuation recovery in speech recognition based on the text samples and corresponding audio samples for model training further includes performing punctuation model training based on intermediate feature values/sequences converted from the text samples and the audio samples. In some embodiments, the text samples and the audio samples may each be converted to obtain intermediate feature values/sequences, and a multimodal hybrid representation is then generated based on the intermediate feature values/sequences converted from the text samples and the audio samples, for use in punctuation model training. Preferably, attention-based processing may be performed on the intermediate feature values/sequences converted from the text samples and the audio samples to generate the multimodal hybrid representation, and punctuation model training is performed based on the multimodal hybrid representation, as shown in FIG. 1C. The attention-based processing performed on the intermediate feature values/sequences may be carried out as described above and will not be described in detail here.
In some embodiments, the intermediate feature values/sequences may take any appropriate form and may be acquired through corresponding operations.
According to some embodiments of the present disclosure, a lexical embedding sequence may be obtained from a text sample by conversion, as its intermediate feature value/sequence. In some embodiments, lexical encoding may be performed on the text sample to acquire the lexical embedding sequence. It should be noted that the encoding of text samples may be handled in various appropriate ways, for example by a pre-trained lexical encoder or by various encoding methods known in the art, which will not be described in detail here. In some embodiments, self-attention processing may be performed on the lexical embedding sequence. In some embodiments, the lexical embedding sequence, or the result of self-attention processing of the lexical embedding sequence, may also be applied, as the text information related to the text sample, to the attention-based processing of the audio sample, for example as an input parameter of the cross-attention processing of the audio sample.
According to some embodiments of the present disclosure, an audio sample may be processed to acquire acoustic embedding content as its intermediate feature value/sequence. In some embodiments, the processing of the audio sample may be implemented in various appropriate ways. In some embodiments, the processing of the audio sample may include encoding and/or downsampling the audio sample. In some embodiments, the corresponding audio sample may be processed according to whether the text sample has audio. For example, for an audio sample derived from audio text, encoding and/or downsampling may be performed on it to generate the acoustic embedding content, while a virtual sample corresponding to audio-free text may be used directly as the acoustic embedding content. In implementation, indication information about whether the text sample has audio, such as an indicator in binary form, may be input into the model training device together with the text sample and the audio sample, so that the model training device can process the audio sample corresponding to the text sample accordingly for model training. In some embodiments, attention-based processing, such as cross-attention processing, may be performed on the acoustic embedding content.
According to some embodiments of the present disclosure, training the model for punctuation recovery in speech recognition may include performing punctuation learning/prediction based on the acquired text samples and audio samples for model training, thereby carrying out punctuation model training. In some embodiments, punctuation learning/prediction may be based on the multimodal hybrid representation described above, so as to perform punctuation model training. In some embodiments, a classifier may be employed for punctuation learning/prediction. The classifier may take various appropriate forms, such as a linear classifier (Linear+Softmax), or of course any other appropriate form, which will not be described in detail here.
In this way, the solution of the present disclosure can use both single text data and paired text-audio data for training to solve the problem of insufficient paired text-audio data. In particular, for single text data without corresponding audio content, virtual samples are used to fabricate audio as its corresponding audio data for model training, so that a large amount of plain text data can be used for training, while audio data can also be effectively used to enhance the model training effect.
According to some embodiments of the present disclosure, the text and audio samples used to train the punctuation model may include multilingual text and audio samples. In some embodiments, multilingual text and audio samples for model training may be acquired as described above; in particular, for a text sample in each language, its corresponding audio sample may be acquired, which is either a sample of the text sample's content in audio form or a virtual sample. An improved punctuation model can thus be trained, which can perform punctuation recovery more accurately in multilingual scenarios.
As an example, in multilingual settings there is often the problem of insufficient data for low-resource languages. For many low-resource languages it is difficult to obtain large amounts of high-quality text data, and training directly on a small amount of low-resource data leads to very poor model performance. Multilingual simultaneous training makes it possible to use readily available data in common languages to strengthen the performance on low-resource languages. In this way, a sample library for multilingual model training can be effectively constructed; by training simultaneously with multilingual samples, a more accurate punctuation model suitable for multiple languages can be obtained.
According to some embodiments of the present disclosure, the multilingual text and audio samples used for model training may be equalized, so as to further optimize the samples for multilingual model training. In some embodiments, text samples and/or audio samples of languages with a low proportion among the text and audio samples, for example samples of languages that are not easy to obtain, minority languages, or low-resource languages, may be expanded so that the proportions of text and/or audio of each language used for model training are balanced; in particular, the proportion of text samples and/or audio samples of low-proportion languages is increased. In some embodiments, data expansion may be carried out in various appropriate ways; for example, operations such as repetition and interpolation may be performed on text samples and/or audio samples of low-proportion languages. In some embodiments, multilingual text samples and/or audio samples may also be processed by performing temperature sampling or a similar sampling algorithm to achieve equalization. As an example, given the scarcity of low-resource language data, the temperature sampling method is used to increase the proportion of low-resource language data and reduce the proportion of high-resource language data according to the data volume of each language. This makes the proportions of the various languages in the total multilingual data as balanced as possible and can further optimize the model training effect; for example, the trained model shows improved punctuation recovery for texts in all the languages.
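As a purely illustrative sketch of the temperature-sampling equalization mentioned above, the following computes per-language sampling probabilities proportional to (n_i/N)^(1/T); the language counts and the temperature value are hypothetical, not taken from the disclosure.

```python
def temperature_sampling_weights(counts, temperature=5.0):
    """Per-language sampling probabilities p_i proportional to (n_i/N)**(1/T).
    A higher temperature flattens the distribution, raising the share of
    low-resource languages relative to their raw proportions."""
    total = sum(counts.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Illustrative data volumes per language (not from the disclosure).
counts = {"en": 1_000_000, "de": 200_000, "mn": 5_000}
print(temperature_sampling_weights(counts))
# The low-resource language "mn" receives a far larger sampling share
# than its raw ~0.4% proportion of the data.
```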
An exemplary implementation of punctuation model training according to some embodiments of the present disclosure will be described in detail below with reference to FIG. 3. The solution of the present disclosure receives text input and optional audio input to perform the punctuation recovery task, i.e., to add punctuation to text. The text input and optional audio input here may be the aforementioned text samples and audio samples for training, and may also be the aforementioned multilingual text samples and audio samples.
The solution of the present disclosure mainly comprises three stages of operation: the first stage processes the text input and the audio input separately; the second stage performs attention-based processing on the processed text input and audio input, respectively, so that the audio information is fused into the text representation to obtain a multimodal hybrid representation; and the third stage performs punctuation prediction based on the multimodal hybrid representation, thereby carrying out model training. The training process of the present disclosure uses the classic SGD algorithm and the Adam optimizer to optimize a cross-entropy loss function.
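A minimal, non-limiting sketch of such a training loop is given below, assuming PyTorch; the model object, the batch layout, and the learning rate are hypothetical stand-ins for an actual implementation.

```python
import torch
import torch.nn as nn

def train(model, loader, num_epochs=1, lr=1e-4):
    """Punctuation model training with the Adam optimizer and a cross-entropy
    loss, per the three-stage description above. `model` maps a
    (tokens, audio) batch to per-token punctuation logits."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(num_epochs):
        for tokens, audio, labels in loader:   # labels: one punctuation tag per token
            logits = model(tokens, audio)      # (batch, seq_len, num_classes)
            loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```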
Punctuation recovery is usually modeled as a sequence labeling task. Typically, a multimodal punctuation corpus is a set of sentence-audio-punctuation triples, denoted as S = {x, a, y}. Here x is an unpunctuated sentence of length T, and a is the corresponding speech audio. The output of the model should be the predicted punctuation sequence y, given x and a. Owing to the nature of the sequence labeling task, the punctuation sequence y has the same length as the unpunctuated sentence x, i.e., |x| = |y|. The punctuation marks may be any appropriate punctuation marks available for text; for example, they may be of four types, namely comma (,), period (.), question mark (?), and no punctuation.
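To make the formulation concrete, one hypothetical sentence-audio-punctuation triple under the four-type scheme might look as follows (the tag names are illustrative):

```python
# One sentence-audio-punctuation triple S = {x, a, y}, with |x| == |y|.
x = ["i", "dont", "want", "anymore", "kids"]     # unpunctuated tokens
a = "utterance_042.wav"                          # corresponding speech audio
y = ["NONE", "NONE", "NONE", "COMMA", "PERIOD"]  # one punctuation tag per token
# Reading the tags back yields: "I don't want anymore, kids."
```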
In the first stage, the text input may be encoded. The encoding here may be performed in various appropriate ways; in particular, the unpunctuated text sentence may be split into a subword sequence and converted into a lexical embedding sequence H_l. As an example, a pre-trained natural language processing (NLP) model may be used as the backbone model to build the lexical encoder, and the lexical encoder model may be fine-tuned on task-specific data. In some embodiments, the encoding of text may be performed by a text encoder, which may be included in the solutions according to embodiments of the present disclosure.
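As one purely illustrative way to realize this stage, the sketch below uses the Hugging Face transformers library with a multilingual BERT checkpoint as the pre-trained backbone; the disclosure does not mandate this particular model or library.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def lexical_embeddings(sentence: str) -> torch.Tensor:
    """Split an unpunctuated sentence into subwords and return the
    lexical embedding sequence H_l, of shape (seq_len, hidden_dim)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)

H_l = lexical_embeddings("how are you doing today")
```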
In the first stage, the audio input may also be processed; in particular, the audio input may be converted into an acoustic embedding H_a or a virtual embedding H_i, depending on the type of the audio input. In some embodiments, the audio processing may be performed by various appropriate acoustic processing components, which may be included in the solutions according to embodiments of the present disclosure.
In particular, audio input that is an audio representation of the content of the text input, such as an audio-annotated transcription, may be converted into the acoustic embedding H_a. Specifically, the audio input may be converted into acoustic features by a pre-trained acoustic model/acoustic feature extractor. Typically, the acoustic feature extractor may first be pre-trained by self-supervised training on unlabeled audio datasets, and may be fine-tuned for the downstream punctuation recovery task. A downsampling network may then be applied to shorten the length of the extracted acoustic features, yielding the acoustic embedding content H_a. The goal is to make the length of the acoustic embedding close or equal to that of the sentence embedding, so that the model can better align the cross-modal information. As an example, a multi-layer convolutional network may be chosen as the core component of the downsampling network.
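A minimal sketch of such a convolutional downsampling network follows, assuming PyTorch; the kernel sizes, strides, and feature dimensions are illustrative choices rather than values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class Downsampler(nn.Module):
    """Two strided 1-D convolutions that shorten a sequence of acoustic
    features by roughly 4x, bringing its length closer to the sentence length."""
    def __init__(self, in_dim=768, out_dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, in_dim) -> H_a: (batch, frames // 4, out_dim)
        return self.conv(feats.transpose(1, 2)).transpose(1, 2)

H_a = Downsampler()(torch.randn(1, 200, 768))  # -> shape (1, 50, 768)
```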
For audio-free text, a virtual embedding H_i is used to fabricate the possibly missing acoustic features; that is, if the audio is absent (a = ∅), then H_a = H_i. As an example, the virtual embedding H_i may be set as a fixed-length array of learnable parameters, which is expected to learn a representation of the missing audio. That is to say, the virtual audio sample corresponding to audio-free text as described above may be a predetermined vector sequence, for example an all-zero sequence, and the virtual embedding may be derived from it, for example by using it directly as the virtual embedding, or by shortening its length so that the length of the virtual embedding is close or equal to that of the sentence embedding, the shortened sequence then serving as the virtual embedding.
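A non-limiting sketch of a fixed-length learnable virtual embedding is shown below, assuming PyTorch; the length and dimension are illustrative.

```python
import torch
import torch.nn as nn

class VirtualEmbedding(nn.Module):
    """Fixed-length learnable array H_i that stands in for missing audio;
    it is trained jointly with the rest of the model (all-zero init)."""
    def __init__(self, length=4, dim=768):
        super().__init__()
        self.H_i = nn.Parameter(torch.zeros(length, dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        # The same learned virtual embedding is shared by every audio-free sample.
        return self.H_i.unsqueeze(0).expand(batch_size, -1, -1)
```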
Then, based on the lexical embedding sequence and the acoustic or virtual embedding, the acoustic and lexical features may be combined to generate the multimodal hybrid representation. This operation may be performed by a coordinate bootstrapper, which can jointly train on audio-free and audio text to overcome the missing-modality problem. In particular, the coordinate bootstrapper jointly exploits acoustic and lexical features and applies attention-based operations to learn a hybrid representation of the two modalities.
Specifically, a self-attention operation is first performed on the lexical embedding sequence H_l to capture long-range dependencies S_l in the unpunctuated sentence, and a cross-attention operation is applied between the lexical embedding sequence H_l and the acoustic embedding sequence H_a to form a cross-modal representation S_a:
S_l = Att(H_l, H_l, H_l)    (1)
S_a = Att(H_a, H_a, H_l)    (2)
Here, Att(K, V, Q) = softmax(QK^T / √d_k) · V is an attention operation, where d_k is the dimension size of the model; the query is taken from the last argument, so that the outputs of equations (1) and (2) both have the sentence length. Note that for modality-missing samples, the acoustic embedding H_a is replaced with the virtual embedding H_i; in this case, if a = ∅, then S_a = Att(H_i, H_i, H_l).
Then, the hybrid representation H_h is obtained by adding the attention-processed representations and the residual connection:
H_h = S_l + S_a + H_l    (3)
In implementation, the coordinate bootstrapper can be stacked into multiple layers to further increase model capacity.
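As a purely illustrative sketch of one coordinate bootstrapper layer implementing equations (1) to (3), the following assumes PyTorch and single-head attention for brevity; a multi-head variant and stacked layers would follow the same pattern.

```python
import math
import torch
import torch.nn as nn

def att(K, V, Q):
    """Att(K, V, Q) = softmax(Q K^T / sqrt(d_k)) V; the output always
    has the query's (sentence) length."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ V

class CoordinateBootstrapper(nn.Module):
    """One layer combining lexical and acoustic features per eqs. (1)-(3)."""
    def forward(self, H_l, H_a):
        S_l = att(H_l, H_l, H_l)   # (1) self-attention over the sentence
        S_a = att(H_a, H_a, H_l)   # (2) cross-attention: text queries audio
        return S_l + S_a + H_l     # (3) hybrid representation H_h

H_l = torch.randn(1, 12, 768)      # lexical embeddings (T = 12)
H_a = torch.randn(1, 50, 768)      # acoustic (or virtual) embeddings
H_h = CoordinateBootstrapper()(H_l, H_a)   # -> shape (1, 12, 768)
```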
Finally, prediction is made from the hybrid representation through an output classifier layer. The output classifier layer consists of a linear projection and a softmax activation function, and H_h is input into the classifier layer to predict the punctuation sequence ŷ.
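A sketch of such a Linear+Softmax output layer follows, again assuming PyTorch; the four classes correspond to the punctuation types named earlier. In training, the cross-entropy loss would typically be computed on the pre-softmax logits.

```python
import torch
import torch.nn as nn

class PunctuationClassifier(nn.Module):
    """Linear projection + softmax over the four punctuation classes
    (no punctuation, comma, period, question mark) for every token."""
    def __init__(self, dim=768, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, H_h: torch.Tensor) -> torch.Tensor:
        return self.proj(H_h).softmax(dim=-1)   # (batch, T, num_classes)

probs = PunctuationClassifier()(torch.randn(1, 12, 768))
y_hat = probs.argmax(dim=-1)   # predicted punctuation tag per token
```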
In this way, the representations of samples with audio and samples without audio share the same embedding space. The model can therefore receive mixed data in the same training batch, and the trained model is able to punctuate both audio text and audio-free text.
It should be noted that, according to some embodiments of the present disclosure, the aforementioned text encoder, acoustic processing component, and coordinate bootstrapper may all be included as sub-modules/frameworks in the model training device according to embodiments of the present disclosure, i.e., the UniPunc framework. In this way, the UniPunc framework according to the present disclosure provides a universal framework for addressing missing modalities in the multimodal punctuation recovery task. In addition, the solution according to the present disclosure can be effectively applied to some current punctuation models and serve as a beneficial supplement to them. As an example, the acoustic processing component and the coordinate bootstrapper according to the present disclosure may be applied/added to current punctuation models as a modification, so that they are improved to handle modality-missing samples.
In addition, the above example of model training is likewise applicable to multilingual scenarios; in particular, the model can be trained using data and audio in different languages at the same time. Specifically, data in multiple different languages, including text data and/or audio data, is used as input to perform the processing of the stages in the solution according to the present disclosure described above; for example, audio in each language is input into the audio component, and texts in different languages can uniformly use the lexical encoder, so that a performance-enhanced model suitable for multilingual scenarios can be obtained.
Moreover, although not shown, before performing the above processing on text and audio, equalization may be performed on the multilingual input, and the above processing, for example the three stages of processing described above, may then be performed on the equalized text and audio of each language. This will not be described in detail here.
The punctuation model according to some embodiments of the present disclosure may be used in various appropriate punctuation recovery applications. The punctuation model according to some embodiments of the present disclosure can have good universality and can be applied to any speech recognition system; for example, the text output by a speech recognition system can be further processed to optimize its punctuation recovery. For example, punctuation marks can be added to text to which no punctuation has been added, or text to which punctuation has already been added can be verified for further correction.
FIG. 2 shows a flowchart of a punctuation recovery method according to some embodiments of the present disclosure. In method 200, in step S201, text to which punctuation is to be added, for example a speech recognition text output, is acquired, and in step S202, a punctuation model trained by the model training method according to the present disclosure is applied to the acquired text output to recover the punctuation in the text output. Punctuation marks can thereby be appropriately added to the text, realizing accurate punctuation recovery or punctuation verification/correction.
In some embodiments, the input text may be input into the model together with the related audio, so that punctuation can be appropriately added to the input text based on both. If only text is input into the model, meaning that the text has no audio, a virtual sample can be generated and input into the model together with the text, and punctuation is added to the input text on that basis. In other embodiments, audio information for the text input into the punctuation model may be acquired, for example according to an indicator of whether the text input has audio. Where the indicator indicates that the text has audio, the punctuation model may acquire the text audio, for example with the text audio input into the model together with the text or obtained from a predetermined storage location; where the indicator indicates that the text has no audio, for example in the case of single text, a virtual audio sample may be generated and input into the punctuation model. The punctuation model will then perform inference to predict the punctuation marks, whereby punctuation-recovered text can be obtained. This process may be performed in a similar manner to the model training process described above, for example the three stages of processing described above, which will not be described in detail here.
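An inference-time usage sketch consistent with this description might look as follows; the model object, the tokenization, and make_virtual_sample are hypothetical stand-ins for the trained components described above.

```python
PUNCT = {0: "", 1: ",", 2: ".", 3: "?"}  # no punctuation, comma, period, question mark

def make_virtual_sample(length=4, dim=80):
    """All-zero virtual audio sample for text that has no audio."""
    return [[0.0] * dim for _ in range(length)]

def restore_punctuation(model, text, audio=None):
    """Apply a trained punctuation model to a speech recognition text output;
    a virtual sample stands in when no audio is available."""
    tokens = text.split()                               # simplistic tokenization
    audio_input = audio if audio is not None else make_virtual_sample()
    tags = model(tokens, audio_input)                   # one class index per token
    return " ".join(tok + PUNCT[int(t)] for tok, t in zip(tokens, tags))

# Example (with a hypothetical trained `model`):
# restore_punctuation(model, "i dont want anymore kids")
# -> "i dont want anymore, kids."
```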
FIG. 4A shows a punctuation model training device 400 according to some embodiments of the present disclosure. The device 400 includes a sample acquisition unit 401, configured to acquire text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and a training unit 402, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
In some embodiments, the training unit 402 is further configured to perform attention-based processing on the acquired text samples and audio samples to generate a multimodal mixed representation, and the training unit 402 can then carry out model training based on the generated multimodal mixed representation. The multimodal mixed representation may be generated by a mixed representation generation unit 4021. Although not shown, in an exemplary implementation model training may be performed by a component for punctuation prediction, such as a classifier or a similar training component; in that case, such a component may be included in the training unit.
In some embodiments, the mixed representation generation unit 4021 may further include: a text conversion unit 4022, configured to convert text samples into a lexical embedding sequence; an audio conversion unit 4023, configured to convert audio samples into an acoustic embedding sequence; and a joint processing unit 4024, configured to perform attention-based operations on the lexical embedding sequence and the acoustic embedding sequence respectively, and to combine the processing results of the two sequences to generate the multimodal mixed representation. It should be noted that, although not shown, the text conversion unit and the audio conversion unit need not be included in the mixed representation generation unit, or even in the training unit; that is, text conversion and audio conversion may be applied to the samples before model training, in which case the samples actually input for model training are the converted samples.
Each of the above units can be implemented in various appropriate ways. In an exemplary implementation, the text conversion unit 4022 and the audio conversion unit 4023 may correspond at least to the text encoder and the acoustic processing component described above, respectively, and the joint processing unit 4024 may correspond to the coordination guider described above. The mixed representation generation unit 4021 may include or correspond to at least the coordination guider, and may further include the lexical encoder and the acoustic processing component described above. The training unit 402 may include or correspond to at least the modules/devices of all the processing stages shown in FIG. 3, for example the lexical encoder and acoustic processing component, the coordination guider, and the classifier described above.
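As a rough illustration of how units 4022-4024 might be realized, the sketch below wires a lexical encoder (self-attention over text), an acoustic input, and a cross-attention "coordination guider" into one hybrid representation in PyTorch. All module choices, dimensions, and the additive combination are assumptions made for illustration, not the implementation fixed by the present disclosure:

```python
import torch.nn as nn

class MixedRepresentation(nn.Module):
    """Sketch of units 4022-4024: lexical self-attention, acoustic input,
    and cross-attention coordination, combined into one hybrid representation."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.lexical_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)      # self-attention over text
        self.cross_attention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)      # text queries attend to audio

    def forward(self, lexical_emb, acoustic_emb):
        # lexical_emb: (B, T_text, d); acoustic_emb: (B, T_audio, d) or a virtual sample
        text_repr = self.lexical_encoder(lexical_emb)
        audio_repr, _ = self.cross_attention(
            query=text_repr, key=acoustic_emb, value=acoustic_emb)
        return text_repr + audio_repr                # simple additive combination
```

A punctuation classifier (for example, a per-token linear layer) would then be applied over the combined representation, corresponding to the prediction stage described above.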
FIG. 4B shows a punctuation recovery device 410 according to some embodiments of the present disclosure. The device 410 includes an acquisition unit 411, configured to acquire the text output of speech recognition, and a punctuation recovery unit 412, configured to apply a punctuation model, trained by the training method according to embodiments of the present disclosure, to the acquired text output so as to recover punctuation in the text output. The punctuation recovery unit may perform operations/processing similar to those of the training unit described above; for example, it may include the text encoder and acoustic processing component, the coordination guider, the classifier, and so on.
It should be noted that the above units are merely logical modules divided according to the specific functions they implement, and are not intended to limit the specific implementation; they may be implemented, for example, in software, in hardware, or in a combination of the two. In an actual implementation, each of the above units may be implemented as an independent physical entity, or may be implemented by a single entity (for example, a processor (CPU, DSP, etc.) or an integrated circuit). In addition, units shown with dashed lines in the drawings indicate that such units need not actually exist; the operations/functions they implement may be realized by the processing circuitry itself.
In addition, although not shown, the device may further include a memory, which may store various information generated in operation by the device and the units it contains, programs and data for operation, data to be transmitted by a communication unit, and so on. The memory may be volatile memory and/or non-volatile memory; for example, it may include, but is not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read-only memory (ROM), and flash memory. The memory may, of course, also be located outside the device. Optionally, although not shown, the device may also include a communication unit for communicating with other devices. In one example, the communication unit may be implemented in any appropriate manner known in the art, for example including communication components such as antenna arrays and/or radio-frequency links, various types of interfaces, communication units, and so on. The device may further include other components not shown, such as a radio-frequency link, a baseband processing unit, a network interface, a processor, and a controller. These are not described in detail here.
The present disclosure proposes improved training and application of a punctuation model for punctuation recovery. The model can be trained on single-modality text data and paired text-audio data at the same time, thereby addressing the shortage of paired text and audio data. In particular, by using a predefined vector in place of audio as input when no audio is available, a large amount of plain-text data can be used to build the training set, with audio data providing an auxiliary signal; this further improves the training effect and the accuracy of the trained model. In addition, the present disclosure can train cooperatively on multilingual data; in particular, languages with abundant data can be used to strengthen model performance on languages with little data, yielding a further optimized punctuation model. Plain-text data, text-audio aligned data, and data in different languages can thus be trained together, improving the model's performance on plain-text data, on text-audio aligned data, and on low-resource languages. Moreover, such a model is suitable for various application tasks, especially speech recognition, and achieves better punctuation recovery.
The effectiveness of the disclosed solution is further demonstrated below with examples.
For the training and test datasets, experiments are conducted mainly on two real-world corpora: MuST-C and Multilingual TEDx (mTEDx), whose audio is sourced from TED talks. Two datasets were constructed from these corpora: 1) English-Audio: this set contains the English audio and sentences in MuST-C, with audio for every sample. 2) English-Mixed: this set contains all English sentences from both corpora, both with and without audio. Note that English-Audio is a subset of English-Mixed.
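A minimal sketch of assembling such sets follows, under the simplifying assumption that one iterable yields (sentence, audio) pairs and another yields audio-free sentences; both interfaces are hypothetical, and `None` marks sentences whose audio will later be replaced by the virtual sample:

```python
def build_datasets(pairs_with_audio, sentences_without_audio):
    """pairs_with_audio: iterable of (sentence, audio) pairs, e.g. from MuST-C.
    sentences_without_audio: iterable of plain sentences (assumed interfaces)."""
    english_audio = list(pairs_with_audio)
    english_mixed = english_audio + [(s, None) for s in sentences_without_audio]
    return english_audio, english_mixed  # English-Audio is a subset of English-Mixed
```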
On the above datasets, UniPunc, the solution according to the present disclosure, was tested against multimodal models in the related art. The tests show that UniPunc according to embodiments of the present disclosure outperforms the related-art multimodal models on the English-Audio set, and that its performance on the English-Mixed set exceeds its performance on the English-Audio set. This demonstrates that even without audio-free sentences, the disclosed UniPunc outperforms existing multimodal models, and that incorporating audio-free sentences further improves the performance of the disclosed punctuation model.
On the English-Mixed dataset, UniPunc according to the present disclosure was tested against single-modality models in the related art. The tests show that UniPunc effectively obtains a multimodal mixed representation that captures the acoustic features of speech, which markedly improves punctuation recovery for text.
In addition, tests on the multilingual data of mTEDx show that the punctuation recovered by UniPunc according to embodiments of the present disclosure is closer to human punctuation, better distinguishing the pauses marked by commas and periods and the intonation of questions. Furthermore, UniPunc achieves better punctuation performance than the other baselines on multilingual punctuation, indicating better robustness and generalization.
Moreover, UniPunc of the present disclosure can be readily applied to any existing punctuation recovery method. In particular, by introducing the effective modules of the UniPunc framework, especially the acoustic auxiliary module and/or the coordination guider, into an existing punctuation recovery scheme, the performance of that scheme can be further optimized. Experiments show that the UniPunc scheme, and in particular modules such as the acoustic auxiliary module and/or the coordination guider, is generally applicable to the problem of missing modalities in punctuation recovery and enables previous single-modality models to handle multimodal corpora, further improving overall performance.
Some embodiments of the present disclosure also provide an electronic device operable to implement the operations/functions of the aforementioned model training device and/or punctuation recovery device. FIG. 5 shows a block diagram of some embodiments of an electronic device of the present disclosure. For example, in some embodiments, the electronic device 5 can be any of various types of devices, including but not limited to mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. For example, the electronic device 5 may include a display panel for displaying data and/or execution results used in the solutions according to the present disclosure. The display panel may have various shapes, such as rectangular, oval, or polygonal, and may be not only flat but also curved, or even spherical.
As shown in FIG. 5, the electronic device 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51. It should be noted that the components of the electronic device 5 shown in FIG. 5 are exemplary rather than limiting; the electronic device 5 may have other components according to actual application requirements. The processor 52 may control other components in the electronic device 5 to perform desired functions.
In some embodiments, the memory 51 is used to store one or more computer-readable instructions. When executed by the processor 52, the computer-readable instructions implement the method according to any of the above embodiments. For the specific implementation of each step of the method and the related explanations, reference may be made to the above embodiments; repeated details are not described again here.
For example, the processor 52 and the memory 51 may communicate with each other directly or indirectly, for example over a network, which may include a wireless network, a wired network, and/or any combination of the two. The processor 52 and the memory 51 may also communicate with each other through a system bus; the present disclosure places no limitation on this.
For example, the processor 52 may be embodied as any suitable processor or processing device, such as a central processing unit (CPU), a graphics processing unit (GPU), or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The CPU may have, for example, an X86 or ARM architecture. For example, the memory 51 may include any combination of various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The memory 51 may include, for example, system memory storing an operating system, application programs, a boot loader, a database, and other programs. Various application programs, various data, and the like may also be stored in the storage medium.
In addition, according to some embodiments of the present disclosure, when the various operations/processing of the present disclosure are implemented in software and/or firmware, the programs constituting that software may be installed from a storage medium or a network onto a computer system having a dedicated hardware structure, such as the computer system 600 shown in FIG. 6; when the various programs are installed, the computer system can perform various functions, including those described above. FIG. 6 is a block diagram showing an example structure of a computer system employable in some embodiments of the present disclosure.
In FIG. 6, a central processing unit (CPU) 601 executes various processing according to programs stored in a read-only memory (ROM) 602 or loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, data required when the CPU 601 executes the various processing. The central processing unit is merely exemplary; it may also be another type of processor, such as the various processors described above. The ROM 602, RAM 603, and storage section 608 may be various forms of computer-readable storage media, as described below. It should be noted that although the ROM 602, RAM 603, and storage section 608 are shown separately in FIG. 6, one or more of them may be combined or located in the same or different memories or storage modules.
The CPU 601, ROM 602, and RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606, such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, or gyroscope; an output section 607, including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage section 608, including a hard disk, magnetic tape, and the like; and a communication section 609, including a network interface card such as a LAN card or a modem. The communication section 609 allows communication processing to be performed via a network such as the Internet. It is easy to understand that although FIG. 6 shows the devices or modules in the computer system 600 communicating over the bus 604, they may also communicate over a network or by other means, where the network may include a wireless network, a wired network, and/or any combination of the two.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read from it can be installed into the storage section 608 as needed.
When the above series of processing is implemented in software, the programs constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 611.
According to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing methods according to some embodiments of the present disclosure. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 609, installed from the storage section 608, or installed from the ROM 602. When the computer program is executed by the CPU 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that, in the context of the present disclosure, a computer-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in combination with an instruction execution system, apparatus, or device. By contrast, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to electric wires, optical cables, RF (radio frequency), or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
In some embodiments, a computer program is also provided, comprising instructions that, when executed by a processor, cause the processor to perform the method of any of the above embodiments. For example, the instructions may be embodied as computer program code.
In embodiments of the present disclosure, computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules, components, or units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module, component, or unit does not, in some cases, constitute a limitation on the module, component, or unit itself.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
According to some embodiments of the present disclosure, a method for training a model for punctuation recovery in speech recognition is proposed, comprising the following steps: acquiring text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and training a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples.
In some embodiments, the text samples and corresponding audio samples used for model training may be acquired from both text with audio and text without audio: from text with audio, paired text samples and audio samples are acquired, and from text without audio, a text sample is acquired and a virtual sample is acquired as the audio sample corresponding to that text sample.
In some embodiments, the virtual sample is a predefined sample serving as fictitious audio for the text sample.
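For illustration, a minimal sketch of this acquisition step follows, assuming the virtual sample is a fixed predefined vector (zeros here; the dimensionality is an arbitrary illustrative choice):

```python
import numpy as np

# Hypothetical predefined stand-in for the audio of an audio-free text sample.
VIRTUAL_SAMPLE = np.zeros((1, 512), dtype=np.float32)

def acquire_samples(texts_with_audio, texts_without_audio):
    """Yield (text, audio) training pairs from both kinds of source text."""
    for text, audio in texts_with_audio:      # paired text and audio
        yield text, audio
    for text in texts_without_audio:          # audio-free text: use the virtual sample
        yield text, VIRTUAL_SAMPLE
```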
In some embodiments, training the model for punctuation recovery in speech recognition may include: generating a mixed representation based on the acquired text samples and audio samples, and training the model based on that mixed representation.
In some embodiments, the mixed representation may be a multimodal mixed representation obtained by performing attention-based processing on the acquired text samples and audio samples.
In some embodiments, attention-based processing may be performed on the audio samples used for model training based on text information related to the text samples.
In some embodiments, the text information related to a text sample may be a lexical feature obtained by converting the text sample, or a processing result obtained by performing attention-based processing on the text sample.
In some embodiments, text samples are converted into lexical embedding sequences, audio samples are converted into acoustic embedding sequences, and attention-based operations are performed on the lexical embedding sequences and the acoustic embedding sequences respectively.
In some embodiments, when an audio sample is the audio form of the text content of the corresponding text sample, the acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; when the audio sample is a virtual sample, the virtual sample itself is used as the acoustic embedding sequence.
In some embodiments, the attention-based processing performed on the text samples used for model training may be self-attention-based processing, and the attention-based processing performed on the audio samples used for model training may be cross-attention-based processing.
In some embodiments, the acquired text samples and corresponding audio samples for model training may include multilingual text samples and audio samples.
In some embodiments, the multilingual text samples and audio samples may be equalized to increase the proportion of low-resource language samples.
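One common way to realize such equalization is temperature-based resampling over the per-language corpus sizes; the following sketch is an illustrative assumption rather than the specific method of the present disclosure:

```python
def equalized_sample_counts(lang_sizes, temperature=0.5):
    """lang_sizes: dict mapping language -> number of samples.
    Returns target sample counts whose distribution is flattened so that
    low-resource languages receive a larger share (temperature < 1)."""
    total = sum(lang_sizes.values())
    weights = {lang: (n / total) ** temperature for lang, n in lang_sizes.items()}
    norm = sum(weights.values())
    return {lang: round(total * w / norm) for lang, w in weights.items()}

# Example: a 100k-sentence language and a 1k-sentence language; the smaller
# language's share rises from about 1% to about 9% of the sampled data.
# counts = equalized_sample_counts({"en": 100_000, "pt": 1_000})
```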
According to some embodiments of the present disclosure, a method for punctuation recovery in speech recognition is proposed, comprising the following steps: acquiring the text output of speech recognition, and applying a punctuation model, trained by the model training method of any embodiment described in the present disclosure, to the acquired text output to recover punctuation in the text output.
According to other embodiments of the present disclosure, a training device for a model for punctuation recovery in speech recognition is proposed, comprising: a sample acquisition unit, configured to acquire text samples and corresponding audio samples for model training, where for a text sample obtained from audio-free text the corresponding audio sample is a virtual sample; and a training unit, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and audio samples.
In some embodiments, the training unit may be further configured to perform attention-based processing on the acquired text samples and audio samples to generate a multimodal mixed representation.
In some embodiments, the device may further include: a text conversion unit, configured to convert text samples into lexical embedding sequences; and an audio conversion unit, configured to convert audio samples into acoustic embedding sequences, with the training unit configured to perform attention-based operations on the lexical embedding sequences and the acoustic embedding sequences respectively.
According to other embodiments of the present disclosure, a device for punctuation recovery in speech recognition is proposed, comprising: an acquisition unit, configured to acquire the text output of speech recognition, and a punctuation recovery unit, configured to apply a punctuation model, trained by the model training method of any embodiment described in the present disclosure, to the acquired text output to recover punctuation in the text output.
According to still other embodiments of the present disclosure, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method of any embodiment described in the present disclosure.
According to still other embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method of any embodiment described in the present disclosure is implemented.
According to still other embodiments of the present disclosure, a computer program is provided, comprising instructions that, when executed by a processor, cause the processor to perform the method of any embodiment described in the present disclosure.
According to some embodiments of the present disclosure, a computer program product is provided, comprising instructions that, when executed by a processor, implement the method of any embodiment described in the present disclosure.
According to some embodiments of the present disclosure, a computer program is provided whose program code, when executed by a computer, causes the method of any embodiment described in the present disclosure to be implemented.
The above description is merely of some embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved herein is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the disclosed concept, for example technical solutions formed by substituting the above features with technical features of similar functions disclosed in (but not limited to) the present disclosure.
Many specific details are set forth in the description provided herein. However, it is understood that embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, separately or in any suitable sub-combination.
Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (20)

  1. A method for training a model for punctuation recovery in speech recognition, comprising the following steps:
    acquiring text samples and corresponding audio samples for model training, wherein for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and
    training a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
  2. The method according to claim 1, wherein the text samples and corresponding audio samples used for model training are acquired from both text with audio and text without audio,
    wherein paired text samples and audio samples are acquired from text with audio, and, from text without audio, a text sample is acquired and a virtual sample is acquired as the audio sample corresponding to that text sample.
  3. The method according to claim 1 or 2, wherein the virtual sample is a predefined sample serving as fictitious audio for the text sample.
  4. The method according to claim 1, wherein training the model for punctuation recovery in speech recognition comprises:
    generating a multimodal mixed representation based on the acquired text samples and audio samples, and
    training the model based on the multimodal mixed representation.
  5. The method according to claim 4, wherein the multimodal mixed representation is obtained by performing attention-based processing based on the acquired text samples and audio samples.
  6. The method according to claim 5, wherein attention-based processing is performed on the audio samples used for model training based on text information related to the text samples.
  7. The method according to claim 6, wherein the text information related to a text sample is a lexical feature obtained by converting the text sample, or a processing result obtained by performing attention-based processing on the text sample.
  8. The method according to claim 1, wherein:
    text samples are converted into lexical embedding sequences,
    audio samples are converted into acoustic embedding sequences,
    attention-based operations are performed on the lexical embedding sequences and the acoustic embedding sequences respectively; and
    the operated lexical embedding sequences and acoustic embedding sequences are combined to generate a multimodal mixed representation.
  9. The method according to claim 8, wherein:
    when an audio sample is the audio form of the text content of the corresponding text sample, the acoustic embedding sequence is obtained based on acoustic features extracted from the audio sample; and
    when the audio sample is a virtual sample, the virtual sample is used as the acoustic embedding sequence.
  10. The method according to any one of claims 5-9, wherein the attention-based processing performed on the text samples used for model training is self-attention-based processing, and the attention-based processing performed on the audio samples used for model training is cross-attention-based processing.
  11. The method according to any one of claims 1-10, wherein the acquired text samples and corresponding audio samples for model training include multilingual text samples and corresponding audio samples.
  12. The method according to claim 11, wherein the multilingual text samples and corresponding audio samples are equalized to increase the proportion of low-resource language samples.
  13. A method for punctuation recovery in speech recognition, comprising the following steps:
    acquiring the text output of speech recognition, and
    applying a punctuation model trained by the method according to any one of claims 1-12 to the acquired text output to recover punctuation in the text output.
  14. A training device for a model for punctuation recovery in speech recognition, comprising:
    a sample acquisition unit, configured to acquire text samples and corresponding audio samples for model training, wherein for a text sample obtained from audio-free text, the corresponding audio sample is a virtual sample; and
    a training unit, configured to train a model for punctuation recovery in speech recognition based on the acquired text samples and corresponding audio samples.
  15. The device according to claim 14, wherein the training unit further comprises a mixed representation generation unit configured to:
    generate a multimodal mixed representation by performing attention-based processing based on the acquired text samples and audio samples.
  16. The device according to claim 15, wherein the training unit further comprises:
    a text conversion unit, configured to convert text samples into lexical embedding sequences,
    an audio conversion unit, configured to convert audio samples into acoustic embedding sequences, and
    a joint processing unit, configured to perform attention-based operations on the lexical embedding sequences and the acoustic embedding sequences respectively, and to combine the operated lexical embedding sequences and acoustic embedding sequences to generate a multimodal mixed representation.
  17. A device for punctuation recovery in speech recognition, comprising:
    an acquisition unit, configured to acquire the text output of speech recognition, and
    a punctuation recovery unit, configured to apply a punctuation model trained by the method according to any one of claims 1-12 to the acquired text output to recover punctuation in the text output.
  18. An electronic device, comprising:
    a memory; and
    a processor coupled to the memory, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1-12.
  19. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method according to any one of claims 1-12.
  20. A computer program product comprising instructions that, when executed by a processor, cause the method according to any one of claims 1-12 to be implemented.
PCT/CN2022/125163 2021-11-11 2022-10-13 Method for punctuation recovery in speech recognition, and device and storage medium WO2023082931A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111335102.5 2021-11-11
CN202111335102.5A CN114120975A (en) 2021-11-11 2021-11-11 Method, apparatus and storage medium for speech recognition punctuation recovery

Publications (1)

Publication Number Publication Date
WO2023082931A1 (en)

Family

ID=80378660

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125163 WO2023082931A1 (en) 2021-11-11 2022-10-13 Method for punctuation recovery in speech recognition, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114120975A (en)
WO (1) WO2023082931A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120975A (en) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 Method, apparatus and storage medium for speech recognition punctuation recovery
CN116229994B (en) * 2023-05-08 2023-07-21 北京爱数智慧科技有限公司 Construction method and device of label prediction model of Arabic language

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition
CN113450756A (en) * 2020-03-13 2021-09-28 Tcl科技集团股份有限公司 Training method of voice synthesis model and voice synthesis method
CN113488061A (en) * 2021-08-05 2021-10-08 国网江苏省电力有限公司 Distribution network dispatcher identity verification method and system based on improved Synth2Aug
WO2021215262A1 (en) * 2020-04-20 2021-10-28 株式会社Nttドコモ Punctuation mark delete model training device, punctuation mark delete model, and determination device
CN114120975A (en) * 2021-11-11 2022-03-01 北京有竹居网络技术有限公司 Method, apparatus and storage medium for speech recognition punctuation recovery

Also Published As

Publication number Publication date
CN114120975A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
WO2023082931A1 (en) Method for punctuation recovery in speech recognition, and device and storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US10592607B2 (en) Iterative alternating neural attention for machine reading
WO2022143058A1 (en) Voice recognition method and apparatus, storage medium, and electronic device
CN111489735B (en) Voice recognition model training method and device
WO2022037419A1 (en) Audio content recognition method and apparatus, and device and computer-readable medium
WO2022228041A1 (en) Translation model training method, apparatus, and device, and storage medium
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
US11132996B2 (en) Method and apparatus for outputting information
CN111563390B (en) Text generation method and device and electronic equipment
WO2023082916A1 (en) Training method, speech translation method, device and computer-readable medium
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
WO2022100481A1 (en) Text information translation method and apparatus, electronic device, and storage medium
WO2022237665A1 (en) Speech synthesis method and apparatus, electronic device, and storage medium
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
JP2023550211A (en) Method and apparatus for generating text
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN111933119B (en) Method, apparatus, electronic device, and medium for generating voice recognition network
CN111161724B (en) Method, system, equipment and medium for Chinese audio-visual combined speech recognition
CN115983294B (en) Translation model training method, translation method and translation equipment
WO2023138361A1 (en) Image processing method and apparatus, and readable storage medium and electronic device
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
WO2023011260A1 (en) Translation processing method and apparatus, device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891728

Country of ref document: EP

Kind code of ref document: A1