CN116705058B - Processing method of multimode voice task, electronic equipment and readable storage medium - Google Patents

Processing method of multimode voice task, electronic equipment and readable storage medium

Info

Publication number
CN116705058B
Authority
CN
China
Prior art keywords
vector sequence
voice
mode
task
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310977700.5A
Other languages
Chinese (zh)
Other versions
CN116705058A (en)
Inventor
孙建伟
文成
赵帅江
邹伟
韩阳
李先刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd filed Critical Seashell Housing Beijing Technology Co Ltd
Priority to CN202310977700.5A
Publication of CN116705058A
Application granted
Publication of CN116705058B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method for processing a multi-modal speech task, including: processing each speech frame of the speech portion in the multi-modal speech task to obtain a syllable vector sequence corresponding to the plurality of speech frames; splicing, based on the word order of the multi-modal speech task, a character vector sequence of the text portion in the task with the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension; and invoking a multi-modal language model to analyze the spliced vector sequence and generate interactive text for responding to the multi-modal speech task. The present disclosure also provides an electronic device and a readable storage medium.

Description

Processing method of multimode voice task, electronic equipment and readable storage medium
Technical Field
The disclosure relates to the technical field of deep learning, and in particular relates to a processing method of a multimode voice task, electronic equipment and a readable storage medium.
Background
Traditional processing of speech tasks, such as speech recognition and speech synthesis, typically relies on manually designed feature representations or on labeled samples for supervised learning; however, the need for large amounts of labeled data and for manual feature engineering limits the scope of application of speech task processing, as well as the performance of the models that perform these tasks.
In the related art, several unsupervised speech pre-training methods have been proposed that construct a speech model through unsupervised learning, reducing the dependence of model training on large numbers of labeled samples. However, a speech model constructed in this way focuses only on the speech-feature dimension, so it can only predict the next frame of speech data and can hardly meet the processing requirements of multi-modal speech tasks, such as processing spliced speech and text content.
Disclosure of Invention
To solve at least one of the foregoing problems, the present disclosure provides a method for processing a multi-mode voice task, an electronic device, and a readable storage medium.
According to one aspect of the present disclosure, there is provided a method for processing a multi-mode voice task, including: processing each speech frame of a speech portion in a multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames; based on the word order of the multi-mode voice task, splicing a character vector sequence of a text part in the multi-mode voice task and the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimensionality; and invoking a multi-mode language model to analyze the spliced vector sequence and generating an interactive text for responding to the multi-mode voice task.
In some embodiments, the processing each speech frame of the speech portion in the multimodal speech task to obtain a syllable vector sequence corresponding to a plurality of the speech frames includes: extracting voice characteristics of each voice frame, wherein the voice characteristics are at least used for representing semantic information and expression styles of the voice frames; mapping the voice frame to a corresponding clustering center based on the voice characteristics, and taking the code of the clustering center as a clustering label of the voice frame; performing dimension reduction processing on each voice frame according to the clustering labels to obtain a plurality of syllable vectors; and arranging the syllable vectors according to the language sequence of the voice part to form the syllable vector sequence.
In some embodiments, the invoking the multimodal language model to analyze the stitched vector sequence to generate interactive text for responding to the multimodal speech task includes: invoking the multimode language model to analyze the spliced vector sequence, and predicting an interactive character vector sequence required by responding to the spliced vector sequence; and performing character reduction on the interactive character vector sequence according to the expression style of the multimode voice task to generate the interactive text for responding to the multimode voice task.
In some embodiments, further comprising: processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and arranging the character vectors according to the word order of the text part to form the character vector sequence.
In some embodiments, further comprising: training a language model with a plurality of sample vector sequences to construct the multimodal language model for processing a plurality of language forms, wherein the sample vector sequences include at least syllable sample vector sequences, character sample vector sequences, and concatenation sample vector sequences, and the language forms include at least a speech form, a text form, and a phonetic text concatenation form.
In some embodiments, the training the language model with multiple sample vector sequences to construct the multimodal language model for processing multiple language forms includes: respectively performing unsupervised training on the language model by using a syllable sample vector sequence and a character sample vector sequence to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence; splicing the syllable sample vector sequence and the character sample vector sequence according to a target language order to obtain a spliced sample vector sequence; invoking the single-mode language model to process the spliced sample vector sequence to obtain a processing result; and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as the multi-mode language model, wherein the multi-mode language model has the capability of processing the syllable vector sequence, the character vector sequence and the spliced vector sequence.
In some embodiments, further comprising: in response to the multimode voice task being in a voice form, invoking a multimode language model to analyze a syllable vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task; or in response to the multimode voice task being in a text form, invoking a multimode language model to analyze a character vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
According to another aspect of the present disclosure, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement a method for processing a multi-mode speech task according to any one of the embodiments.
According to a further aspect of the present disclosure there is provided a readable storage medium, characterized in that the readable storage medium stores a computer program adapted to be loaded by a processor for performing the method of processing a multimodal speech task as described in any of the embodiments above.
According to a further aspect of the present disclosure there is provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements a method of processing a multimodal speech task as described in any of the embodiments above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a method of processing a multimodal speech task in accordance with an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a processing architecture of a multimodal speech task in an exemplary embodiment of the disclosure.
Fig. 3 is a block diagram of a processing device for multimodal speech tasks in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
Fig. 1 is a flowchart of a method of processing a multimodal speech task in accordance with an exemplary embodiment of the present disclosure. The processing method S100 of the multi-mode voice task of the present disclosure will be described in detail with reference to fig. 1.
Step S102, processing each voice frame of the voice part in the multi-mode voice task to obtain syllable vector sequences corresponding to a plurality of voice frames.
A multi-modal speech task is data to be processed that a user enters in one or more language forms, such as speech, text, or spliced speech and text. Task processing methods in the related art are only suitable for predicting and extending text-form input and cannot handle input in other expression forms such as speech or spliced speech and text. The present method supports richer forms of input, so a user can submit multi-modal speech tasks, which improves the human-computer interaction experience and broadens the scenarios in which human-computer interaction can be applied. More specifically, depending on its language form, a multi-modal speech task may contain only a text portion, only a speech portion, or both.
The speech portion is the part of the multi-modal speech task that conveys information through sound, and it consists of a number of speech frames. Each speech frame expresses semantic information and presents an expression style shaped by the user's personal characteristics. In other words, each speech frame corresponds to speech features that are reflected at least in the semantic information it expresses and in the expression style it presents. In the present disclosure, speech frames serve as the discrete units of the speech portion, which facilitates processing its data.
To weaken the influence of the language form of the multi-modal speech task on subsequent processing, the data in the task need to be unified, including unifying their dimensionality and representation. To that end, the present disclosure converts each speech frame into a corresponding syllable vector and arranges the syllable vectors in the word order of the speech portion to form a syllable vector sequence.
Compared with a speech frame, a syllable vector carries lower-dimensional information, is easier to analyze and compute, and helps improve task processing efficiency. Likewise, the characters of the text portion in the multi-modal speech task are processed into corresponding character vectors whose dimensionality matches that of the syllable vectors, which makes the two easy to splice. Once the speech portion and the text portion are converted into vectors of the same dimension and form, later processing of the multi-modal speech task need not pay much attention to the language form of the input, so richer forms of input can be supported. Of course, other language forms also fall within the scope of the present disclosure, provided they can be converted into vectors of the same dimension and form.
Step S104, based on the word order of the multi-modal speech task, the character vector sequence of the text portion in the task is spliced with the syllable vector sequence to obtain a spliced vector sequence.
The word order represents the order of expression in the multi-modal speech task and generally follows the user's language habits and everyday human expression habits. Speech frames or characters arranged in a word order can express the interaction the user actually intends, so the multi-modal speech task is not simply a pile of speech frames or characters.
The spliced vector sequence is the result of splicing the character vector sequence and the syllable vector sequence according to the word order. It can fully express, in vector form, the semantics of the multi-modal speech task.
Of course, each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension, which ensures that the two sequences can be spliced.
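For illustration only, a minimal Python sketch of this splicing step is given below. It assumes the speech portion and the text portion have already been converted into vector sequences of the same dimension; the function name and segment layout are assumptions of the sketch, not details of the embodiment.
```python
# A minimal sketch of the splicing in step S104, assuming the text portion and
# the speech portion have already been converted into vector sequences of the
# same dimension; the function name and segment layout are assumptions.
import numpy as np

def splice_by_word_order(segments: list[tuple[str, np.ndarray]]) -> np.ndarray:
    """segments: (modality, vector_sequence) pairs already arranged in the word
    order of the multi-modal speech task, e.g.
    [("text", char_vectors), ("speech", syllable_vectors)]."""
    dims = {seq.shape[1] for _, seq in segments}
    if len(dims) != 1:
        raise ValueError(f"character and syllable vectors must share one dimension, got {dims}")
    # Concatenate along the sequence axis: the spliced vector sequence expresses
    # the whole task in vector form regardless of the language form of each part.
    return np.concatenate([seq for _, seq in segments], axis=0)
```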
Step S106, calling the multi-modal language model to analyze the spliced vector sequence and generating interactive text for responding to the multi-modal speech task.
The multi-modal language model is a neural network model that analyzes the spliced vector sequence. It has self-learning capability: through training it learns human expression habits, so it can analyze, predict, and extend input data and output feedback results, achieving the goal of human-computer interaction. Unlike language models in the related art, which can only analyze input in a single form such as text or speech, the multi-modal language model analyzes vector sequences corresponding to data in various language forms without distinction.
The interactive text is the final output of the multi-modal language model; it is generated based on the personalized expression characteristics contained in the multi-modal speech task and conforms to human communication habits. The content of the interactive text meets the feedback requirement of the multi-modal speech task and addresses what the task asks for, such as writing articles, answering questions, or everyday conversation.
Fig. 2 is a schematic diagram of a processing architecture of a multimodal speech task in an exemplary embodiment of the disclosure. The method of processing a multi-modal speech task will be described in more detail below in conjunction with fig. 2.
In some embodiments, step S102 is implemented as follows: extract the speech features of each speech frame; based on the speech features, map each speech frame to a corresponding cluster center and take the code of that cluster center as the frame's cluster label; perform dimension reduction on each speech frame according to its cluster label to obtain a set of syllable vectors; and arrange the syllable vectors in the word order of the speech portion to form the syllable vector sequence.
The speech features may be, for example, mel-spectrogram features or the like; they contain the semantic information of the speech frames and the user's expression style, expressed concretely as frequency information, timbre information, pitch information, semantic content, and so on.
The cluster centers are speech categories obtained by partitioning the speech portion according to its features, and each cluster center can correspond to multiple speech frames. For example, if the speech portion has 100 speech frames and 10 cluster centers are set, then each speech frame can be mapped to a cluster center according to its speech features, so that frames belonging to the same cluster center exhibit the same or similar speech features. In addition, each cluster center has a category code, and different cluster centers have different codes to distinguish them.
The cluster label is an identifier of the category to which a speech frame belongs, and frames mapped to the same cluster center share the same label. For example, the category code of the cluster center to which a speech frame is mapped may serve as its cluster label; the label may also be set to other content, which is not enumerated here.
Mapping each speech frame to its corresponding cluster center can be realized with a k-means algorithm, so that the speech frames in the same cluster have the same or similar speech features.
For the dimension reduction, an embedding method can convert the cluster label of each speech frame into a discretized token (namely, a syllable vector) with the same dimension as a character vector. Syllable vectors are numerical representations of the speech features of the frames; they are convenient for model training and analysis and for splicing input content of different language forms.
Of course, a token corresponding to a speech frame and a token corresponding to a character only share the same representation form and dimension; their output nodes are independent, and the features they characterize are independent. For example, the speech portion may have ten cluster centers numbered 1 to 10 and the text portion five cluster centers numbered 11 to 15; each cluster center characterizes different features, different language forms may have different numbers of cluster centers, and the two sets of cluster centers are independent of each other.
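For illustration only, the following Python sketch shows one possible realization of step S102 under stated assumptions: per-frame speech features are already available, k-means provides the cluster centers, and a learned embedding table is approximated here by a random one. The cluster count, embedding dimension, and function names are illustrative, not values from the embodiment.
```python
# A minimal sketch of step S102, assuming per-frame speech features (e.g. a
# mel-spectrogram-like vector per frame) are already available. The cluster
# count, embedding dimension, and function names are illustrative assumptions,
# and the randomly initialised embedding table stands in for a learned one.
import numpy as np
from sklearn.cluster import KMeans

def frames_to_syllable_vectors(frame_features: np.ndarray,
                               n_clusters: int = 10,
                               embed_dim: int = 256,
                               seed: int = 0) -> np.ndarray:
    """frame_features: (num_frames, feature_dim) speech features, one row per frame.
    Returns a (num_frames, embed_dim) syllable vector sequence in frame order."""
    # Map each speech frame to a cluster centre (k-means, as suggested above);
    # the index of the centre serves as the frame's cluster label.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(frame_features)
    # Dimension reduction: look each label up in an embedding table so the frame
    # becomes a discretised token with the same width as a character vector.
    rng = np.random.default_rng(seed)
    embedding_table = rng.normal(size=(n_clusters, embed_dim)).astype(np.float32)
    # The rows come out in the original frame order, i.e. the word order of the
    # speech portion, so the result is already the syllable vector sequence.
    return embedding_table[labels]
```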
In some embodiments, step S106 is implemented as follows: call the multi-modal language model to analyze the spliced vector sequence and predict the interactive character vector sequence needed to respond to it; and restore the interactive character vector sequence to characters according to the expression style of the multi-modal speech task, generating the interactive text used to respond to the task.
The interactive character vector sequence is the multi-modal language model's analysis result for the spliced vector sequence; its semantic content is the content of the interactive text, but it is output as a vector sequence. Both the input and the output of the multi-modal language model are thus vector sequences, which improves analysis efficiency and reduces the model's computational load.
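For illustration only, a sketch of step S106 follows. The multi-modal language model is assumed to be exposed as a callable that predicts the next interactive character vector, and character reduction is assumed to be available as a separate function; both interfaces are hypothetical.
```python
# A sketch of step S106 under stated assumptions: `model` is a callable that
# predicts the next interactive character vector from a vector-sequence prefix,
# and `restore_characters` maps a predicted vector sequence back to text in the
# expression style of the task. Both interfaces are hypothetical.
import numpy as np

def generate_interactive_text(model, spliced_vectors: np.ndarray,
                              restore_characters, max_steps: int = 64) -> str:
    context = spliced_vectors          # (seq_len, dim) spliced vector sequence
    predicted = []
    for _ in range(max_steps):
        next_vec = model(context)      # predict the next interactive character vector
        predicted.append(next_vec)
        context = np.vstack([context, next_vec])   # extend the context autoregressively
    # Character reduction: turn the interactive character vector sequence back
    # into human-readable interactive text.
    return restore_characters(np.stack(predicted))
```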
In some embodiments, the method for processing a multi-mode voice task further includes: processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and arranging the character vectors according to the word order of the text part to form a character vector sequence.
In some embodiments, the method for processing a multi-mode voice task further includes: the language model is trained with a plurality of sample vector sequences to construct a multimodal language model for processing a plurality of language forms.
The sample vector sequences include at least syllable sample vector sequences, character sample vector sequences, and spliced sample vector sequences; training the language model on these various types of samples makes it suitable for processing multi-modal speech tasks in various language forms. As noted above, the language forms include at least a speech form, a text form, and a spliced speech-text form.
Specifically, the process of constructing the multimodal language model is mainly embodied as: performing unsupervised training on the language model by using the syllable sample vector sequence and the character sample vector sequence respectively to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence; splicing the syllable sample vector sequence and the character sample vector sequence according to the target language order to obtain a spliced sample vector sequence; calling a single-mode language model to process the spliced sample vector sequence to obtain a processing result; and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as a multi-mode language model, wherein the multi-mode language model has the capability of processing syllable vector sequences, character vector sequences and spliced vector sequences.
In other words, to construct the multi-modal language model, unsupervised pre-training is first performed to build a single-mode language model; the weights of that model are then optimized using spliced sample vector sequences and manual prompts, so that it gains the ability to process input data in multiple language forms and becomes the multi-modal language model.
Unsupervised pre-training overcomes the drawback of needing large amounts of labeled sample data to train the language model and reduces the dependence of the multi-modal language model's construction on manual prompts.
More specifically, the single-mode language model can be trained using speech pre-training methods such as WavLM and HuBERT, mainly based on the speech features and the unsupervised cluster labels; the trained single-mode language model can map between speech features and cluster labels.
First, the character sample vector sequences are fed into a speech model to obtain a speech model capable of analyzing character vector sequences. Next, another speech model is initialized with the weights of that model; the syllable sample vector sequences are fed into it, and the model learns the relations between cluster labels in an autoregressive manner, predicting the next cluster label from the labels input so far. The predicted next cluster label is compared with the cluster label at the corresponding position in the syllable sample vector sequence, the result is fed back to the model, and the weights are adjusted. When the predicted cluster labels agree with the cluster labels at the corresponding positions of the syllable sample vector sequence, the model has demonstrably gained the ability to analyze syllable vector sequences, and the single-mode language model is obtained.
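For illustration only, the autoregressive, unsupervised pre-training on cluster-label sequences could look like the following PyTorch sketch. The model interface (a label prefix in, next-label logits out), the optimizer, and all hyperparameters are assumptions of this sketch rather than details of the embodiment.
```python
# A sketch of the autoregressive, unsupervised pre-training on cluster-label
# sequences. The model interface (a label prefix in, next-label logits out),
# the optimiser, and all hyperparameters are assumptions of this sketch.
import torch
import torch.nn as nn

def pretrain_on_label_sequences(model: nn.Module,
                                label_sequences: list,
                                epochs: int = 3, lr: float = 1e-4) -> nn.Module:
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for labels in label_sequences:            # labels: (seq_len,) LongTensor of cluster labels
            inputs, targets = labels[:-1], labels[1:]
            logits = model(inputs.unsqueeze(0))   # assumed shape: (1, seq_len - 1, n_labels)
            loss = loss_fn(logits.squeeze(0), targets)   # compare predicted next label with the true one
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()                      # feed the deviation back and adjust the weights
    return model
```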
Since vectors of different language forms characterize different features, the output nodes of the single-mode language model that predict labels for syllable vector sequences differ from those that predict labels for character vector sequences.
Further, the spliced sample vector sequences are used as input to continue training the single-mode language model on prediction, yielding the multi-modal language model. For example, suppose the sample task is "predict the speech 'Beijing issued a high-temperature warning today'", where "predict the speech '…'" is the text portion and "Beijing issued a high-temperature warning today" is the speech portion. The speech portion of the sample task is converted into a syllable sample vector sequence and the text portion into a character sample vector sequence; the two are spliced according to the word order of the sample task, the resulting spliced sample vector sequence is fed into the single-mode language model, and the model outputs an interactive character vector sequence related to the content of the speech portion. The interactive character vector sequence is then restored to interactive text in human language; if the content of the interactive text is the same as or similar to the expected output "Beijing issued a high-temperature warning today", the model is shown to be usable as a multi-modal language model for predicting spliced vector sequences; otherwise, the model weights must continue to be optimized. Note that the expected output may be supplied by manual prompting, which is not limited here.
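For illustration only, the weight adjustment on spliced sample vector sequences could be sketched as follows, assuming each spliced sample is paired with the expected interactive character labels and using cross-entropy as the deviation value; the threshold, learning rate, and model interface are illustrative assumptions.
```python
# A sketch of the weight adjustment on spliced samples, assuming each spliced
# sample vector sequence is paired with the expected interactive character
# labels and using cross-entropy as the deviation value. The threshold,
# learning rate, and model interface are illustrative assumptions.
import torch
import torch.nn as nn

def finetune_on_spliced_samples(model: nn.Module, samples: list,
                                threshold: float = 0.1, lr: float = 1e-5,
                                max_rounds: int = 100) -> nn.Module:
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_rounds):
        total = 0.0
        for spliced_vectors, expected_labels in samples:
            logits = model(spliced_vectors.unsqueeze(0)).squeeze(0)
            deviation = loss_fn(logits, expected_labels)  # deviation between processing result and expected result
            optimiser.zero_grad()
            deviation.backward()
            optimiser.step()
            total += deviation.item()
        if total / len(samples) <= threshold:     # stop once the deviation is at or below the preset threshold
            break
    return model
```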
In some embodiments, the method for processing a multi-mode voice task may further include: responding to the multimode voice task in a voice form, calling a multimode language model to analyze syllable vector sequences corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task; or in response to the multimode voice task being in a text form, invoking the multimode language model to analyze a character vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
With the processing method for multi-modal speech tasks described above, the content of the input task is vectorized and reduced in dimension, so that portions in different language forms are unified into a vector-sequence representation; this weakens the influence of the input's language form on the analysis result and broadens the scenarios in which speech tasks can be processed. In addition, a single-mode language model is built in an unsupervised manner and the multi-modal language model is generated on top of it, which removes the need for large amounts of manually annotated data and ensures that multi-modal speech tasks remain feasible.
Fig. 3 is a block diagram of a processing apparatus for multi-modal speech tasks in accordance with an exemplary embodiment of the present disclosure. As shown in fig. 3, the present disclosure proposes a processing apparatus 1000 for a multi-modal speech task, comprising: a syllable vector sequence extraction module 1002, configured to process each speech frame of the speech portion in the multi-modal speech task to obtain a syllable vector sequence corresponding to the plurality of speech frames; a splicing module 1004, configured to splice a character vector sequence of the text portion in the multi-modal speech task with the syllable vector sequence, based on the word order of the task, to obtain a spliced vector sequence, where each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension; and an interactive text generation module 1006, configured to call the multi-modal language model to analyze the spliced vector sequence and generate interactive text for responding to the multi-modal speech task.
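For illustration only, the division of labor among the three modules could be composed as in the following sketch, which reuses the hypothetical helper functions sketched earlier in this description (frames_to_syllable_vectors, splice_by_word_order, generate_interactive_text); the class and method names are illustrative.
```python
# A structural sketch of the apparatus in Fig. 3, reusing the hypothetical
# helpers sketched earlier in this description (frames_to_syllable_vectors,
# splice_by_word_order, generate_interactive_text). Class and method names
# are illustrative, not the reference implementation of the embodiment.
class MultiModalSpeechTaskProcessor:
    def __init__(self, model, restore_characters):
        self.model = model
        self.restore_characters = restore_characters

    def process(self, frame_features, char_vectors):
        # Module 1002: speech frames -> syllable vector sequence.
        syllable_vectors = frames_to_syllable_vectors(frame_features)
        # Module 1004: splice character and syllable vectors by word order.
        spliced = splice_by_word_order([("text", char_vectors),
                                        ("speech", syllable_vectors)])
        # Module 1006: analyse the spliced sequence and produce interactive text.
        return generate_interactive_text(self.model, spliced, self.restore_characters)
```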
Each module in the processing apparatus 1000 for the multi-modal speech task is provided to implement a corresponding step of the processing method described above, so their implementation principles and processes can refer to the foregoing and are not repeated here.
The apparatus 1000 may include corresponding modules that perform the steps of the flowcharts discussed above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied on a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when a software program is loaded into memory and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the method of the above embodiment may be implemented by a program to instruct related hardware, and the program may be stored in a readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiment.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (8)

1. A method for processing a multi-mode speech task, comprising:
processing each speech frame of a speech portion in a multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames;
based on the word order of the multi-mode voice task, splicing a character vector sequence of a text part in the multi-mode voice task and the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimensionality; and
calling a multimode language model to analyze the spliced vector sequence and generating an interactive text for responding to the multimode voice task;
training a language model by using a plurality of sample vector sequences to construct the multimode language model for processing a plurality of language forms, wherein the sample vector sequences at least comprise syllable sample vector sequences, character sample vector sequences and spliced sample vector sequences, and the language forms at least comprise a voice form, a text form and a voice and text splicing form;
wherein training a language model with a plurality of sample vector sequences to construct the multimodal language model for processing a plurality of language forms comprises:
respectively performing unsupervised training on the language model by using a syllable sample vector sequence and a character sample vector sequence to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence;
splicing the syllable sample vector sequence and the character sample vector sequence according to a target language order to obtain a spliced sample vector sequence;
invoking the single-mode language model to process the spliced sample vector sequence to obtain a processing result; and
and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as the multi-mode language model, wherein the multi-mode language model has the capability of processing the syllable vector sequence, the character vector sequence and the spliced vector sequence.
2. The method according to claim 1, wherein processing each speech frame of a speech part in the multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames comprises:
extracting voice characteristics of each voice frame, wherein the voice characteristics are at least used for representing semantic information and expression styles of the voice frames;
mapping the voice frame to a corresponding clustering center based on the voice characteristics, and taking the code of the clustering center as a clustering label of the voice frame;
performing dimension reduction processing on each voice frame according to the clustering labels to obtain a plurality of syllable vectors; and
and arranging the syllable vectors according to the language order of the voice part to form the syllable vector sequence.
3. The method of claim 1, wherein invoking the multimodal language model to analyze the sequence of stitched vectors to generate interactive text for responding to the multimodal speech task comprises:
invoking the multimode language model to analyze the spliced vector sequence, and predicting an interactive character vector sequence required by responding to the spliced vector sequence; and
and carrying out character reduction on the interactive character vector sequence according to the expression style of the multimode voice task, and generating the interactive text for responding to the multimode voice task.
4. The method for processing a multi-mode voice task according to claim 1, further comprising:
processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and
and arranging the character vectors according to the word order of the text part to form the character vector sequence.
5. The method for processing a multi-mode voice task according to any one of claims 1 to 4, further comprising:
and responding to the multimode voice task in a voice form, calling a multimode language model to analyze syllable vector sequences corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
6. The method for processing a multi-mode voice task according to any one of claims 1 to 4, further comprising:
and calling a multi-mode language model to analyze a character vector sequence corresponding to the multi-mode voice task in response to the multi-mode voice task in a text form, and generating an interactive text for responding to the multi-mode voice task.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of processing a multimodal speech task according to any of claims 1 to 6.
8. A readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for executing the method of processing a multimodal speech task according to any of claims 1 to 6.
CN202310977700.5A 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium Active CN116705058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310977700.5A CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310977700.5A CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116705058A (en) 2023-09-05
CN116705058B (en) 2023-10-27

Family

ID=87837849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310977700.5A Active CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116705058B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0573087A (en) * 1991-09-13 1993-03-26 Matsushita Electric Ind Co Ltd Speech recognizing method
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN111222330A (en) * 2019-12-26 2020-06-02 中国电力科学研究院有限公司 Chinese event detection method and system
CN111862953A (en) * 2019-12-05 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113392638A (en) * 2021-06-11 2021-09-14 北京世纪好未来教育科技有限公司 Text evaluation method, device, equipment and medium
CN114187914A (en) * 2021-12-17 2022-03-15 广东电网有限责任公司 Voice recognition method and system
CN114333772A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment, readable storage medium and product
CN114360502A (en) * 2021-11-03 2022-04-15 腾讯科技(深圳)有限公司 Processing method of voice recognition model, voice recognition method and device
CN114783407A (en) * 2022-06-21 2022-07-22 平安科技(深圳)有限公司 Speech synthesis model training method, device, computer equipment and storage medium
CN114842825A (en) * 2022-04-20 2022-08-02 杭州倒映有声科技有限公司 Emotion migration voice synthesis method and system
CN115410550A (en) * 2022-06-02 2022-11-29 柯登峰 Fine-grained rhythm-controllable emotion voice synthesis method, system and storage medium
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device
WO2023273610A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, medium, and electronic device

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0573087A (en) * 1991-09-13 1993-03-26 Matsushita Electric Ind Co Ltd Speech recognizing method
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN111862953A (en) * 2019-12-05 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
CN111222330A (en) * 2019-12-26 2020-06-02 中国电力科学研究院有限公司 Chinese event detection method and system
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113392638A (en) * 2021-06-11 2021-09-14 北京世纪好未来教育科技有限公司 Text evaluation method, device, equipment and medium
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device
WO2023273610A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, medium, and electronic device
CN114360502A (en) * 2021-11-03 2022-04-15 腾讯科技(深圳)有限公司 Processing method of voice recognition model, voice recognition method and device
CN114333772A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment, readable storage medium and product
CN114187914A (en) * 2021-12-17 2022-03-15 广东电网有限责任公司 Voice recognition method and system
CN114842825A (en) * 2022-04-20 2022-08-02 杭州倒映有声科技有限公司 Emotion migration voice synthesis method and system
CN115410550A (en) * 2022-06-02 2022-11-29 柯登峰 Fine-grained rhythm-controllable emotion voice synthesis method, system and storage medium
CN114783407A (en) * 2022-06-21 2022-07-22 平安科技(深圳)有限公司 Speech synthesis model training method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text normalization method in a Chinese speech synthesis system; Chen Zhigang, Hu Guoping, Wang Xifa; Journal of Chinese Information Processing (04); 44-51 *
Research on Uyghur speech synthesis based on variable-length phoneme-sequence concatenation units; Zhou Yan; Askar; Journal of Sichuan University of Science & Engineering (Natural Science Edition) (02); 64-68 *

Also Published As

Publication number Publication date
CN116705058A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN108630203B (en) Voice interaction device, processing method thereof, and program
CN110032742B (en) Response sentence generating apparatus, method and storage medium, and voice interaction system
CN106486121B (en) Voice optimization method and device applied to intelligent robot
CN110634487A (en) Bilingual mixed speech recognition method, device, equipment and storage medium
CN110223671B (en) Method, device, system and storage medium for predicting prosodic boundary of language
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
CN111192568A (en) Speech synthesis method and speech synthesis device
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
KR101097186B1 (en) System and method for synthesizing voice of multi-language
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115688937A (en) Model training method and device
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN114254649A (en) Language model training method and device, storage medium and equipment
CN113823265A (en) Voice recognition method and device and computer equipment
CN116705058B (en) Processing method of multimode voice task, electronic equipment and readable storage medium
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN111968646A (en) Voice recognition method and device
CN117133269A (en) Speech synthesis method, device, electronic equipment and storage medium
Chen et al. Integrated automatic expression prediction and speech synthesis from text
KR100806287B1 (en) Method for predicting sentence-final intonation and Text-to-Speech System and method based on the same
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant