CN116705058B - Processing method of multimode voice task, electronic equipment and readable storage medium - Google Patents

Processing method of multimode voice task, electronic equipment and readable storage medium

Info

Publication number
CN116705058B
Authority
CN
China
Prior art keywords
vector sequence
voice
mode
task
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310977700.5A
Other languages
Chinese (zh)
Other versions
CN116705058A (en)
Inventor
孙建伟
文成
赵帅江
邹伟
韩阳
李先刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd filed Critical Seashell Housing Beijing Technology Co Ltd
Priority to CN202310977700.5A
Publication of CN116705058A
Application granted
Publication of CN116705058B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method for processing a multi-modal speech task, including: processing each speech frame of the speech portion in the multi-modal speech task to obtain a syllable vector sequence corresponding to the plurality of speech frames; splicing, based on the word order of the multi-modal speech task, a character vector sequence of the text portion in the task with the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension; and invoking a multi-modal language model to analyze the spliced vector sequence and generate interactive text for responding to the multi-modal speech task. The present disclosure also provides an electronic device and a readable storage medium.

Description

Processing method of multimode voice task, electronic equipment and readable storage medium
Technical Field
The disclosure relates to the technical field of deep learning, and in particular relates to a processing method of a multimode voice task, electronic equipment and a readable storage medium.
Background
Traditional processing of speech tasks, such as speech recognition and speech synthesis, typically relies on manually designed feature representations or on labeled samples for supervised learning; however, the need for large amounts of labeled data and for manual feature engineering limits the scope of application of speech task processing, as well as the performance of the models that perform these tasks.
In the related art, several unsupervised speech pre-training methods have been proposed that construct a speech model through unsupervised learning, reducing the dependence of model training on large numbers of labeled samples. However, a speech model constructed in this way focuses only on the speech-feature dimension, so it can only predict the next frame of speech data and can hardly meet the processing requirements of multi-modal speech tasks, such as processing spliced speech and text content.
Disclosure of Invention
To solve at least one of the foregoing problems, the present disclosure provides a method for processing a multi-mode voice task, an electronic device, and a readable storage medium.
According to one aspect of the present disclosure, there is provided a method for processing a multi-mode voice task, including: processing each speech frame of a speech portion in a multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames; based on the word order of the multi-mode voice task, splicing a character vector sequence of a text part in the multi-mode voice task and the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimensionality; and invoking a multi-mode language model to analyze the spliced vector sequence and generating an interactive text for responding to the multi-mode voice task.
In some embodiments, the processing each speech frame of the speech portion in the multimodal speech task to obtain a syllable vector sequence corresponding to a plurality of the speech frames includes: extracting voice characteristics of each voice frame, wherein the voice characteristics are at least used for representing semantic information and expression styles of the voice frames; mapping the voice frame to a corresponding clustering center based on the voice characteristics, and taking the code of the clustering center as a clustering label of the voice frame; performing dimension reduction processing on each voice frame according to the clustering labels to obtain a plurality of syllable vectors; and arranging the syllable vectors according to the language sequence of the voice part to form the syllable vector sequence.
In some embodiments, the invoking the multimodal language model to analyze the stitched vector sequence to generate interactive text for responding to the multimodal speech task includes: invoking the multimode language model to analyze the spliced vector sequence, and predicting an interactive character vector sequence required by responding to the spliced vector sequence; and performing character reduction on the interactive character vector sequence according to the expression style of the multimode voice task to generate the interactive text for responding to the multimode voice task.
In some embodiments, further comprising: processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and arranging the character vectors according to the word order of the text part to form the character vector sequence.
In some embodiments, further comprising: training a language model with a plurality of sample vector sequences to construct the multimodal language model for processing a plurality of language forms, wherein the sample vector sequences include at least syllable sample vector sequences, character sample vector sequences, and concatenation sample vector sequences, and the language forms include at least a speech form, a text form, and a phonetic text concatenation form.
In some embodiments, the training the language model with multiple sample vector sequences to construct the multimodal language model for processing multiple language forms includes: respectively performing unsupervised training on the language model by using a syllable sample vector sequence and a character sample vector sequence to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence; splicing the syllable sample vector sequence and the character sample vector sequence according to a target language order to obtain a spliced sample vector sequence; invoking the single-mode language model to process the spliced sample vector sequence to obtain a processing result; and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as the multi-mode language model, wherein the multi-mode language model has the capability of processing the syllable vector sequence, the character vector sequence and the spliced vector sequence.
In some embodiments, further comprising: in response to the multimode voice task being in a voice form, invoking a multimode language model to analyze a syllable vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task; or in response to the multimode voice task being in a text form, invoking a multimode language model to analyze a character vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
According to another aspect of the present disclosure, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement a method for processing a multi-mode speech task according to any one of the embodiments.
According to a further aspect of the present disclosure there is provided a readable storage medium, characterized in that the readable storage medium stores a computer program adapted to be loaded by a processor for performing the method of processing a multimodal speech task as described in any of the embodiments above.
According to a further aspect of the present disclosure there is provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements a method of processing a multimodal speech task as described in any of the embodiments above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a method of processing a multimodal speech task in accordance with an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a processing architecture of a multimodal speech task in an exemplary embodiment of the disclosure.
Fig. 3 is a block diagram of a processing device for multimodal speech tasks in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
Fig. 1 is a flowchart of a method of processing a multimodal speech task in accordance with an exemplary embodiment of the present disclosure. The processing method S100 of the multi-mode voice task of the present disclosure will be described in detail with reference to fig. 1.
Step S102, processing each voice frame of the voice part in the multi-mode voice task to obtain syllable vector sequences corresponding to a plurality of voice frames.
A multi-modal speech task is data to be processed that a user enters in one or more language forms, such as speech, text, or spliced speech and text. Task processing methods in the related art are only suitable for predicting and extending text-form input and cannot handle input in other expression forms such as speech or spliced speech and text. The present method supports richer forms of input, so a user can submit multi-modal speech tasks, which improves the human-computer interaction experience and broadens the scenarios in which human-computer interaction can be applied. More specifically, depending on its language form, a multi-modal speech task may contain only a text portion, only a speech portion, or both.
The speech portion is the part of the multi-modal speech task that conveys information through sound, and it consists of a number of speech frames. Each speech frame expresses semantic information and presents an expression style shaped by the user's personal characteristics. In other words, each speech frame corresponds to speech features that are reflected at least in the semantic information it expresses and in the expression style it presents. In the present disclosure, speech frames serve as the discrete units of the speech portion, which facilitates processing its data.
To weaken the influence of the language form of the multi-modal speech task on subsequent processing, the data in the task need to be unified, including unifying their dimensionality and representation. To that end, the present disclosure converts each speech frame into a corresponding syllable vector and arranges the syllable vectors in the word order of the speech portion to form a syllable vector sequence.
Compared with a speech frame, a syllable vector carries lower-dimensional information, is easier to analyze and compute, and helps improve task processing efficiency. Likewise, the characters of the text portion in the multi-modal speech task are processed into corresponding character vectors whose dimensionality matches that of the syllable vectors, which makes the two easy to splice. Once the speech portion and the text portion are converted into vectors of the same dimension and form, later processing of the multi-modal speech task need not pay much attention to the language form of the input, so richer forms of input can be supported. Of course, other language forms also fall within the scope of the present disclosure, provided they can be converted into vectors of the same dimension and form.
Step S104, based on the word order of the multi-modal speech task, the character vector sequence of the text portion in the task is spliced with the syllable vector sequence to obtain a spliced vector sequence.
The word order represents the order of expression in the multi-modal speech task and generally follows the user's language habits and everyday human expression habits. Speech frames or characters arranged in a word order can express the interaction the user actually intends, so the multi-modal speech task is not simply a pile of speech frames or characters.
The spliced vector sequence is the result of splicing the character vector sequence and the syllable vector sequence according to the word order. It can fully express, in vector form, the semantics of the multi-modal speech task.
Of course, each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension, which ensures that the two sequences can be spliced.
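For illustration only, a minimal Python sketch of this splicing step is given below. It assumes the speech portion and the text portion have already been converted into vector sequences of the same dimension; the function name and segment layout are assumptions of the sketch, not details of the embodiment.
```python
# A minimal sketch of the splicing in step S104, assuming the text portion and
# the speech portion have already been converted into vector sequences of the
# same dimension; the function name and segment layout are assumptions.
import numpy as np

def splice_by_word_order(segments: list[tuple[str, np.ndarray]]) -> np.ndarray:
    """segments: (modality, vector_sequence) pairs already arranged in the word
    order of the multi-modal speech task, e.g.
    [("text", char_vectors), ("speech", syllable_vectors)]."""
    dims = {seq.shape[1] for _, seq in segments}
    if len(dims) != 1:
        raise ValueError(f"character and syllable vectors must share one dimension, got {dims}")
    # Concatenate along the sequence axis: the spliced vector sequence expresses
    # the whole task in vector form regardless of the language form of each part.
    return np.concatenate([seq for _, seq in segments], axis=0)
```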
Step S106, calling the multi-modal language model to analyze the spliced vector sequence and generating interactive text for responding to the multi-modal speech task.
The multi-modal language model is a neural network model that analyzes the spliced vector sequence. It has self-learning capability: through training it learns human expression habits, so it can analyze, predict, and extend input data and output feedback results, achieving the goal of human-computer interaction. Unlike language models in the related art, which can only analyze input in a single form such as text or speech, the multi-modal language model analyzes vector sequences corresponding to data in various language forms without distinction.
The interactive text is the final output of the multi-modal language model; it is generated based on the personalized expression characteristics contained in the multi-modal speech task and conforms to human communication habits. The content of the interactive text meets the feedback requirement of the multi-modal speech task and addresses what the task asks for, such as writing articles, answering questions, or everyday conversation.
Fig. 2 is a schematic diagram of a processing architecture of a multimodal speech task in an exemplary embodiment of the disclosure. The method of processing a multi-modal speech task will be described in more detail below in conjunction with fig. 2.
In some embodiments, step S102 is implemented as follows: extract the speech features of each speech frame; based on the speech features, map each speech frame to a corresponding cluster center and take the code of that cluster center as the frame's cluster label; perform dimension reduction on each speech frame according to its cluster label to obtain a set of syllable vectors; and arrange the syllable vectors in the word order of the speech portion to form the syllable vector sequence.
The speech features may be, for example, mel-spectrogram features or the like; they contain the semantic information of the speech frames and the user's expression style, expressed concretely as frequency information, timbre information, pitch information, semantic content, and so on.
The cluster centers are speech categories obtained by partitioning the speech portion according to its features, and each cluster center can correspond to multiple speech frames. For example, if the speech portion has 100 speech frames and 10 cluster centers are set, then each speech frame can be mapped to a cluster center according to its speech features, so that frames belonging to the same cluster center exhibit the same or similar speech features. In addition, each cluster center has a category code, and different cluster centers have different codes to distinguish them.
The cluster label is an identifier of the category to which a speech frame belongs, and frames mapped to the same cluster center share the same label. For example, the category code of the cluster center to which a speech frame is mapped may serve as its cluster label; the label may also be set to other content, which is not enumerated here.
Mapping each speech frame to its corresponding cluster center can be realized with a k-means algorithm, so that the speech frames in the same cluster have the same or similar speech features.
For the dimension reduction, an embedding method can convert the cluster label of each speech frame into a discretized token (namely, a syllable vector) with the same dimension as a character vector. Syllable vectors are numerical representations of the speech features of the frames; they are convenient for model training and analysis and for splicing input content of different language forms.
Of course, a token corresponding to a speech frame and a token corresponding to a character only share the same representation form and dimension; their output nodes are independent, and the features they characterize are independent. For example, the speech portion may have ten cluster centers numbered 1 to 10 and the text portion five cluster centers numbered 11 to 15; each cluster center characterizes different features, different language forms may have different numbers of cluster centers, and the two sets of cluster centers are independent of each other.
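For illustration only, the following Python sketch shows one possible realization of step S102 under stated assumptions: per-frame speech features are already available, k-means provides the cluster centers, and a learned embedding table is approximated here by a random one. The cluster count, embedding dimension, and function names are illustrative, not values from the embodiment.
```python
# A minimal sketch of step S102, assuming per-frame speech features (e.g. a
# mel-spectrogram-like vector per frame) are already available. The cluster
# count, embedding dimension, and function names are illustrative assumptions,
# and the randomly initialised embedding table stands in for a learned one.
import numpy as np
from sklearn.cluster import KMeans

def frames_to_syllable_vectors(frame_features: np.ndarray,
                               n_clusters: int = 10,
                               embed_dim: int = 256,
                               seed: int = 0) -> np.ndarray:
    """frame_features: (num_frames, feature_dim) speech features, one row per frame.
    Returns a (num_frames, embed_dim) syllable vector sequence in frame order."""
    # Map each speech frame to a cluster centre (k-means, as suggested above);
    # the index of the centre serves as the frame's cluster label.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(frame_features)
    # Dimension reduction: look each label up in an embedding table so the frame
    # becomes a discretised token with the same width as a character vector.
    rng = np.random.default_rng(seed)
    embedding_table = rng.normal(size=(n_clusters, embed_dim)).astype(np.float32)
    # The rows come out in the original frame order, i.e. the word order of the
    # speech portion, so the result is already the syllable vector sequence.
    return embedding_table[labels]
```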
In some embodiments, step S106 is implemented as follows: call the multi-modal language model to analyze the spliced vector sequence and predict the interactive character vector sequence needed to respond to it; and restore the interactive character vector sequence to characters according to the expression style of the multi-modal speech task, generating the interactive text used to respond to the task.
The interactive character vector sequence is the multi-modal language model's analysis result for the spliced vector sequence; its semantic content is the content of the interactive text, but it is output as a vector sequence. Both the input and the output of the multi-modal language model are thus vector sequences, which improves analysis efficiency and reduces the model's computational load.
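For illustration only, a sketch of step S106 follows. The multi-modal language model is assumed to be exposed as a callable that predicts the next interactive character vector, and character reduction is assumed to be available as a separate function; both interfaces are hypothetical.
```python
# A sketch of step S106 under stated assumptions: `model` is a callable that
# predicts the next interactive character vector from a vector-sequence prefix,
# and `restore_characters` maps a predicted vector sequence back to text in the
# expression style of the task. Both interfaces are hypothetical.
import numpy as np

def generate_interactive_text(model, spliced_vectors: np.ndarray,
                              restore_characters, max_steps: int = 64) -> str:
    context = spliced_vectors          # (seq_len, dim) spliced vector sequence
    predicted = []
    for _ in range(max_steps):
        next_vec = model(context)      # predict the next interactive character vector
        predicted.append(next_vec)
        context = np.vstack([context, next_vec])   # extend the context autoregressively
    # Character reduction: turn the interactive character vector sequence back
    # into human-readable interactive text.
    return restore_characters(np.stack(predicted))
```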
In some embodiments, the method for processing a multi-mode voice task further includes: processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and arranging the character vectors according to the word order of the text part to form a character vector sequence.
In some embodiments, the method for processing a multi-mode voice task further includes: the language model is trained with a plurality of sample vector sequences to construct a multimodal language model for processing a plurality of language forms.
The sample vector sequences include at least syllable sample vector sequences, character sample vector sequences, and spliced sample vector sequences; training the language model on these various types of samples makes it suitable for processing multi-modal speech tasks in various language forms. As noted above, the language forms include at least a speech form, a text form, and a spliced speech-text form.
Specifically, the process of constructing the multimodal language model is mainly embodied as: performing unsupervised training on the language model by using the syllable sample vector sequence and the character sample vector sequence respectively to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence; splicing the syllable sample vector sequence and the character sample vector sequence according to the target language order to obtain a spliced sample vector sequence; calling a single-mode language model to process the spliced sample vector sequence to obtain a processing result; and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as a multi-mode language model, wherein the multi-mode language model has the capability of processing syllable vector sequences, character vector sequences and spliced vector sequences.
In other words, to construct the multi-modal language model, unsupervised pre-training is first performed to build a single-mode language model; the weights of that model are then optimized using spliced sample vector sequences and manual prompts, so that it gains the ability to process input data in multiple language forms and becomes the multi-modal language model.
Unsupervised pre-training overcomes the drawback of needing large amounts of labeled sample data to train the language model and reduces the dependence of the multi-modal language model's construction on manual prompts.
More specifically, the single-mode language model can be trained using speech pre-training methods such as WavLM and HuBERT, mainly based on the speech features and the unsupervised cluster labels; the trained single-mode language model can map between speech features and cluster labels.
First, the character sample vector sequences are fed into a speech model to obtain a speech model capable of analyzing character vector sequences. Next, another speech model is initialized with the weights of that model; the syllable sample vector sequences are fed into it, and the model learns the relations between cluster labels in an autoregressive manner, predicting the next cluster label from the labels input so far. The predicted next cluster label is compared with the cluster label at the corresponding position in the syllable sample vector sequence, the result is fed back to the model, and the weights are adjusted. When the predicted cluster labels agree with the cluster labels at the corresponding positions of the syllable sample vector sequence, the model has demonstrably gained the ability to analyze syllable vector sequences, and the single-mode language model is obtained.
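For illustration only, the autoregressive, unsupervised pre-training on cluster-label sequences could look like the following PyTorch sketch. The model interface (a label prefix in, next-label logits out), the optimizer, and all hyperparameters are assumptions of this sketch rather than details of the embodiment.
```python
# A sketch of the autoregressive, unsupervised pre-training on cluster-label
# sequences. The model interface (a label prefix in, next-label logits out),
# the optimiser, and all hyperparameters are assumptions of this sketch.
import torch
import torch.nn as nn

def pretrain_on_label_sequences(model: nn.Module,
                                label_sequences: list,
                                epochs: int = 3, lr: float = 1e-4) -> nn.Module:
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for labels in label_sequences:            # labels: (seq_len,) LongTensor of cluster labels
            inputs, targets = labels[:-1], labels[1:]
            logits = model(inputs.unsqueeze(0))   # assumed shape: (1, seq_len - 1, n_labels)
            loss = loss_fn(logits.squeeze(0), targets)   # compare predicted next label with the true one
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()                      # feed the deviation back and adjust the weights
    return model
```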
Since vectors of different language forms characterize different features, the output nodes of the single-mode language model that predict labels for syllable vector sequences differ from those that predict labels for character vector sequences.
Further, the spliced sample vector sequences are used as input to continue training the single-mode language model on prediction, yielding the multi-modal language model. For example, suppose the sample task is "predict the speech 'Beijing issued a high-temperature warning today'", where "predict the speech '…'" is the text portion and "Beijing issued a high-temperature warning today" is the speech portion. The speech portion of the sample task is converted into a syllable sample vector sequence and the text portion into a character sample vector sequence; the two are spliced according to the word order of the sample task, the resulting spliced sample vector sequence is fed into the single-mode language model, and the model outputs an interactive character vector sequence related to the content of the speech portion. The interactive character vector sequence is then restored to interactive text in human language; if the content of the interactive text is the same as or similar to the expected output "Beijing issued a high-temperature warning today", the model is shown to be usable as a multi-modal language model for predicting spliced vector sequences; otherwise, the model weights must continue to be optimized. Note that the expected output may be supplied by manual prompting, which is not limited here.
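For illustration only, the weight adjustment on spliced sample vector sequences could be sketched as follows, assuming each spliced sample is paired with the expected interactive character labels and using cross-entropy as the deviation value; the threshold, learning rate, and model interface are illustrative assumptions.
```python
# A sketch of the weight adjustment on spliced samples, assuming each spliced
# sample vector sequence is paired with the expected interactive character
# labels and using cross-entropy as the deviation value. The threshold,
# learning rate, and model interface are illustrative assumptions.
import torch
import torch.nn as nn

def finetune_on_spliced_samples(model: nn.Module, samples: list,
                                threshold: float = 0.1, lr: float = 1e-5,
                                max_rounds: int = 100) -> nn.Module:
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(max_rounds):
        total = 0.0
        for spliced_vectors, expected_labels in samples:
            logits = model(spliced_vectors.unsqueeze(0)).squeeze(0)
            deviation = loss_fn(logits, expected_labels)  # deviation between processing result and expected result
            optimiser.zero_grad()
            deviation.backward()
            optimiser.step()
            total += deviation.item()
        if total / len(samples) <= threshold:     # stop once the deviation is at or below the preset threshold
            break
    return model
```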
In some embodiments, the method for processing a multi-mode voice task may further include: responding to the multimode voice task in a voice form, calling a multimode language model to analyze syllable vector sequences corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task; or in response to the multimode voice task being in a text form, invoking the multimode language model to analyze a character vector sequence corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
With the processing method for multi-modal speech tasks described above, the content of the input task is vectorized and reduced in dimension, so that portions in different language forms are unified into a vector-sequence representation; this weakens the influence of the input's language form on the analysis result and broadens the scenarios in which speech tasks can be processed. In addition, a single-mode language model is built in an unsupervised manner and the multi-modal language model is generated on top of it, which removes the need for large amounts of manually annotated data and ensures that multi-modal speech tasks remain feasible.
Fig. 3 is a block diagram of a processing apparatus for multi-modal speech tasks in accordance with an exemplary embodiment of the present disclosure. As shown in fig. 3, the present disclosure proposes a processing apparatus 1000 for a multi-modal speech task, comprising: a syllable vector sequence extraction module 1002, configured to process each speech frame of the speech portion in the multi-modal speech task to obtain a syllable vector sequence corresponding to the plurality of speech frames; a splicing module 1004, configured to splice a character vector sequence of the text portion in the multi-modal speech task with the syllable vector sequence, based on the word order of the task, to obtain a spliced vector sequence, where each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimension; and an interactive text generation module 1006, configured to call the multi-modal language model to analyze the spliced vector sequence and generate interactive text for responding to the multi-modal speech task.
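For illustration only, the division of labor among the three modules could be composed as in the following sketch, which reuses the hypothetical helper functions sketched earlier in this description (frames_to_syllable_vectors, splice_by_word_order, generate_interactive_text); the class and method names are illustrative.
```python
# A structural sketch of the apparatus in Fig. 3, reusing the hypothetical
# helpers sketched earlier in this description (frames_to_syllable_vectors,
# splice_by_word_order, generate_interactive_text). Class and method names
# are illustrative, not the reference implementation of the embodiment.
class MultiModalSpeechTaskProcessor:
    def __init__(self, model, restore_characters):
        self.model = model
        self.restore_characters = restore_characters

    def process(self, frame_features, char_vectors):
        # Module 1002: speech frames -> syllable vector sequence.
        syllable_vectors = frames_to_syllable_vectors(frame_features)
        # Module 1004: splice character and syllable vectors by word order.
        spliced = splice_by_word_order([("text", char_vectors),
                                        ("speech", syllable_vectors)])
        # Module 1006: analyse the spliced sequence and produce interactive text.
        return generate_interactive_text(self.model, spliced, self.restore_characters)
```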
Each module in the processing apparatus 1000 for the multi-modal speech task is provided to implement a corresponding step of the processing method described above, so their implementation principles and processes can refer to the foregoing and are not repeated here.
The apparatus 1000 may include corresponding modules that perform the steps of the flowcharts discussed above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied on a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when a software program is loaded into memory and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the method of the above embodiment may be implemented by a program to instruct related hardware, and the program may be stored in a readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiment.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (8)

1. A method for processing a multi-mode speech task, comprising:
processing each speech frame of a speech portion in a multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames;
based on the word order of the multi-mode voice task, splicing a character vector sequence of a text part in the multi-mode voice task and the syllable vector sequence to obtain a spliced vector sequence, wherein each character vector in the character vector sequence and each syllable vector in the syllable vector sequence have the same dimensionality; and
calling a multimode language model to analyze the spliced vector sequence and generating an interactive text for responding to the multimode voice task;
training a language model by using a plurality of sample vector sequences to construct the multimode language model for processing a plurality of language forms, wherein the sample vector sequences at least comprise syllable sample vector sequences, character sample vector sequences and spliced sample vector sequences, and the language forms at least comprise a voice form, a text form and a voice and text splicing form;
wherein training a language model with a plurality of sample vector sequences to construct the multimodal language model for processing a plurality of language forms comprises:
respectively performing unsupervised training on the language model by using a syllable sample vector sequence and a character sample vector sequence to obtain a single-mode language model, wherein the single-mode language model has the capability of processing the syllable vector sequence and the character vector sequence;
splicing the syllable sample vector sequence and the character sample vector sequence according to a target language order to obtain a spliced sample vector sequence;
invoking the single-mode language model to process the spliced sample vector sequence to obtain a processing result; and
and adjusting the weight of the single-mode language model according to the deviation value between the processing result and the expected result until the deviation value corresponding to the processing result is smaller than or equal to a preset threshold value, and taking the single-mode language model after weight adjustment as the multi-mode language model, wherein the multi-mode language model has the capability of processing the syllable vector sequence, the character vector sequence and the spliced vector sequence.
2. The method according to claim 1, wherein processing each speech frame of a speech part in the multi-modal speech task to obtain syllable vector sequences corresponding to a plurality of the speech frames comprises:
extracting voice characteristics of each voice frame, wherein the voice characteristics are at least used for representing semantic information and expression styles of the voice frames;
mapping the voice frame to a corresponding clustering center based on the voice characteristics, and taking the code of the clustering center as a clustering label of the voice frame;
performing dimension reduction processing on each voice frame according to the clustering labels to obtain a plurality of syllable vectors; and
and arranging the syllable vectors according to the language order of the voice part to form the syllable vector sequence.
3. The method of claim 1, wherein invoking the multimodal language model to analyze the sequence of stitched vectors to generate interactive text for responding to the multimodal speech task comprises:
invoking the multimode language model to analyze the spliced vector sequence, and predicting an interactive character vector sequence required by responding to the spliced vector sequence; and
and carrying out character reduction on the interactive character vector sequence according to the expression style of the multimode voice task, and generating the interactive text for responding to the multimode voice task.
4. The method for processing a multi-mode voice task according to claim 1, further comprising:
processing each character of the text part to obtain a plurality of character vectors corresponding to each character; and
and arranging the character vectors according to the word order of the text part to form the character vector sequence.
5. The method for processing a multi-mode voice task according to any one of claims 1 to 4, further comprising:
and responding to the multimode voice task in a voice form, calling a multimode language model to analyze syllable vector sequences corresponding to the multimode voice task, and generating an interactive text for responding to the multimode voice task.
6. The method for processing a multi-mode voice task according to any one of claims 1 to 4, further comprising:
and calling a multi-mode language model to analyze a character vector sequence corresponding to the multi-mode voice task in response to the multi-mode voice task in a text form, and generating an interactive text for responding to the multi-mode voice task.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of processing a multimodal speech task according to any of claims 1 to 6.
8. A readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for executing the method of processing a multimodal speech task according to any of claims 1 to 6.
CN202310977700.5A 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium Active CN116705058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310977700.5A CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310977700.5A CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116705058A (en) 2023-09-05
CN116705058B (en) 2023-10-27

Family

ID=87837849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310977700.5A Active CN116705058B (en) 2023-08-04 2023-08-04 Processing method of multimode voice task, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116705058B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0573087A (en) * 1991-09-13 1993-03-26 Matsushita Electric Ind Co Ltd Speech recognizing method
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN111222330A (en) * 2019-12-26 2020-06-02 中国电力科学研究院有限公司 Chinese event detection method and system
CN111862953A (en) * 2019-12-05 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113392638A (en) * 2021-06-11 2021-09-14 北京世纪好未来教育科技有限公司 Text evaluation method, device, equipment and medium
CN114187914A (en) * 2021-12-17 2022-03-15 广东电网有限责任公司 Voice recognition method and system
CN114333772A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment, readable storage medium and product
CN114360502A (en) * 2021-11-03 2022-04-15 腾讯科技(深圳)有限公司 Processing method of voice recognition model, voice recognition method and device
CN114783407A (en) * 2022-06-21 2022-07-22 平安科技(深圳)有限公司 Speech synthesis model training method, device, computer equipment and storage medium
CN114842825A (en) * 2022-04-20 2022-08-02 杭州倒映有声科技有限公司 Emotion migration voice synthesis method and system
CN115410550A (en) * 2022-06-02 2022-11-29 柯登峰 Fine-grained rhythm-controllable emotion voice synthesis method, system and storage medium
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device
WO2023273610A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, medium, and electronic device

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0573087A (en) * 1991-09-13 1993-03-26 Matsushita Electric Ind Co Ltd Speech recognizing method
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN111862953A (en) * 2019-12-05 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method of voice recognition model, voice recognition method and device
CN111222330A (en) * 2019-12-26 2020-06-02 中国电力科学研究院有限公司 Chinese event detection method and system
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113392638A (en) * 2021-06-11 2021-09-14 北京世纪好未来教育科技有限公司 Text evaluation method, device, equipment and medium
WO2023273578A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and medium and device
WO2023273610A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Speech recognition method and apparatus, medium, and electronic device
CN114360502A (en) * 2021-11-03 2022-04-15 腾讯科技(深圳)有限公司 Processing method of voice recognition model, voice recognition method and device
CN114333772A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment, readable storage medium and product
CN114187914A (en) * 2021-12-17 2022-03-15 广东电网有限责任公司 Voice recognition method and system
CN114842825A (en) * 2022-04-20 2022-08-02 杭州倒映有声科技有限公司 Emotion migration voice synthesis method and system
CN115410550A (en) * 2022-06-02 2022-11-29 柯登峰 Fine-grained rhythm-controllable emotion voice synthesis method, system and storage medium
CN114783407A (en) * 2022-06-21 2022-07-22 平安科技(深圳)有限公司 Speech synthesis model training method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text normalization method in a Chinese speech synthesis system; Chen Zhigang, Hu Guoping, Wang Xifa; Journal of Chinese Information Processing (04); 44-51 *
Research on Uyghur speech synthesis based on variable-length phoneme-sequence concatenation units; Zhou Yan; Askar; Journal of Sichuan University of Science & Engineering (Natural Science Edition) (02); 64-68 *

Also Published As

Publication number Publication date
CN116705058A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN108630203B (en) Voice interaction device, processing method thereof, and program
CN110032742B (en) Response sentence generating apparatus, method and storage medium, and voice interaction system
CN106486121B (en) Voice optimization method and device applied to intelligent robot
CN110634487A (en) Bilingual mixed speech recognition method, device, equipment and storage medium
CN110223671B (en) Method, device, system and storage medium for predicting prosodic boundary of language
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
CN111192568A (en) Speech synthesis method and speech synthesis device
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
KR101097186B1 (en) System and method for synthesizing voice of multi-language
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115688937A (en) Model training method and device
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN114254649A (en) Language model training method and device, storage medium and equipment
CN113823265A (en) Voice recognition method and device and computer equipment
CN116705058B (en) Processing method of multimode voice task, electronic equipment and readable storage medium
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN111968646A (en) Voice recognition method and device
CN117133269A (en) Speech synthesis method, device, electronic equipment and storage medium
Chen et al. Integrated automatic expression prediction and speech synthesis from text
KR100806287B1 (en) Method for predicting sentence-final intonation and Text-to-Speech System and method based on the same
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant