CN117854492A - Intelligent interaction method, system, equipment and storage medium based on large model - Google Patents
Intelligent interaction method, system, device and storage medium based on large models
Info
- Publication number: CN117854492A (application CN202410104210.9A)
- Authority: CN (China)
- Prior art keywords: feature sequence, text, audio, voice, module
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/1822: Parsing for meaning understanding (under G10L15/18, speech classification or search using natural language modelling)
- G10L15/063: Training (under G10L15/06, creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/26: Speech to text systems
- G10L21/055: Time compression or expansion for synchronising with other signals, e.g. video signals
- G10L2015/0631: Creating reference templates; Clustering
All within G10L (speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding), G10 (musical instruments; acoustics), G (physics).
Abstract
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent interaction method, system, device, and storage medium based on large models, aiming to solve the technical problem of poor human-computer interaction experience caused by the lack of higher-level semantics and finer lower-level voice characteristic information. An audio feature sequence is extracted from the voice information by a large speech model, preserving the higher-level semantics, while a speech-text alignment module produces a first text token feature sequence aligned with the audio feature sequence, so that the large language model has access to the finer lower-level voice characteristics during processing, recognizes user intent more accurately, and improves the user's human-computer interaction experience.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular provides an intelligent interaction method, system, device, and storage medium based on large models.
Background
Current mainstream human-computer interaction systems are typically assembled from several algorithm modules, such as ASR speech recognition, an LLM large language model, and a TTS speech synthesis system, together with complex algorithm flows and exception-handling logic, and are widely applied in various intelligent human-computer interaction scenarios. The ASR speech recognition module recognizes a person's speech input as text content; the LLM large language model receives the text output of the ASR module and, after performing user intent recognition and semantic understanding, outputs a corresponding text response; finally, the text content generated by the LLM is sent to the TTS speech synthesis module, which generates and plays high-quality, natural, human-like speech.
Although the individual algorithm models that make up current human-computer interaction systems have each been optimized and improved within their respective modalities, several hard problems remain:
1. ASR has poor adaptivity and robustness to scenes (far/near field, environmental noise), speakers (accent, speaking rate), dialects, and so on, so recognition quality is poor and the interaction fails at the very first step;
2. because the training data of ASR contains little textual semantic information, ASR models usually employ an external n-gram or RNN language model to improve transcription accuracy; however, such an external language model is still built from text content of a single domain, fits only the language and scenarios characteristic of that domain, and generalizes poorly;
3. the LLM large language model has only single-modality text information; during interaction it lacks the paralinguistic information carried by speech, so it cannot correctly understand user intent, the generated content is incoherent or contradicts the user's actual intent, and the practical interaction experience suffers;
4. the text front-end module of TTS speech synthesis converts text into correct readings, pronunciations, and prosodic pauses, which requires a great deal of manual maintenance and anomaly fixing; yet new words emerge endlessly and text content is rich and varied, so prediction accuracy is greatly affected and the intelligibility and naturalness of the synthesized speech decline;
5. the algorithm models are mutually independent, and the recognition errors they produce are amplified in cascade along the interaction flow, degrading the overall interaction effect.
Therefore, in such human-computer interaction schemes, the lack of higher-level semantics and finer lower-level voice characteristic information often leads to a poor human-computer interaction experience, which is a technical problem in urgent need of a solution.
Disclosure of Invention
The present invention is proposed to overcome the above drawbacks, that is, to solve, or at least partially solve, the technical problem that the traditional human-computer interaction scheme lacks higher-level semantics and finer lower-level voice characteristic information, resulting in a poor human-computer interaction experience.
In a first aspect, the present invention provides a large model-based intelligent interaction method, including:
extracting an audio feature sequence of voice information through a large speech model and transmitting the audio feature sequence to a speech-text alignment module;
acquiring, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
acquiring, through a tokenization module, a second text token feature sequence corresponding to an input prompt word text;
acquiring a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence;
determining, by the large language model, a processing strategy for the third text token feature sequence based on the prompt word type.
In one embodiment, the tokenization module encodes any piece of text into a text token feature sequence.
In one embodiment, the speech-text alignment module is configured to complete the alignment mapping from the audio feature sequence to the text token feature sequence.
In one embodiment, the first text token feature sequence and the second text token feature sequence are spliced to obtain a third text token feature sequence.
In one embodiment, the prompt word type includes at least one of the following: a recognition type and an audio output type.
In one embodiment, if the prompt word type is the recognition type, a deserialization operation is performed on the third text token feature sequence by the large language model, and text reply content is output.
In one embodiment, if the prompt word type is the audio output type, a fourth text token feature sequence is obtained after the third text token feature sequence is processed by the large language model.
In one embodiment, the audio feature sequence aligned with the fourth text token feature sequence is acquired through a text-speech alignment module; the text-speech alignment module is configured to realize the alignment mapping from the text token feature sequence to the audio feature sequence.
In one embodiment, the target speech information is output based on the aligned audio feature sequence.
In one embodiment, the large speech model is a deep learning model trained on a large amount of speech data, where the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised learning;
the large language model is a deep learning model trained on a large amount of text data, where the training method likewise comprises at least one of the following: supervised, self-supervised, and semi-supervised learning.
In one embodiment, a frame-level high-dimensional feature sequence of the input speech is extracted by the large speech model as the audio feature sequence.
In one embodiment, the audio feature sequence carries audio content information, audio emotion information, audio prosody information, audio voiceprint information, audio scene information, and audio event information.
In a second aspect, the present invention provides a large-model-based intelligent interaction system comprising: a first acquisition module, an alignment module, a second acquisition module, a third acquisition module, and a processing module;
the first acquisition module is configured to extract an audio feature sequence of the voice information through the large speech model and transmit the audio feature sequence to the speech-text alignment module;
the alignment module is configured to acquire, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
the second acquisition module is configured to acquire, through the tokenization module, a second text token feature sequence corresponding to the input prompt word text;
the third acquisition module is configured to acquire a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence;
and the processing module is configured to determine, through the large language model, a processing strategy for the third text token feature sequence based on the prompt word type.
In a third aspect, a computer device is provided, comprising a processor and a storage device, wherein the storage device stores a program, and the processor, when executing the program, implements the large-model-based intelligent interaction method according to any one of the above solutions.
In a fourth aspect, a computer-readable storage medium is provided, in which a program is stored; the program, when executed, implements the large-model-based intelligent interaction method according to any one of the above solutions.
The technical solution provided by the invention has at least one or more of the following beneficial effects:
The technical solution of the invention comprises: extracting an audio feature sequence of voice information through a large speech model and transmitting it to a speech-text alignment module; acquiring, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence; acquiring, through a tokenization module, a second text token feature sequence corresponding to the input prompt word text; acquiring a third text token feature sequence based on the first and second text token feature sequences; and determining, by the large language model, a processing strategy for the third text token feature sequence based on the prompt word type. Because the audio features are extracted by the large speech model, the higher-level semantics are preserved; at the same time, the speech-text alignment module yields the first text token feature sequence aligned with the audio feature sequence, so the large language model also has access to the finer lower-level voice characteristics during processing, recognizes user intent better, and improves the user's human-computer interaction experience.
Further, the scheme can skip intermediate results such as the ASR recognition result and the TTS text front-end conversion, is no longer limited by the accuracy of those two modules, and performs end-to-end speech-to-speech interaction directly, improving the overall interaction effect.
Further, through different prompt texts (prompt words), the end-to-end multimodal human-computer interaction system can also perform tasks such as single-modality text interaction, ASR speech recognition, voiceprint recognition, and emotion recognition, unifying NLP text tasks and speech tasks under one framework.
Drawings
The present disclosure will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: the drawings are for illustrative purposes only and are not intended to limit the scope of the present invention. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow diagram of the main steps of a large model-based intelligent interaction method according to one embodiment of the present invention;
FIG. 2 is a schematic block diagram of the main structure of a large model-based intelligent interaction system according to one embodiment of the present invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "a and/or B" means all possible combinations of a and B, such as a alone, B alone or a and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone or A and B. The singular forms "a", "an" and "the" include plural referents.
Explanation of terms:
Large speech model: a deep learning model trained on a large amount of speech data, where the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised learning. Generally speaking, a large speech model is a deep learning model with a large parameter scale (for example, billions of parameters or more) that, by training on a large amount of speech data, learns to understand rich low-level speech information such as speech content, emotion, scene/event, prosody, and voiceprint.
Large language model: a deep learning model trained on a large amount of text data, where the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised learning. Generally speaking, a large language model is a deep learning model with a large parameter scale, for example billions of parameters or more.
ASR speech recognition: refers to automatic continuous speech recognition, recognizing text information contained in a piece of audio.
TTS speech synthesis: refers to a text-to-speech synthesis system, which can synthesize any piece of text content into complete playable audio.
Token tokenization: refers to encoding text into the token id sequence employed when training a large language model.
Speech-text alignment module: completes the alignment mapping from an audio feature sequence to a text token feature sequence, for example, from the frame-level high-dimensional audio feature sequence to the text token feature sequence.
Tokenization module: encodes any piece of text into a text token feature sequence.
Text-to-speech alignment module: for implementing an alignment mapping of text token feature sequences to audio feature sequences.
Vocoder module: outputs complete speech audio based on the audio feature sequence output by the text-speech alignment module.
FIG. 1 is a flow diagram of the main steps of a large model-based intelligent interaction method according to one embodiment of the present invention; as shown in fig. 1, the method mainly comprises the following steps S101-S105:
step S101, extracting an audio feature sequence of the voice information through the large speech model and transmitting the audio feature sequence to the speech-text alignment module;
in this embodiment, the audio feature sequence of the voice information is extracted by the large speech model. Specifically, the input of the speech modality is raw audio, from which the large speech model extracts a frame-level high-dimensional feature sequence as the audio feature sequence, for example a T x D audio feature sequence, where T denotes the length (number of frames) and D denotes the feature dimension.
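As an illustration only, frame-level extraction of a T x D feature sequence can be sketched as a sliding window over the raw samples followed by a linear projection; the random projection below is a hypothetical stand-in for a trained large speech model encoder, and the frame/hop sizes are assumed values:

```python
import numpy as np

def extract_frame_features(waveform, frame_len=400, hop=160, dim=16, seed=0):
    """Slice raw audio into overlapping frames and project each frame to
    `dim` features, yielding a T x D sequence (the random projection is a
    stand-in for a trained speech encoder)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((frame_len, dim))
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])          # (T, frame_len)
    return frames @ proj                                   # (T, dim)

waveform = np.random.default_rng(1).standard_normal(16000)  # 1 s at 16 kHz
features = extract_frame_features(waveform)
print(features.shape)  # (98, 16)
```

Replacing the random projection with a trained encoder is what would give the features their semantic content; the sketch only fixes the shapes involved.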
In this embodiment, the audio feature sequence carries audio content information, audio emotion information, audio prosody information, audio voiceprint information, audio scene information, and audio event information.
In this embodiment, the large speech model is a deep learning model trained by a large amount of speech data, where the training method includes at least one of the following: has supervision, self-supervision and semi-supervision.
Because the large voice model is trained by a large amount of voice data, the voice information is subjected to audio feature extraction by the large voice model to obtain higher-layer semantics (such as audio content information, audio emotion information, audio prosody information, audio voiceprint information, audio scene information and audio event information).
Step S102, acquiring, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
in this embodiment, the speech-text alignment module is configured to complete the alignment mapping from the audio feature sequence to the text token feature sequence. The speech-text alignment module needs to be set up and trained separately; it receives the audio feature sequence extracted by the large speech model and outputs the corresponding text token feature sequence.
Specifically, the obtained T x D audio feature sequence is converted, through the alignment of the speech-text alignment module, into a T x 1-dimensional text token feature sequence.
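Under simplifying assumptions, the T x D to T x 1 alignment can be sketched as a frame-wise classifier over the text token vocabulary; the randomly initialized classifier below is a hypothetical stand-in for the separately trained speech-text alignment module:

```python
import numpy as np

def align_audio_to_tokens(audio_features, vocab_size=100, seed=0):
    """Map a T x D audio feature sequence to a T x 1 text token id sequence
    via a frame-wise classifier (randomly initialized here; the real
    module would be trained separately, as the text notes)."""
    rng = np.random.default_rng(seed)
    d = audio_features.shape[1]
    classifier = rng.standard_normal((d, vocab_size))
    logits = audio_features @ classifier              # (T, vocab_size)
    return logits.argmax(axis=1).reshape(-1, 1)       # (T, 1)

feats = np.random.default_rng(2).standard_normal((50, 16))   # toy T x D features
tokens = align_audio_to_tokens(feats)
print(tokens.shape)  # (50, 1)
```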
Because the audio feature extraction by the large speech model preserves the higher-level semantics, the aligned text token feature sequence also carries the finer lower-level voice characteristics, so the subsequent large language model can better recognize user intent, improving the user's human-computer interaction experience.
Step S103, obtaining a second text token feature sequence corresponding to the input prompt word text through a tokenization module;
in this embodiment, the tokenizing module is configured to encode an arbitrary section of text into a token feature sequence.
Specifically, the input of the text modality is a prompt word, for example: {please recognize the text content in the audio}, {please identify the speaker in the audio}, {please recognize the emotion in the audio}, {please generate reply audio based on the audio content}, {please reply directly in text}, and so on; after tokenization, a Length x 1-dimensional text token feature sequence is obtained.
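A toy character-level tokenizer illustrates the Length x 1 shape; a real system would use a trained subword vocabulary, so the code-point mapping here is purely hypothetical:

```python
import numpy as np

def tokenize(prompt):
    """Encode a prompt string into a Length x 1 column of token ids,
    using character code points as a toy stand-in vocabulary."""
    ids = np.array([ord(ch) for ch in prompt], dtype=np.int64)
    return ids.reshape(-1, 1)

prompt_tokens = tokenize("Please recognize the text content in the audio")
print(prompt_tokens.shape)  # (46, 1)
```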
Step S104, a third text token feature sequence is obtained based on the first text token feature sequence and the second text token feature sequence;
in this embodiment, the first text token feature sequence and the second text token feature sequence are spliced to obtain a third text token feature sequence.
Specifically, the Length x 1-dimensional text token feature sequence and the T x 1-dimensional text token feature sequence are spliced to obtain a (Length + T) x 1-dimensional text token feature sequence.
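At the shape level, the splice can be sketched as follows (toy token ids and assumed dimensions):

```python
import numpy as np

LENGTH, T = 8, 50
second_tokens = np.arange(LENGTH).reshape(LENGTH, 1)   # prompt token ids, Length x 1
first_tokens = np.zeros((T, 1), dtype=np.int64)        # aligned audio token ids, T x 1

# Splice prompt tokens and aligned audio tokens into a (Length + T) x 1 sequence.
third_tokens = np.concatenate([second_tokens, first_tokens], axis=0)
print(third_tokens.shape)  # (58, 1)
```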
Step S105, the large language model determines a processing strategy for the third text token feature sequence based on the prompt word type.
In this embodiment, the large language model is a deep learning model trained on a large amount of text data, where the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised learning.
In this embodiment, the large language model determines the prompt word type based on information associated with the prompt word text, where the prompt word type includes at least one of the following: a recognition type and an audio output type.
Specifically, the large language model determines a processing strategy for the (Length + T) x 1-dimensional text token feature sequence based on the Prompt type (prompt word type):
If the prompt word type is a recognition type (for example {please recognize the emotion in the audio} or {please reply directly in text}), a deserialization operation is performed on the (Length + T) x 1-dimensional text token feature sequence by the large language model, and text reply content is output.
If the prompt word type is an audio output type (for example {please generate reply audio based on the audio content}), the (Length + T) x 1-dimensional text token feature sequence is processed by the large language model to obtain a fourth text token feature sequence;
an audio feature sequence aligned with the fourth text token feature sequence is then acquired through the text-speech alignment module; the text-speech alignment module is configured to realize the alignment mapping from the text token feature sequence to the audio feature sequence. For example, the text-speech alignment module yields a T1 x D-dimensional audio feature sequence aligned with the fourth text token feature sequence, where T1 is the number of output audio frames.
The text-speech alignment module needs to be set up and trained separately; it receives the text token feature sequence output by the large language model and outputs an audio feature sequence.
Through this text-speech alignment, correct pronunciation and prosody sequence prediction can be completed directly based on the prior knowledge of the large language model, without the large amount of manual maintenance otherwise required for front-end text normalization, pronunciation, and prosody prediction models.
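The two processing branches can be sketched together as a dispatch on the prompt word type; `process`, its toy code-point deserialization, and the single-row projection standing in for the large language model pass and the text-speech alignment module are all hypothetical:

```python
import numpy as np

def process(tokens, prompt_type, dim=16, seed=0):
    """Dispatch on the prompt word type: return text reply content for a
    recognition-type prompt, or a frame-level audio feature sequence for
    an audio-output-type prompt (all components are toy stand-ins)."""
    rng = np.random.default_rng(seed)
    if prompt_type == "recognition":
        # Deserialize the token ids back into text (toy code-point mapping).
        return "".join(chr(int(t)) for t in tokens.ravel())
    if prompt_type == "audio":
        # Stand-in for the large language model pass: the identity map,
        # yielding the "fourth" text token feature sequence.
        fourth_tokens = tokens
        # Stand-in for the text-speech alignment module: project each
        # token id to a dim-dimensional audio feature vector.
        projection = rng.standard_normal((1, dim))
        return fourth_tokens.astype(float) @ projection
    raise ValueError(f"unknown prompt word type: {prompt_type}")

tokens = np.array([[72], [105]])                 # token ids for "Hi"
print(process(tokens, "recognition"))            # Hi
print(process(tokens, "audio").shape)            # (2, 16)
```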
In this embodiment, the target voice information is output based on the aligned audio feature sequence.
Specifically, based on the aligned T1 x D-dimensional audio feature sequence, the target voice information is output through the vocoder.
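A vocoder pass can be sketched as decoding each frame's features back into a hop of waveform samples; the random linear decoder below is a hypothetical stand-in for a trained neural vocoder, and the hop size is an assumed value:

```python
import numpy as np

def vocode(audio_features, hop=160, seed=0):
    """Decode a T x D audio feature sequence into a waveform of T * hop
    samples via a per-frame linear decoder (a stand-in for a trained
    neural vocoder)."""
    rng = np.random.default_rng(seed)
    d = audio_features.shape[1]
    decoder = rng.standard_normal((d, hop))
    return (audio_features @ decoder).ravel()   # (T * hop,)

feats = np.random.default_rng(3).standard_normal((100, 16))  # toy aligned features
wave = vocode(feats)
print(wave.shape)  # (16000,)
```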
In this embodiment, the first text token feature sequence, the second text token feature sequence, and the third text token feature sequence are all presented in the form of a token sequence.
In the field of natural language processing, machine learning models typically take tokens as their input units. A token can be understood as the smallest unit of text: a word, a number, a punctuation mark, a single letter, or any other single element of text analysis.
Through the fused alignment of the end-to-end large speech model and the large language model, information is shared across modalities, completing end-to-end speech-to-speech output; the interaction process no longer needs the intermediate ASR recognition result, reducing cascade errors.
The invention further provides a large-model-based intelligent interaction system according to one embodiment, whose main structure is shown in the schematic block diagram of FIG. 2. As shown in fig. 2, the system mainly comprises: a first acquisition module 201, an alignment module 202, a second acquisition module 203, a third acquisition module 204, and a processing module 205;
the first acquisition module 201 is configured to extract an audio feature sequence of the voice information through the large speech model and transmit the audio feature sequence to the speech-text alignment module;
the alignment module 202 is configured to acquire, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
the second acquisition module 203 is configured to acquire, through the tokenization module, a second text token feature sequence corresponding to the input prompt word text;
the third acquisition module 204 is configured to acquire a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence;
the processing module 205 is configured to determine, through the large language model, a processing strategy for the third text token feature sequence based on the prompt word type.
In some embodiments, one or more of the first acquisition module 201, the alignment module 202, the second acquisition module 203, the third acquisition module 204, and the processing module 205 may be combined into a single module. In one embodiment, the specific functions implemented may be understood with reference to the steps of the method embodiment described above.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
The technical scheme of the invention comprises the following steps: extracting an audio feature sequence of voice information through a large speech model and transmitting the audio feature sequence to a speech-text alignment module; acquiring, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence; acquiring, through a tokenization module, a second text token feature sequence corresponding to the input prompt word text; acquiring a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence; and determining, by the large language model, a processing strategy for the third text token feature sequence based on the prompt word type. Because the large speech model extracts the voice information, higher-level semantics are retained; meanwhile, because the speech-text alignment module produces a first text token feature sequence aligned with the audio feature sequence, finer, lower-level speech characteristics remain available to the large language model during processing, so that the user's intention is recognized more accurately and the user's human-computer interaction experience is improved.
Further, the scheme can skip intermediate results such as the ASR recognition result and the TTS text front-end conversion, is no longer limited by the accuracy of those two modules, and can directly perform end-to-end speech-to-speech interaction, thereby improving the overall interaction effect.
Further, by using different prompt texts (prompt words), the end-to-end multimodal human-computer interaction system can simultaneously perform tasks such as single-modality text interaction, ASR speech recognition, voiceprint recognition, and emotion recognition, thereby unifying NLP text tasks and speech tasks under the framework of the system.
The foregoing large-model-based intelligent interaction system is used to execute the large-model-based intelligent interaction method embodiment shown in fig. 2; the technical principles of the two are similar, and the technical problems solved and the technical effects produced are likewise similar. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process and related description of the large-model-based intelligent interaction system may refer to the description of the large-model-based intelligent interaction method embodiment, and will not be repeated herein.
It will be appreciated by those skilled in the art that all or part of the flow of the methods of the above embodiments of the present invention may be implemented by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content included in the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable storage media do not include electrical carrier signals and telecommunications signals.
Further, the present invention also provides a computer device comprising a processor and a storage apparatus. The storage apparatus may be configured to store a program for executing the large-model-based intelligent interaction method of the above method embodiment, and the processor may be configured to execute the program in the storage apparatus, including, but not limited to, the program for executing the large-model-based intelligent interaction method of the above method embodiment. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments of the present invention. The computer device may be formed of various electronic devices.
Further, the invention also provides a computer-readable storage medium. In one embodiment of a computer-readable storage medium according to the present invention, the computer-readable storage medium may be configured to store a program for performing the large-model-based intelligent interaction method of the above method embodiment, and the program may be loaded and executed by a processor to implement the large-model-based intelligent interaction method described above. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments of the present invention. The computer-readable storage medium may be a storage apparatus formed of various electronic devices; optionally, the computer-readable storage medium in the embodiments of the present invention is a non-transitory computer-readable storage medium.
Further, it should be understood that, since the modules are merely set up to illustrate the functional units of the system of the present invention, the physical devices corresponding to the modules may be the processor itself, or part of the software, part of the hardware, or part of a combination of software and hardware in the processor. Accordingly, the number of modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the modules in the system may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solution to deviate from the principle of the present invention; therefore, technical solutions after such splitting or combining all fall within the protection scope of the present invention.
Thus far, the technical solution of the present invention has been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily appreciate that the protection scope of the present invention is not limited to these specific embodiments. Equivalent modifications or substitutions of the related technical features may be made by those skilled in the art without departing from the principles of the present invention, and the technical solutions after such modifications or substitutions will fall within the protection scope of the present invention.
Claims (15)
1. An intelligent interaction method based on a large model is characterized by comprising the following steps:
extracting an audio feature sequence of voice information through a large speech model and transmitting the audio feature sequence to a speech-text alignment module;
acquiring, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
acquiring, through a tokenization module, a second text token feature sequence corresponding to the input prompt word text;
acquiring a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence;
determining, by a large language model, a processing strategy for the third text token feature sequence based on the prompt word type.
2. The method of claim 1, wherein the tokenization module encodes any piece of text into a token feature sequence.
3. The method of claim 1, wherein the speech-text alignment module is configured to perform alignment mapping from the audio feature sequence to the text token feature sequence.
4. The method of claim 1, wherein the first text token feature sequence and the second text token feature sequence are spliced to obtain the third text token feature sequence.
5. The method of claim 1, wherein the prompt word type includes at least one of: an identification type and an audio output type.
6. The method of claim 5, wherein, if the prompt word type is an identification type, a deserialization operation is performed on the third text token feature sequence through the large language model, and text reply content is output.
7. The method of claim 5, wherein, if the prompt word type is an audio output type, the third text token feature sequence is processed through the large language model to obtain a fourth text token feature sequence.
8. The method of claim 7, wherein an audio feature sequence aligned with the fourth text token feature sequence is acquired through a text-speech alignment module; the text-speech alignment module is configured to perform alignment mapping from the text token feature sequence to the audio feature sequence.
9. The method of claim 8, wherein target speech information is output based on the aligned audio feature sequence.
10. The method of claim 1, wherein the large speech model is a deep learning model trained on a large amount of speech data, wherein the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised;
the large language model is a deep learning model trained on a large amount of text data, wherein the training method comprises at least one of the following: supervised, self-supervised, and semi-supervised.
11. The method according to claim 1 or 10, wherein a frame-level high-dimensional feature sequence of the input speech is extracted by the large speech model as the audio feature sequence.
12. The method of claim 11, wherein the audio feature sequence carries audio content information, audio emotion information, audio prosody information, audio voiceprint information, audio scene information, and audio event information.
13. A large-model-based intelligent interaction system, comprising: a first acquisition module, an alignment module, a second acquisition module, a third acquisition module, and a processing module;
the first acquisition module is configured to extract an audio feature sequence of voice information through a large speech model and transmit the audio feature sequence to a speech-text alignment module;
the alignment module is configured to acquire, through the speech-text alignment module, a first text token feature sequence aligned with the audio feature sequence;
the second acquisition module is configured to acquire, through a tokenization module, a second text token feature sequence corresponding to the input prompt word text;
the third acquisition module is configured to acquire a third text token feature sequence based on the first text token feature sequence and the second text token feature sequence; and
the processing module is configured to determine, through a large language model, a processing strategy for the third text token feature sequence based on the prompt word type.
14. A computer device comprising a processor and a memory, wherein a program is stored in the memory, characterized in that the processor implements the method of any one of claims 1 to 12 when executing the program.
15. A computer-readable storage medium storing a program, characterized in that the program, when executed, implements the method of any one of claims 1 to 12.
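The branching behavior described in claims 5 through 9 can be sketched as follows. This is an illustrative sketch only, not the patented implementation: all function names and dimensions are hypothetical, and the large language model and the text-speech alignment module are replaced by trivial random-projection stand-ins, so that only the control flow and the sequence shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)

def llm_decode_text(third_seq):
    # Stand-in for claim 6: for an identification-type prompt, the large
    # language model deserializes the sequence and outputs text reply content.
    return f"reply({third_seq.shape[0]} tokens)"

def llm_to_fourth_sequence(third_seq):
    # Stand-in for claim 7: for an audio-output-type prompt, the large
    # language model produces a fourth text token feature sequence.
    return rng.normal(size=third_seq.shape)

def align_text_to_speech(fourth_seq, audio_dim=256):
    # Stand-in for claim 8: the text-speech alignment module maps the
    # text token feature sequence back into the audio feature space.
    proj = rng.normal(size=(fourth_seq.shape[1], audio_dim))
    return fourth_seq @ proj

def process(third_seq, prompt_type):
    # Dispatch on the prompt word type (claim 5).
    if prompt_type == "identification":
        return llm_decode_text(third_seq)
    elif prompt_type == "audio_output":
        fourth_seq = llm_to_fourth_sequence(third_seq)
        # Claim 9: the aligned audio feature sequence is the basis for
        # outputting the target speech information.
        return align_text_to_speech(fourth_seq)
    raise ValueError(f"unknown prompt type: {prompt_type}")

third_seq = rng.normal(size=(20, 512))
print(process(third_seq, "identification"))
print(process(third_seq, "audio_output").shape)
```

The identification branch ends in text, while the audio-output branch ends in an audio feature sequence from which target speech would be synthesized; both branches consume the same spliced third sequence.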
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410104210.9A CN117854492A (en) | 2024-01-24 | 2024-01-24 | Intelligent interaction method, system, equipment and storage medium based on large model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410104210.9A CN117854492A (en) | 2024-01-24 | 2024-01-24 | Intelligent interaction method, system, equipment and storage medium based on large model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117854492A true CN117854492A (en) | 2024-04-09 |
Family
ID=90534630
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410104210.9A Pending CN117854492A (en) | 2024-01-24 | 2024-01-24 | Intelligent interaction method, system, equipment and storage medium based on large model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117854492A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118098274A (en) * | 2024-04-19 | 2024-05-28 | 腾讯科技(深圳)有限公司 | Model training method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11295721B2 (en) | Generating expressive speech audio from text data | |
WO2022083083A1 (en) | Sound conversion system and training method for same | |
Bell et al. | Prosodic adaptation in human-computer interaction | |
WO2017076222A1 (en) | Speech recognition method and apparatus | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN111667812A (en) | Voice synthesis method, device, equipment and storage medium | |
CN111477216A (en) | Training method and system for pronunciation understanding model of conversation robot | |
CN117854492A (en) | Intelligent interaction method, system, equipment and storage medium based on large model | |
CN112581963A (en) | Voice intention recognition method and system | |
CN112735404A (en) | Ironic detection method, system, terminal device and storage medium | |
CN113744727A (en) | Model training method, system, terminal device and storage medium | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium | |
CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
CN116978381A (en) | Audio data processing method, device, computer equipment and storage medium | |
CN113628608A (en) | Voice generation method and device, electronic equipment and readable storage medium | |
CN115966197A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
EP4068279B1 (en) | Method and system for performing domain adaptation of end-to-end automatic speech recognition model | |
CN113299270B (en) | Method, device, equipment and storage medium for generating voice synthesis system | |
CN113257225B (en) | Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics | |
CN113160801B (en) | Speech recognition method, device and computer readable storage medium | |
CN115547345A (en) | Voiceprint recognition model training and related recognition method, electronic device and storage medium | |
CN114170997A (en) | Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment | |
CN114299989A (en) | Voice filtering method and device, electronic equipment and storage medium | |
CN113920997A (en) | Voice awakening method and device, electronic equipment and operating machine | |
CN115424616A (en) | Audio data screening method, device, equipment and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||