CN118280370A - Digital person generation method and system through voice real-time interaction
- Publication number: CN118280370A
- Application number: CN202410465020.XA
- Authority: CN (China)
- Legal status: Pending
Abstract
The application provides a digital person generation method and system for real-time voice interaction. The technical scheme comprises the following steps: first, real-time interactive audio is input; second, a streaming speech recognition model performs audio recognition; the recognized text is then input into a large model, which outputs corresponding results in a streaming manner; the results are processed and input into a speech synthesis model for semi-streaming output; finally, the output speech is input into a corresponding rendering model to obtain the final interaction result. By using streaming techniques combined with corresponding text-processing algorithms, the technical scheme of the application achieves smooth real-time voice interaction with a digital person.
Description
Technical Field
The invention relates to the field of digital person generation, and in particular to a digital person generation method based on real-time voice interaction.
Background
With the advent of ChatGPT, generative artificial intelligence (AIGC) has attracted a great deal of continuing attention. Generative AI is mainly used to generate related content with existing AI techniques, commonly including text generation, speech generation, image generation, and video generation. AIGC covers many different problems and applications; one common application is the digital person. Digital persons can be divided into two main categories, 2D and 3D, and the present invention relates to 2D digital persons.
A 2D digital person is a digital person presented mainly in the form of 2D images; producing one usually requires a video clip in which the person is cloned by corresponding algorithmic techniques. With the development of AI technology, 2D digital person techniques have advanced rapidly and can currently be divided into three main types: generative adversarial networks, neural radiance fields, and diffusion models. All three can produce high-quality 2D digital persons, but achieving real-time voice-interactive digital persons remains a challenge. To address this problem, the application provides a digital person generation method that achieves smooth real-time voice interaction by means of streaming techniques and corresponding text algorithms.
Disclosure of Invention
Aiming at the above problems, the invention provides a digital person generation method for real-time voice interaction, which realizes a real-time voice-interactive digital person by using streaming algorithm techniques combined with corresponding text-processing algorithms.
To achieve this purpose, the technical solution of the present invention is a digital person generation method for real-time voice interaction, comprising:
Acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
Further, real-time interactive audio is input. The input audio may be collected in real time by a corresponding sound-pickup device, and the collected audio may be processed in real time to convert the sound into single-channel audio data with a fixed sampling rate. The sampling rate is the number of data points collected per second; as an example, at a typical sampling rate of 16000 (16 kHz), each second of audio contains 16000 samples. A fixed sampling rate means that the number of samples per second is constant.
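As an illustrative sketch only (the patent does not prescribe an implementation), the conversion to fixed-rate single-channel audio could look as follows; the use of numpy, channel averaging, and linear interpolation are assumptions:

```python
import numpy as np

def to_mono_fixed_rate(samples: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    """Convert captured audio to single-channel data at a fixed sampling rate.

    samples: shape (n,) for mono or (n, channels) for multi-channel input.
    Linear interpolation is used here as one simple resampling choice; the
    patent only says an interpolation algorithm is applied.
    """
    if samples.ndim == 2:                        # average channels -> mono
        samples = samples.mean(axis=1)
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate   # original sample times (s)
    dst_t = np.arange(n_out) / dst_rate          # target sample times (s)
    return np.interp(dst_t, src_t, samples)      # resampled mono audio
```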
Further, a streaming speech recognition model is used for audio recognition. The speech recognition model is a trained deep model that can recognize the processed audio data in real time.
Further, streaming speech recognition means that speech is fed into the model in real time in very small time segments, and the model recognizes each small segment as it arrives. The model also records the speech data already input, so that previously input audio helps it recognize the current audio data more accurately. When the audio stream ends, the model recognizes the complete audio stream once more and calibrates the final recognition result.
Further, the text content recognized in the previous step is input into a corresponding large model, which produces streaming output. Streaming output means that the large model streams back results a prescribed number of words at a time, typically 1 or 2 words. The large model in this step is a language model with a large number of parameters trained on a large amount of data; after the corresponding text is input, the large model returns the corresponding answer.
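A minimal sketch of this streaming behavior is shown below; `client.chat(...)` and its `stream=True` flag are hypothetical stand-ins, since the patent does not name a concrete large-model interface:

```python
from typing import Iterator

def stream_reply(client, prompt: str) -> Iterator[str]:
    """Yield the large model's answer in small pieces (e.g. 1-2 words at a time).

    `client` is a hypothetical large-model client exposing a streaming chat
    call; the real interface depends on the deployed model.
    """
    for piece in client.chat(prompt, stream=True):  # assumed streaming API
        yield piece
```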
Further, the results of the foregoing steps are processed and then input into the speech synthesis model, and semi-streaming output is performed.
Further, the result in the foregoing step is obtained from the large model in a streaming manner: the streamed results are first stored, then a judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage.
Further, since the large model results are output in a streaming manner, this operation is performed continuously and the stored results are continuously updated.
Further, whether the stored result meets the requirements is judged from three aspects: the length of the currently stored text; whether punctuation marks exist in the stored content; and whether the stored content covers a complete phrase.
Further, the output speech is input into a corresponding rendering model to obtain the final interaction result. The output speech is fed to the rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner. The rendering model here is a model that can generate a corresponding video from speech, typically a deep model.
Based on the above digital person generation method through real-time voice interaction, and in order to better implement the present invention, a digital person generation system for real-time voice interaction is further provided, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record the input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module is used to answer according to the input question, the relevant configured knowledge, and so on, and to output the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
The beneficial effects are as follows:
In the audio recognition process, a streaming speech recognition model is used. Specifically, the speech is cut into extremely small pieces and input one by one into the speech recognition model. The model memorizes and uses the previously input speech segment information, improving the recognition accuracy of the current speech segment. Since the result of the large model is output in a streaming manner, the above operation is performed continuously and the stored recognition result is updated in real time. The whole recognition process is more efficient and accurate, and the method is suitable for various real-time speech recognition application scenarios.
Drawings
FIG. 1 is a flow chart of a digital person generation method with real-time voice interaction;
FIG. 2 is a block diagram of one embodiment of a digital person generation system interacting in real time through speech;
FIG. 3 is a flow chart of the speech semi-streaming output provided by the digital person generation method of real-time voice interaction.
Detailed Description
Exemplary embodiments will be described in more detail below with reference to the accompanying drawings. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit it in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept; these all fall within the scope of protection of the present invention.
The embodiment of the application provides a digital person generation method through real-time voice interaction, illustrated below with reference to the embodiment and the accompanying drawings. With reference to FIG. 1, the application proposes a digital person generation method through real-time voice interaction, the method comprising the following steps:
Acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
Assume that the audio stream is obtained in real time by the device and converted into single-channel data x1, x2, ..., xi, where xi denotes the data of the i-th sample point; the single-channel audio stream is then converted into audio data y1, y2, ... with a fixed sampling rate by an interpolation algorithm.
The processed audio stream y1, y2, ... is segmented into very small time segments, e.g. 80 ms. When the collected audio stream is longer than 80 ms, it is divided into several 80 ms segments plus one segment shorter than 80 ms, and the last segment is padded to 80 ms with zero data points. For example, when the collected audio stream is only 90 ms, it is divided into 80 ms and 10 ms, and the 10 ms of audio is zero-padded into an 80 ms segment. Each segment is then input into the trained deep model for speech recognition. The above steps are repeated until the current audio stream ends. Meanwhile, as audio segments are input into the deep model, the model records the input audio stream, and the previously input audio improves the overall recognition of the current latest segment. When the audio stream ends, the model performs an overall recognition of the whole stream and calibrates the recognition result. Note in particular that the length of the audio stream recorded by the model has a maximum limit; the recorded audio stream is generally divided into sentences according to silences, the earlier audio contained in the record is deleted, and only the audio of the last sentence after division is retained.
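The 80 ms segmentation with zero padding described above can be sketched as follows, assuming 16 kHz single-channel input; `recognizer` is a hypothetical placeholder for the trained streaming deep model:

```python
import numpy as np

RATE = 16000                     # assumed fixed sampling rate (samples/s)
CHUNK = int(0.080 * RATE)        # 80 ms -> 1280 samples at 16 kHz

def chunk_audio(stream: np.ndarray) -> list[np.ndarray]:
    """Split the audio stream into 80 ms chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(stream), CHUNK):
        piece = stream[start:start + CHUNK]
        if len(piece) < CHUNK:                 # e.g. a trailing 10 ms remainder
            piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece)
    return chunks

# Hypothetical usage with a stateful streaming recognizer:
# for piece in chunk_audio(buffer):
#     partial = recognizer.accept_chunk(piece)  # placeholder API
# final_text = recognizer.finalize()            # full-stream calibration pass
```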
The recognized text is input into a large model, which answers according to the input question, the relevant configured knowledge, and so on, and outputs the result in a streaming manner, most commonly 1 word at a time.
The result is input into a speech synthesis model and output in a semi-streaming manner. This step first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage.
The method specifically comprises the following steps:
Whether the stored result meets the requirements is judged from three aspects: first, the length of the currently stored text; second, whether punctuation marks exist in the stored content; third, whether the stored content covers a complete phrase.
If the length of the stored content meets the limit, the stored content is input into the speech synthesis model;
if the length of the stored content does not meet the limit, the stored content is divided so that each subsection covers a complete phrase, and each subsection that then meets the length limit is input into the speech synthesis model in turn.
In this step, speech synthesis is not performed at the word level, e.g. 1 or 2 words at a time, because the synthesized speech needs to maintain consistency and naturalness across segments; synthesizing in units of 1 or 2 words would harm the overall front-to-back consistency of the speech.
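A minimal sketch of this semi-streaming buffering is given below under stated assumptions: the punctuation set, the length limit, and the `synthesize(...)` callable are illustrative placeholders, as the patent fixes none of these concretely:

```python
import re

SENTENCE_END = set("。！？；，.!?;,")   # assumed qualifying punctuation marks
MAX_LEN = 50                            # assumed length limit per synthesis call

class SemiStreamBuffer:
    """Store streamed large-model text; release phrase-complete chunks to TTS."""

    def __init__(self, synthesize):
        self.synthesize = synthesize    # callable: text -> audio (placeholder TTS)
        self.stored = ""

    def feed(self, piece: str) -> list:
        """Store one streamed piece; return zero or more synthesized audio chunks."""
        self.stored += piece
        # Judge further only when the newest piece brings a qualifying punctuation mark.
        if not any(ch in SENTENCE_END for ch in piece):
            return []
        if len(self.stored) <= MAX_LEN:
            text, self.stored = self.stored, ""
            return [self.synthesize(text)]
        # Over the limit: split after punctuation so each part covers complete phrases.
        parts = [p for p in re.split(r"(?<=[。！？；，.!?;,])", self.stored) if p]
        self.stored = ""
        return [self.synthesize(p) for p in parts]
```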
The output speech is input into a corresponding rendering model to obtain the final result. Specifically, the output speech is fed to the corresponding rendering model in a streaming manner, and the final result is output in a streaming manner. The rendering model is a model that generates a corresponding video from the speech; a deep model, for example, may be adopted.
The embodiment provides a digital person generation system that performs real-time interaction through voice, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record the input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module is used to answer according to the input question, the relevant configured knowledge, and so on, and to output the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
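For illustration only, the four modules can be chained as in the sketch below; `asr`, `llm`, `tts_buffer`, and `renderer` are hypothetical objects, since the patent does not prescribe concrete module interfaces:

```python
def run_pipeline(asr, llm, tts_buffer, renderer, audio_chunks):
    """End-to-end sketch: streaming ASR -> streaming LLM -> semi-streaming TTS -> rendering."""
    for chunk in audio_chunks:                  # e.g. 80 ms pieces of microphone audio
        asr.accept_chunk(chunk)                 # incremental recognition with context
    question = asr.finalize()                   # calibrated full-stream recognition
    for piece in llm.stream_answer(question):   # streamed reply, e.g. 1-2 words at a time
        for speech in tts_buffer.feed(piece):   # phrase-complete synthesized audio
            for frame in renderer.render(speech):
                yield frame                     # streamed digital-person video frames
```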
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may modify the technical solutions described therein or make equivalent substitutions for some or all of the technical features; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the appended claims and description.
Claims (10)
1. A digital person generation method for real-time interaction by voice, comprising:
acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
2. The digital person generation method of real-time voice interaction according to claim 1, wherein: the real-time interactive audio is collected in real time by a corresponding sound-pickup device;
and the collected audio is processed in real time and converted into single-channel audio data with a fixed sampling rate.
3. The digital person generation method of real-time voice interaction according to claim 1, wherein: the speech segments described in the preceding steps are recorded using the streaming speech recognition model.
4. The digital person generation method according to claim 3, wherein the speech recognition model records previously input speech segments, specifically comprising: the length of the audio stream recorded by the model has a maximum limit; the recorded audio stream is divided into sentences according to silence pauses in the audio stream, and only the audio information of the last sentence after division is retained.
5. The method for generating a digital person through real-time voice interaction according to claim 1, wherein the result is processed and then input into the speech synthesis model for semi-streaming output, and the input result is stored and judged.
6. The method for generating a digital person through real-time voice interaction according to claim 5, wherein the input result is judged from the following three aspects: the length of the currently stored text; whether punctuation marks exist in the stored content; and whether the stored content covers a complete phrase.
7. The method for generating a digital person through real-time voice interaction according to claim 6, wherein the specific judgment logic is: the streaming result is stored, and when a punctuation mark meeting the requirements exists in the most recently stored result, the next judgment is performed.
8. The method for generating a digital person through real-time voice interaction according to claim 7, wherein the next judgment comprises:
if the length of the stored content meets the limit, inputting the stored content into the speech synthesis model;
if the length of the stored content does not meet the limit, dividing the stored content so that each subsection covers a complete phrase, and inputting each subsection after division into the speech synthesis model in turn.
9. The method for generating a digital person through real-time voice interaction according to claim 1, wherein inputting the output speech into the corresponding rendering model to obtain the final interaction result comprises: the rendering model is a model that can generate a corresponding video from speech.
10. A digital person generation system that interacts in real time through speech, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module answers according to the input question, the relevant configured knowledge, and so on, and outputs the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
Priority Applications (1)
- CN202410465020.XA, filed 2024-04-17, priority date 2024-04-17: Digital person generation method and system through voice real-time interaction

Publications (1)
- CN118280370A, published 2024-07-02, status: Pending

Family ID: 91633485
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination