CN118280370A - Digital person generation method and system through voice real-time interaction
- Publication number: CN118280370A
- Application number: CN202410465020.XA
- Authority: CN (China)
- Legal status: Pending
Abstract
The application provides a digital person generation method and system for real-time voice interaction. The technical scheme comprises the following steps: first, real-time interactive audio is input; second, a streaming speech recognition model performs audio recognition; the recognized text is then input into a large model, which outputs corresponding results in a streaming manner; the results are processed and input into a speech synthesis model for semi-streaming output; finally, the output speech is input into a corresponding rendering model to obtain the final interaction result. By using streaming techniques combined with corresponding text-processing algorithms, the technical scheme of the application achieves smooth real-time voice interaction with a digital person.
Description
Technical Field
The invention relates to the field of digital person generation, and in particular to a digital person generation method based on real-time voice interaction.
Background
With the advent of ChatGPT, generative artificial intelligence (AIGC) has attracted a great deal of continuing attention. Generative AI is mainly used to generate related content with existing AI techniques, commonly including text generation, speech generation, image generation, and video generation. AIGC covers many different problems and applications; one common application is the digital person. Digital persons can be divided into two main categories, 2D and 3D, and the present invention relates to 2D digital persons.
A 2D digital person is a digital person presented mainly in the form of 2D images; producing one usually requires a video clip in which the person is cloned by corresponding algorithmic techniques. With the development of AI technology, 2D digital person techniques have advanced rapidly and can currently be divided into three main types: generative adversarial networks, neural radiance fields, and diffusion models. All three can produce high-quality 2D digital persons, but achieving real-time voice-interactive digital persons remains a challenge. To address this problem, the application provides a digital person generation method that achieves smooth real-time voice interaction by means of streaming techniques and corresponding text algorithms.
Disclosure of Invention
Aiming at the above problems, the invention provides a digital person generation method for real-time voice interaction, which realizes a real-time voice-interactive digital person by using streaming algorithm techniques combined with corresponding text-processing algorithms.
To achieve this purpose, the technical solution of the present invention is a digital person generation method for real-time voice interaction, comprising:
Acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
Further, real-time interactive audio is input. The input audio may be collected in real time by a corresponding sound-pickup device, and the collected audio may be processed in real time to convert the sound into single-channel audio data with a fixed sampling rate. The sampling rate is the number of data points collected per second; as an example, at a typical sampling rate of 16000 (16 kHz), each second of audio contains 16000 samples. A fixed sampling rate means that the number of samples per second is constant.
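As an illustrative sketch only (the patent does not prescribe an implementation), the conversion to fixed-rate single-channel audio could look as follows; the use of numpy, channel averaging, and linear interpolation are assumptions:

```python
import numpy as np

def to_mono_fixed_rate(samples: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    """Convert captured audio to single-channel data at a fixed sampling rate.

    samples: shape (n,) for mono or (n, channels) for multi-channel input.
    Linear interpolation is used here as one simple resampling choice; the
    patent only says an interpolation algorithm is applied.
    """
    if samples.ndim == 2:                        # average channels -> mono
        samples = samples.mean(axis=1)
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate   # original sample times (s)
    dst_t = np.arange(n_out) / dst_rate          # target sample times (s)
    return np.interp(dst_t, src_t, samples)      # resampled mono audio
```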
Further, a streaming speech recognition model is used for audio recognition. The speech recognition model is a trained deep model that can recognize the processed audio data in real time.
Further, streaming speech recognition means that speech is fed into the model in real time in very small time segments, and the model recognizes each small segment as it arrives. The model also records the speech data already input, so that previously input audio helps it recognize the current audio data more accurately. When the audio stream ends, the model recognizes the complete audio stream once more and calibrates the final recognition result.
Further, the text content recognized in the previous step is input into a corresponding large model, which produces streaming output. Streaming output means that the large model streams back results a prescribed number of words at a time, typically 1 or 2 words. The large model in this step is a language model with a large number of parameters trained on a large amount of data; after the corresponding text is input, the large model returns the corresponding answer.
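A minimal sketch of this streaming behavior is shown below; `client.chat(...)` and its `stream=True` flag are hypothetical stand-ins, since the patent does not name a concrete large-model interface:

```python
from typing import Iterator

def stream_reply(client, prompt: str) -> Iterator[str]:
    """Yield the large model's answer in small pieces (e.g. 1-2 words at a time).

    `client` is a hypothetical large-model client exposing a streaming chat
    call; the real interface depends on the deployed model.
    """
    for piece in client.chat(prompt, stream=True):  # assumed streaming API
        yield piece
```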
Further, the results of the foregoing steps are processed and then input into the speech synthesis model, and semi-streaming output is performed.
Further, the result in the foregoing step is obtained from the large model in a streaming manner: the streamed results are first stored, then a judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage.
Further, since the large model results are output in a streaming manner, this operation is performed continuously and the stored results are continuously updated.
Further, whether the stored result meets the requirements is judged from three aspects: the length of the currently stored text; whether punctuation marks exist in the stored content; and whether the stored content covers a complete phrase.
Further, the output speech is input into a corresponding rendering model to obtain the final interaction result. The output speech is fed to the rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner. The rendering model here is a model that can generate a corresponding video from speech, typically a deep model.
Based on the above digital person generation method through real-time voice interaction, and in order to better implement the present invention, a digital person generation system for real-time voice interaction is further provided, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record the input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module is used to answer according to the input question, the relevant configured knowledge, and so on, and to output the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
The beneficial effects are as follows:
In the audio recognition process, a streaming speech recognition model is used. Specifically, the speech is cut into extremely small pieces and input one by one into the speech recognition model. The model memorizes and uses the previously input speech segment information, improving the recognition accuracy of the current speech segment. Since the result of the large model is output in a streaming manner, the above operation is performed continuously and the stored recognition result is updated in real time. The whole recognition process is more efficient and accurate, and the method is suitable for various real-time speech recognition application scenarios.
Drawings
FIG. 1 is a flow chart of a digital person generation method with real-time voice interaction;
FIG. 2 is a block diagram of one embodiment of a digital person generation system interacting in real time through speech;
FIG. 3 is a flow chart of the speech semi-streaming output provided by the digital person generation method of real-time voice interaction.
Detailed Description
Exemplary embodiments will be described in more detail below with reference to the accompanying drawings. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit it in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept; these all fall within the scope of protection of the present invention.
The embodiment of the application provides a digital person generation method through real-time voice interaction, illustrated below with reference to the embodiment and the accompanying drawings. With reference to FIG. 1, the application proposes a digital person generation method through real-time voice interaction, the method comprising the following steps:
Acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
Assume that the audio stream is obtained in real time by the device and converted into single-channel data x1, x2, ..., xi, where xi denotes the data of the i-th sample point; the single-channel audio stream is then converted into audio data y1, y2, ... with a fixed sampling rate by an interpolation algorithm.
The processed audio stream y1, y2, ... is segmented into very small time segments, e.g. 80 ms. When the collected audio stream is longer than 80 ms, it is divided into several 80 ms segments plus one segment shorter than 80 ms, and the last segment is padded to 80 ms with zero data points. For example, when the collected audio stream is only 90 ms, it is divided into 80 ms and 10 ms, and the 10 ms of audio is zero-padded into an 80 ms segment. Each segment is then input into the trained deep model for speech recognition. The above steps are repeated until the current audio stream ends. Meanwhile, as audio segments are input into the deep model, the model records the input audio stream, and the previously input audio improves the overall recognition of the current latest segment. When the audio stream ends, the model performs an overall recognition of the whole stream and calibrates the recognition result. Note in particular that the length of the audio stream recorded by the model has a maximum limit; the recorded audio stream is generally divided into sentences according to silences, the earlier audio contained in the record is deleted, and only the audio of the last sentence after division is retained.
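The 80 ms segmentation with zero padding described above can be sketched as follows, assuming 16 kHz single-channel input; `recognizer` is a hypothetical placeholder for the trained streaming deep model:

```python
import numpy as np

RATE = 16000                     # assumed fixed sampling rate (samples/s)
CHUNK = int(0.080 * RATE)        # 80 ms -> 1280 samples at 16 kHz

def chunk_audio(stream: np.ndarray) -> list[np.ndarray]:
    """Split the audio stream into 80 ms chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(stream), CHUNK):
        piece = stream[start:start + CHUNK]
        if len(piece) < CHUNK:                 # e.g. a trailing 10 ms remainder
            piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece)
    return chunks

# Hypothetical usage with a stateful streaming recognizer:
# for piece in chunk_audio(buffer):
#     partial = recognizer.accept_chunk(piece)  # placeholder API
# final_text = recognizer.finalize()            # full-stream calibration pass
```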
The recognized text is input into a large model, which answers according to the input question, the relevant configured knowledge, and so on, and outputs the result in a streaming manner, most commonly 1 word at a time.
The result is input into a speech synthesis model and output in a semi-streaming manner. This step first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage.
The method specifically comprises the following steps:
Whether the stored result meets the requirements is judged from three aspects: first, the length of the currently stored text; second, whether punctuation marks exist in the stored content; third, whether the stored content covers a complete phrase.
If the length of the stored content meets the limit, the stored content is input into the speech synthesis model;
if the length of the stored content does not meet the limit, the stored content is divided so that each subsection covers a complete phrase, and each subsection that then meets the length limit is input into the speech synthesis model in turn.
In this step, speech synthesis is not performed at the word level, e.g. 1 or 2 words at a time, because the synthesized speech needs to maintain consistency and naturalness across segments; synthesizing in units of 1 or 2 words would harm the overall front-to-back consistency of the speech.
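A minimal sketch of this semi-streaming buffering is given below under stated assumptions: the punctuation set, the length limit, and the `synthesize(...)` callable are illustrative placeholders, as the patent fixes none of these concretely:

```python
import re

SENTENCE_END = set("。！？；，.!?;,")   # assumed qualifying punctuation marks
MAX_LEN = 50                            # assumed length limit per synthesis call

class SemiStreamBuffer:
    """Store streamed large-model text; release phrase-complete chunks to TTS."""

    def __init__(self, synthesize):
        self.synthesize = synthesize    # callable: text -> audio (placeholder TTS)
        self.stored = ""

    def feed(self, piece: str) -> list:
        """Store one streamed piece; return zero or more synthesized audio chunks."""
        self.stored += piece
        # Judge further only when the newest piece brings a qualifying punctuation mark.
        if not any(ch in SENTENCE_END for ch in piece):
            return []
        if len(self.stored) <= MAX_LEN:
            text, self.stored = self.stored, ""
            return [self.synthesize(text)]
        # Over the limit: split after punctuation so each part covers complete phrases.
        parts = [p for p in re.split(r"(?<=[。！？；，.!?;,])", self.stored) if p]
        self.stored = ""
        return [self.synthesize(p) for p in parts]
```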
The output speech is input into a corresponding rendering model to obtain the final result. Specifically, the output speech is fed to the corresponding rendering model in a streaming manner, and the final result is output in a streaming manner. The rendering model is a model that generates a corresponding video from the speech; a deep model, for example, may be adopted.
The embodiment provides a digital person generation system that performs real-time interaction through voice, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record the input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module is used to answer according to the input question, the relevant configured knowledge, and so on, and to output the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
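For illustration only, the four modules can be chained as in the sketch below; `asr`, `llm`, `tts_buffer`, and `renderer` are hypothetical objects, since the patent does not prescribe concrete module interfaces:

```python
def run_pipeline(asr, llm, tts_buffer, renderer, audio_chunks):
    """End-to-end sketch: streaming ASR -> streaming LLM -> semi-streaming TTS -> rendering."""
    for chunk in audio_chunks:                  # e.g. 80 ms pieces of microphone audio
        asr.accept_chunk(chunk)                 # incremental recognition with context
    question = asr.finalize()                   # calibrated full-stream recognition
    for piece in llm.stream_answer(question):   # streamed reply, e.g. 1-2 words at a time
        for speech in tts_buffer.feed(piece):   # phrase-complete synthesized audio
            for frame in renderer.render(speech):
                yield frame                     # streamed digital-person video frames
```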
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may modify the technical solutions described therein or make equivalent substitutions for some or all of the technical features; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the appended claims and description.
Claims (10)
1. A digital person generation method for real-time interaction by voice, comprising:
acquiring input real-time interactive audio;
recognizing the interactive audio with a streaming speech recognition model to obtain recognized text;
inputting the recognized text into a large model, which outputs corresponding results in a streaming manner;
processing the results and inputting them into a speech synthesis model for semi-streaming output;
and inputting the output speech into a corresponding rendering model to obtain the final interaction result.
2. The digital person generation method of real-time voice interaction according to claim 1, wherein: the real-time interactive audio is collected in real time by a corresponding sound-pickup device;
and the collected audio is processed in real time and converted into single-channel audio data with a fixed sampling rate.
3. The digital person generation method of real-time voice interaction according to claim 1, wherein: the speech segments described in the preceding steps are recorded using the streaming speech recognition model.
4. The digital person generation method according to claim 3, wherein the speech recognition model records previously input speech segments, specifically comprising: the length of the audio stream recorded by the model has a maximum limit; the recorded audio stream is divided into sentences according to silence pauses in the audio stream, and only the audio information of the last sentence after division is retained.
5. The method for generating a digital person through real-time voice interaction according to claim 1, wherein the result is processed and then input into the speech synthesis model for semi-streaming output, and the input result is stored and judged.
6. The method for generating a digital person through real-time voice interaction according to claim 5, wherein the input result is judged from the following three aspects: the length of the currently stored text; whether punctuation marks exist in the stored content; and whether the stored content covers a complete phrase.
7. The method for generating a digital person through real-time voice interaction according to claim 6, wherein the specific judgment logic is: the streaming result is stored, and when a punctuation mark meeting the requirements exists in the most recently stored result, the next judgment is performed.
8. The method for generating a digital person through real-time voice interaction according to claim 7, wherein the next judgment comprises:
if the length of the stored content meets the limit, inputting the stored content into the speech synthesis model;
if the length of the stored content does not meet the limit, dividing the stored content so that each subsection covers a complete phrase, and inputting each subsection after division into the speech synthesis model in turn.
9. The method for generating a digital person through real-time voice interaction according to claim 1, wherein inputting the output speech into the corresponding rendering model to obtain the final interaction result comprises: the rendering model is a model that can generate a corresponding video from speech.
10. A digital person generation system that interacts in real time through speech, comprising: a speech recognition module, a large model module, a speech synthesis module, and a rendering module;
the speech recognition module is used to record input speech segments; the length of the audio stream recorded by the model has a maximum limit, the recorded audio stream may be divided into sentences according to silences in the audio stream, and only the audio information of the last sentence after division is retained;
the large model module answers according to the input question, the relevant configured knowledge, and so on, and outputs the result in a streaming manner;
the speech synthesis module first obtains the content of the previous stage in a streaming manner and stores it; when the most recently obtained streaming result contains a punctuation mark meeting the requirements, the next judgment is made; when the stored result meets the requirements, that portion of content is input into the corresponding speech synthesis model, its speech is synthesized, and the speech is output to the next stage;
and the output speech is input to the corresponding rendering model in a streaming manner, and the rendering model outputs the final result in a streaming manner.
Priority Applications (1)
- CN202410465020.XA, filed 2024-04-17, priority date 2024-04-17: Digital person generation method and system through voice real-time interaction

Publications (1)
- CN118280370A, published 2024-07-02, status: Pending

Family ID: 91633485
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination