CN114519997A - Processing method and device for video synthesis based on personalized voice - Google Patents

Processing method and device for video synthesis based on personalized voice

Info

Publication number
CN114519997A
CN114519997A (application CN202210146223.3A)
Authority
CN
China
Prior art keywords
data
processed
video
video object
personalized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210146223.3A
Other languages
Chinese (zh)
Inventor
王心莹
姚广
朱彦
余意
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Original Assignee
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority to CN202210146223.3A
Publication of CN114519997A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/10 - Office automation; Time management
    • G06Q10/103 - Workflow collaboration or project management
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a processing method and device for video synthesis based on personalized voice. Data to be processed and a video object in an authorized state are obtained, and the data to be processed is audited through an intelligent auditing technology to obtain an auditing result. If the auditing result is legal data to be processed, the voice timbre of the video object is simulated through a preset simulation rule to obtain a target voice, and the target voice is synthesized with the video corresponding to the video object to obtain a personalized video. With this scheme, personalized voice no longer needs to be manually intercepted and spliced word by word: the personalized voice is synthesized by simulating the voice timbre of the video object with a deep learning method, and the corresponding video is then synthesized, improving the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, so manual auditing is not needed and auditing efficiency is improved.

Description

Processing method and device for video synthesis based on personalized voice
Technical Field
The present application relates to the field of audio and video processing technologies, and in particular, to a processing method and apparatus for video synthesis based on personalized speech.
Background
With the improvement of computing power and the accumulation of internet data, artificial intelligence has entered a new development stage and is gradually changing the mode of human-computer interaction. An important part of the human-computer interaction process is simulating figures such as real people to interact with users; by combining speech synthesis and voice conversion technologies, personalized voice and video synthesis can be realized.
In the prior art, to make a character in a video speak text content specified by a user, an editor needs to find, among all the words spoken by the character in the video, content whose pronunciation matches each character of the text, and splice these fragments into a passage.
However, voice that is manually intercepted and spliced suffers from incoherence, stiff delivery, and similar problems, so the effect of synthesizing the personalized voice and video is poor and the interest is low.
Disclosure of Invention
In view of this, the present application discloses a processing method and apparatus for video synthesis based on personalized speech, aiming to improve the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, so manual auditing is not needed and auditing efficiency is improved.
To achieve this purpose, the technical solution disclosed by the present application is as follows:
the first aspect of the present application discloses a processing method for video synthesis based on personalized speech, the method comprising:
acquiring data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data;
performing intelligent audit on the data to be processed by an intelligent audit technology to obtain an audit result; the intelligent audit is used for detecting the legality of the data to be processed;
if the auditing result is legal data to be processed, simulating the voice timbre of the video object by a preset simulation rule to obtain a target voice; the target voice is personalized voice which is consistent with the voice timbre of the video object and is generated according to the legal data to be processed;
and synthesizing the target voice and the video corresponding to the video object to obtain the personalized video.
Preferably, the acquiring the data to be processed and the video object in the authorized state includes:
and acquiring the data to be processed and the video object in the authorized state from a preset information source.
Preferably, the performing intelligent auditing on the data to be processed through an intelligent auditing technology to obtain an auditing result includes:
auditing the data content in the data to be processed by an intelligent auditing technology;
when the data content in the data to be processed is monitored to be consistent with preset illegal data content, determining that the data to be processed is illegal data to be processed;
and when the data content in the data to be processed is monitored to be inconsistent with the preset illegal data content, determining that the data to be processed is legal data to be processed.
Preferably, when the data to be processed includes personalized text data, if the auditing result is legal data to be processed, the simulating the voice timbre of the video object by using a preset simulation rule to obtain a target voice includes:
if the auditing result is legal data to be processed, extracting sound feature information corresponding to the video object through a voiceprint recognition model, and embedding and encoding the sound feature information into a vector of fixed dimension; the vector of fixed dimension is used for representing the sound features of the video object;
predicting the personalized text data and the vector of fixed dimension through a preset recurrent sequence-to-sequence feature prediction network to obtain a first Mel spectrogram corresponding to the video object; the first Mel spectrogram is obtained by the preset recurrent sequence-to-sequence feature prediction network;
and converting the first Mel spectrogram into a first time-series sound waveform through a voice synthesizer, and obtaining the target voice based on the first time-series sound waveform.
Preferably, if the auditing result is legal data to be processed, the simulating the voice timbre of the video object by using a preset simulation rule to obtain a target voice includes:
if the auditing result is legal data to be processed, acquiring a plurality of pieces of sound feature information corresponding to the video object;
synthesizing the plurality of pieces of sound feature information corresponding to the video object through a preset non-autoregressive speech synthesis model architecture to obtain a second Mel spectrogram; the second Mel spectrogram is obtained by the preset non-autoregressive speech synthesis model architecture;
and converting the second Mel spectrogram into a second time-series sound waveform through a voice synthesizer, and obtaining the target voice based on the second time-series sound waveform.
Preferably, the method further comprises the following steps:
and if the data to be processed is illegal data to be processed, generating illegal prompt information.
Preferably, the synthesizing the target voice and the video corresponding to the video object to obtain a personalized video includes:
and synthesizing the target voice and the video corresponding to the video object according to a preset special effect to obtain the personalized video.
A second aspect of the present application discloses a processing apparatus for video synthesis based on personalized speech, the apparatus comprising:
the acquisition unit is used for acquiring data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data;
the auditing unit is used for intelligently auditing the data to be processed through an intelligent auditing technology to obtain an auditing result; the intelligent audit is used for detecting the legality of the data to be processed;
the simulation unit is used for simulating the voice timbre of the video object through a preset simulation rule to obtain a target voice if the auditing result is legal data to be processed; the target voice is personalized voice which is consistent with the voice timbre of the video object and is generated according to the legal data to be processed;
and the synthesis unit is used for synthesizing the target voice and the video corresponding to the video object to obtain the personalized video.
Preferably, the obtaining unit is specifically configured to:
acquire the data to be processed and the video object in an authorized state from a preset information source.
Preferably, the auditing unit includes:
the auditing module is used for auditing the data content in the data to be processed by an intelligent auditing technology;
the first determining module is used for determining that the data to be processed is illegal data to be processed when the data content in the data to be processed is monitored to be consistent with the preset illegal data content;
and the second determining module is used for determining that the data to be processed is legal data to be processed when the data content in the data to be processed is monitored to be inconsistent with the preset illegal data content.
According to the above technical solution, the present application discloses a processing method and device for video synthesis based on personalized voice. Data to be processed and a video object in an authorized state are obtained, where the data to be processed includes personalized text data and/or personalized audio data. The data to be processed is audited through an intelligent auditing technology to obtain an auditing result, where the intelligent auditing detects the legality of the data to be processed. If the auditing result is legal data to be processed, the voice timbre of the video object is simulated through a preset simulation rule to obtain a target voice, where the target voice is personalized voice generated according to the legal data to be processed and consistent with the voice timbre of the video object; the target voice is then synthesized with the video corresponding to the video object to obtain a personalized video. With this scheme, personalized voice no longer needs to be manually intercepted and spliced word by word: the personalized voice is synthesized by simulating the voice timbre of the video object with a deep learning method, and the corresponding video is then synthesized, improving the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, so manual auditing is not needed and auditing efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a processing method for video synthesis based on personalized speech according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a processing apparatus for video synthesis based on personalized speech according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As can be seen from the background art, in the prior art, to make a character in a video speak text content specified by a user, an editor needs to find, among all the words spoken by the character in the video, content whose pronunciation matches each character of the text and splice these fragments into a passage. However, voice that is manually intercepted and spliced suffers from incoherence and stiff delivery, so the effect of synthesizing personalized voice and video is poor and the interest is low.
To solve the above problems, an embodiment of the present application discloses a processing method and device for video synthesis based on personalized voice, which do not require manually intercepting and splicing personalized voice word by word; the personalized voice is synthesized by simulating the voice timbre of the video object with a deep learning method, and the corresponding video is then synthesized, improving the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, so manual auditing is not needed and auditing efficiency is improved. Specific implementations are illustrated by the following examples.
Referring to fig. 1, an embodiment of the present application discloses a processing method for video synthesis based on personalized speech, which mainly includes the following steps:
s101: acquiring data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data.
In S101, the video object includes a star, or a virtual character such as an animation or cartoon character, and the like.
What the user decides to have the video object say may be personalized text data (e.g., text content entered by the user) and/or personalized audio data (e.g., audio data recorded by the user).
The user selects a video character in an authorized state, or a virtual character such as an animation or cartoon character; the video character must have granted personal authorization, and the virtual character must be covered by copyright.
Once the personal authorization of the video character, or the copyright of the virtual character such as an animation or cartoon character, is obtained, the video object is in an authorized state.
The specific video object is determined by a technician according to actual conditions, and the application is not particularly limited.
For example, a celebrity may agree to let his or her voice be used to synthesize some audio. In general, there may be limits on the range of use of that audio, such as it being used only to generate blessing videos.
The user may decide what the video object says. The words to be spoken by the video object may be obtained by the user entering personalized text data and/or recording personalized audio data.
The data to be processed and the video object in an authorized state are acquired from a preset information source.
The preset information source may be an application (APP) or an activity page reached by parsing a link. For example, the preset information source, i.e., the activity entry of an activity page, may be an evening gala, a New Year live event, a star-related variety show, an e-commerce platform, etc.
S102: performing intelligent audit on the data to be processed by an intelligent audit technology to obtain an audit result; the intelligent audit is used for detecting the validity of the data to be processed, if the audit result is legal data to be processed, S103 is executed, and if the audit result is illegal data to be processed, S105 is executed.
In S102, an intelligent auditing technology is used to intelligently audit the data to be processed (the personalized text data or personalized audio data input by the user) and accurately distinguish legal data to be processed from illegal data to be processed (such as banned content and spam).
The intelligent auditing technology may be a text content analysis algorithm or another analysis algorithm; the choice of the specific intelligent auditing technology is made by a technician according to actual conditions, and this application does not specifically limit it. The intelligent auditing technology of the present application preferably employs a text content analysis algorithm.
Whether the audio or text input by the user is compliant is audited automatically by the intelligent auditing technology. Detecting the content input by the user effectively avoids risk and brand damage caused by violations. Because compliance is checked by the intelligent auditing technology, manual auditing is not needed and efficiency is greatly improved.
The intelligent auditing technology supports recognizing variants such as pinyin substitutions, homophones, split characters, visually similar characters, and oblique references.
Specifically, the process of performing intelligent audit on the data to be processed by an intelligent audit technology to obtain an audit result is shown as A1-A3.
A1: auditing the data content in the data to be processed by an intelligent auditing technology.
A2: when the data content in the data to be processed is monitored to be consistent with the preset illegal data content, determining that the data to be processed is illegal data to be processed.
The preset illegal data content includes pre-stored prohibited content such as banned words and spam.
The determination of the preset illegal data content is set by a technician according to actual conditions, and the application is not limited specifically.
For ease of understanding, the process of determining that the data to be processed is illegal when its content matches the preset illegal data content is illustrated here:
For example, forbidden content is stored in advance as preset illegal data content; when the data content in the data to be processed is monitored to match that forbidden content, the data to be processed is determined to be illegal data to be processed.
A3: when the data content in the data to be processed is monitored to be inconsistent with the preset illegal data content, determining that the data to be processed is legal data to be processed.
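Steps A1 to A3 amount to matching the input content against pre-stored illegal content. A minimal Python sketch of that check, in which the banned terms and the variant-normalization map are invented placeholders rather than anything from the patent:

```python
# Toy audit: flag data whose content matches preset illegal content.
# BANNED and VARIANTS are illustrative placeholders only.
BANNED = {"badword", "spamterm"}
VARIANTS = {"b@dword": "badword", "sp4mterm": "spamterm"}  # homophone/leet-style variants

def audit(text):
    """Return 'illegal' if any banned term (or known variant) appears
    in the text, otherwise 'legal' -- mirroring steps A1 to A3."""
    normalized = text.lower()
    for variant, canonical in VARIANTS.items():
        normalized = normalized.replace(variant, canonical)  # variant recognition
    for term in BANNED:
        if term in normalized:
            return "illegal"
    return "legal"

print(audit("Happy New Year"))    # legal
print(audit("this has sp4mterm")) # illegal (variant mapped to banned term)
```

A production system would replace the substring match with a trained text-content analysis model, but the legal/illegal decision structure is the same.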
S103: simulating the voice timbre of the video object by a preset simulation rule to obtain a target voice; the target voice is personalized voice which is consistent with the voice timbre of the video object and is generated according to the legal data to be processed.
In S103, when the data to be processed includes personalized text data and the auditing result is legal data to be processed, a voice with the same timbre as the video object is generated by simulating the voice timbre of the video object, according to the content of the personalized text data and/or personalized audio data input by the user.
For example, if the user selects a certain star as the video object and the personalized text data input by the user is "Happy New Year, good health", the star's voice timbre is simulated to generate speech that sounds like the star's voice, producing a blessing video in which "Happy New Year, good health" is spoken in a voice resembling the star's.
Specifically, the process of simulating the voice timbre of the video object through a preset simulation rule to obtain the target voice includes a first mode and a second mode:
the specific process of the first mode is shown as B1-B3.
B1: if the auditing result is legal data to be processed, extracting sound feature information corresponding to the video object through a voiceprint recognition model, and embedding and encoding the sound feature information into a vector of fixed dimension; the vector of fixed dimension is used for representing the sound features of the video object.
A voiceprint recognition model is established through a voice feature encoder to obtain the latent voice features of the video object.
A voiceprint recognition model is a model that achieves the purpose of distinguishing unknown sounds by analyzing the characteristics of one or more speech signals. The theoretical basis for voiceprint recognition is that each sound has a unique characteristic by which it is possible to effectively distinguish between different human voices.
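To picture step B1, the sketch below fakes a voiceprint encoder that folds a waveform into a fixed-dimension unit vector and compares two embeddings by cosine similarity. The bucket-averaging "encoder" is a placeholder of my own, not the patent's model; a real system uses a trained neural speaker encoder:

```python
import math

def embed(samples, dim=4):
    """Stand-in for a voiceprint (speaker) encoder: fold a waveform into a
    fixed-dimension vector. A real encoder is a trained neural network; this
    just averages samples into 'dim' buckets so the shape is right."""
    n = max(1, len(samples) // dim)
    vec = [sum(samples[i * n:(i + 1) * n]) / n for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-length embedding

def cosine(a, b):
    """Cosine similarity between two unit-length voice embeddings; near 1.0
    means the voiceprints are effectively the same speaker."""
    return sum(x * y for x, y in zip(a, b))

voice_a = embed([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2])
voice_b = embed([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2])
print(round(cosine(voice_a, voice_b), 3))  # identical input -> 1.0
```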
B2: predicting the personalized text data and the vector of fixed dimension through a preset recurrent sequence-to-sequence feature prediction network to obtain a first Mel spectrogram corresponding to the video object; the first Mel spectrogram is obtained by the preset recurrent sequence-to-sequence feature prediction network.
The preset recurrent sequence-to-sequence feature prediction network is Tacotron 2, an attention-based recurrent sequence-to-sequence feature prediction network.
End-to-end speech synthesis is realized based on Tacotron 2, the attention-based recurrent sequence-to-sequence feature prediction network.
The first Mel spectrogram corresponding to the video object is generated by feeding the character sequence input by the user (the personalized text data of the data to be processed) and the vector of fixed dimension (the sound feature information of the video object) into the preset recurrent sequence-to-sequence feature prediction network.
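As a rough illustration of how the fixed-dimension speaker vector can condition the text encoding before mel prediction, the toy sketch below concatenates the vector onto every character encoding, one common conditioning scheme in Tacotron 2-style voice cloning. All numbers and dimensions are invented; a real system does this inside a trained network:

```python
def condition_text_on_speaker(char_encodings, speaker_vec):
    """Concatenate the fixed-dimension voice embedding onto each character
    encoding, so the downstream attention/decoder sees both the text content
    and the target speaker's timbre at every step."""
    return [enc + speaker_vec for enc in char_encodings]

chars = [[0.1, 0.9], [0.4, 0.6]]   # toy encodings of the user's text
speaker = [0.5, 0.5, 0.5]          # toy fixed-dimension voice embedding
conditioned = condition_text_on_speaker(chars, speaker)
print(len(conditioned[0]))  # 2 text dims + 3 speaker dims = 5
```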
B3: converting the first Mel spectrogram into a first time-series sound waveform through a voice synthesizer, and obtaining the target voice based on the first time-series sound waveform.
The voice synthesizer converts the first Mel spectrogram (spectral domain) into a first time-series sound waveform (time domain), completing the synthesis of the target voice.
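The spectral-domain to time-domain conversion in B3 can be pictured as summing sinusoids per spectrogram frame with overlap-add. This is a deliberately crude stand-in with invented parameters; real voice synthesizers use a neural vocoder or Griffin-Lim phase reconstruction:

```python
import math

def frames_to_waveform(mag_frames, frame_len=64, hop=32, sample_rate=8000):
    """Toy spectral-to-time conversion: turn magnitude-spectrum frames into a
    time-series waveform by summing Hann-windowed sinusoids with overlap-add."""
    n_bins = len(mag_frames[0])
    total = hop * (len(mag_frames) - 1) + frame_len
    wave = [0.0] * total
    for f, mags in enumerate(mag_frames):
        start = f * hop
        for k, amp in enumerate(mags):
            if amp == 0.0:
                continue
            freq = (k + 1) * sample_rate / (2.0 * n_bins)  # bin index -> Hz
            for t in range(frame_len):
                # Hann window smooths the seams between overlapping frames
                w = 0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
                phase = 2 * math.pi * freq * (start + t) / sample_rate
                wave[start + t] += amp * w * math.sin(phase)
    return wave

frames = [[0.0] * 8, [0.0] * 8]
frames[0][1] = 1.0  # low-frequency energy in the first frame
frames[1][5] = 1.0  # higher-frequency energy in the second frame
wave = frames_to_waveform(frames)
print(len(wave))  # hop * (2 - 1) + frame_len = 96
```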
The specific process of the second mode is shown as C1-C3.
C1: if the auditing result is legal data to be processed, acquiring a plurality of pieces of sound feature information corresponding to the video object.
For example, two hundred pieces of sound feature information corresponding to the video object may be trained in advance, providing higher accuracy for the subsequently obtained target voice.
C2: synthesizing the plurality of pieces of sound feature information corresponding to the video object through a preset non-autoregressive speech synthesis model architecture to obtain a second Mel spectrogram; the second Mel spectrogram is obtained by the preset non-autoregressive speech synthesis model architecture.
The predetermined non-autoregressive speech synthesis model architecture is the non-autoregressive speech synthesis model architecture FastSpeech 2.
C3: converting the second Mel spectrogram into a second time-series sound waveform through a voice synthesizer, and obtaining the target voice based on the second time-series sound waveform.
The voice synthesizer converts the second Mel spectrogram (spectral domain) into a second time-series sound waveform (time domain), completing the synthesis of the target voice.
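A distinctive component of a non-autoregressive architecture such as FastSpeech 2 is the length regulator, which expands each phoneme's hidden state by a predicted duration so the whole mel spectrogram can be generated in parallel rather than frame by frame. A toy sketch with invented values:

```python
def length_regulate(phoneme_states, durations):
    """Sketch of a FastSpeech 2-style length regulator: repeat each phoneme's
    hidden state for its predicted number of mel frames, so the expanded
    sequence matches the spectrogram length and can be decoded in parallel."""
    expanded = []
    for state, dur in zip(phoneme_states, durations):
        expanded.extend([state] * dur)  # repeat state for 'dur' frames
    return expanded

states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy phoneme hidden states
durations = [2, 1, 3]                          # predicted frames per phoneme
mel_input = length_regulate(states, durations)
print(len(mel_input))  # 2 + 1 + 3 = 6 frames
```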
S104: and synthesizing the target voice and the video corresponding to the video object to obtain the personalized video.
In S104, synthesizing the target voice and the video corresponding to the video object according to the preset special effect, so as to obtain a personalized video.
The preset special effects include special effects such as decorative captions, sound effects and background music; the specific preset special effects are determined by technicians according to actual conditions, and this application does not specifically limit them.
The target voice is combined with the video recorded for the video object, and special effects such as decorative captions, sound effects and background music are added, to synthesize a complete, customized, smooth and realistic personalized video, such as a blessing video.
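The final muxing step, combining the target voice, the background-music special effect and the video, could be done with a tool such as ffmpeg. The patent names no specific tool, so the command below, the filter graph, and all file paths are assumptions for illustration:

```python
def build_mux_command(video_path, speech_path, music_path, out_path):
    """Assemble an ffmpeg command that replaces the video's audio track with
    the synthesized target voice mixed over quieter background music."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # video corresponding to the video object
        "-i", speech_path,  # synthesized target voice
        "-i", music_path,   # background-music special effect
        # lower the music, then mix it with the voice into one track [a]
        "-filter_complex",
        "[2:a]volume=0.3[bg];[1:a][bg]amix=inputs=2:duration=first[a]",
        "-map", "0:v", "-map", "[a]",
        "-c:v", "copy", "-shortest", out_path,
    ]

cmd = build_mux_command("star.mp4", "voice.wav", "music.mp3", "blessing.mp4")
print(cmd[0])  # ffmpeg
```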
The application can be applied to gala live broadcasts, star-related variety shows, blessings sent on special festivals, encouragement videos for special activities, and the like.
Artificial intelligence (AI) technology supports mass production of blessing videos in a short time, sending personalized, customized blessing videos to people across the country.
The personalized video supports downloading, sharing, being configured as a red-envelope cover, and the like.
S105: generating illegal prompt information.
In S105, illegal prompt information is generated to remind the user that the auditing result is illegal data to be processed; if the auditing result is illegal data to be processed, the personalized video cannot be synthesized.
In the embodiment of the application, personalized voice does not need to be manually intercepted and spliced word by word: the personalized voice is synthesized by simulating the voice timbre of the video object with a deep learning method, and the corresponding video is then synthesized, improving the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, so manual participation in auditing is not needed and auditing efficiency is improved.
Based on the processing method for video synthesis based on personalized speech disclosed in fig. 1 in the foregoing embodiment, an embodiment of the present application further correspondingly discloses a processing device for video synthesis based on personalized speech, and as shown in fig. 2, the processing device for video synthesis based on personalized speech includes an obtaining unit 201, an auditing unit 202, a simulating unit 203, and a synthesizing unit 204.
An obtaining unit 201, configured to obtain data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data.
The auditing unit 202 is configured to perform intelligent auditing on the data to be processed through an intelligent auditing technology to obtain an auditing result; the intelligent audit is used for detecting the validity of the data to be processed.
The simulation unit 203 is configured to, if the audit result is legal data to be processed, simulate the voice timbre of the video object according to a preset simulation rule to obtain a target voice; the target voice is a personalized voice which matches the voice timbre of the video object and is generated according to the legal data to be processed.
And a synthesizing unit 204, configured to synthesize the target voice and the video corresponding to the video object to obtain a personalized video.
Further, the obtaining unit 201 is specifically configured to obtain the data to be processed and the video object in the authorized state from a preset information source.
Further, the auditing unit 202 includes an auditing module, a first determining module, and a second determining module.
The auditing module is used for auditing the data content in the data to be processed through an intelligent auditing technology.
The first determining module is used for determining that the data to be processed is illegal data to be processed when it is detected that the data content in the data to be processed is consistent with preset illegal data content.
The second determining module is used for determining that the data to be processed is legal data to be processed when it is detected that the data content in the data to be processed is inconsistent with the preset illegal data content.
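The auditing logic of these modules amounts to comparing the data content against preset illegal data content. A minimal sketch, assuming a placeholder keyword list (a production system would use a richer intelligent auditing model, not a flat list):

```python
# Hypothetical preset illegal data content; placeholders, not from the patent.
ILLEGAL_CONTENT = {"forbidden-word", "banned-phrase"}

def audit(text: str) -> str:
    """Return 'illegal' if any preset illegal content appears in the data
    content, otherwise 'legal' (mirroring the first/second determining
    modules described above)."""
    lowered = text.lower()
    if any(term in lowered for term in ILLEGAL_CONTENT):
        return "illegal"
    return "legal"

print(audit("happy birthday and best wishes"))   # legal
print(audit("this contains a banned-phrase"))    # illegal
```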
Further, the simulation unit 203 includes an extraction module, a prediction module, and a first conversion module.
The extraction module is used for extracting the sound characteristic information corresponding to the video object through a voiceprint recognition model and embedding the sound characteristic information into a fixed-dimension vector if the audit result is legal data to be processed; the fixed-dimension vector is used to characterize the sound features of the video object.
The prediction module is used for predicting the personalized text data and the fixed-dimension vector through a preset recurrent sequence-to-sequence feature prediction network to obtain a first Mel spectrogram corresponding to the video object; the first Mel spectrogram is obtained from the preset recurrent sequence-to-sequence feature prediction network.
The first conversion module is used for converting the first Mel spectrogram into a first time-series sound waveform through a speech synthesizer, and obtaining the target voice based on the first time-series sound waveform.
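The three stages handled by the extraction, prediction and conversion modules can be sketched end to end. The following is a toy numpy illustration under assumed shapes (EMBED_DIM, N_MELS, hop length are arbitrary choices): the stand-in functions only mimic the interfaces of a voiceprint encoder, a recurrent frame-by-frame predictor, and a speech synthesizer, not the patent's actual trained networks.

```python
import numpy as np

EMBED_DIM = 64   # assumed fixed dimensionality of the speaker embedding
N_MELS = 80      # assumed number of mel bins per spectrogram frame

def speaker_embedding(reference_audio: np.ndarray) -> np.ndarray:
    """Voiceprint stage: encode sound features into a fixed-dimension vector."""
    frames = reference_audio[: len(reference_audio) // EMBED_DIM * EMBED_DIM]
    emb = frames.reshape(-1, EMBED_DIM).mean(axis=0)   # crude fixed-size summary
    return emb / (np.linalg.norm(emb) + 1e-8)          # unit-normalized

def predict_mel(text: str, spk_emb: np.ndarray) -> np.ndarray:
    """Recurrent seq-to-seq stage (stand-in): one mel frame per character,
    each frame conditioned on the previous frame plus the speaker embedding."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((N_MELS, N_MELS)) * 0.01
    cond = np.resize(spk_emb, N_MELS)          # broadcast speaker condition
    frame = np.zeros(N_MELS)
    mel = []
    for ch in text:                            # autoregressive loop over text
        frame = np.tanh(W @ frame + cond + ord(ch) / 1000.0)
        mel.append(frame)
    return np.stack(mel)                       # shape (T, N_MELS)

def synthesizer(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Conversion stage (stand-in): expand each mel frame to `hop` samples,
    yielding a time-series sound waveform."""
    return np.repeat(mel.mean(axis=1), hop)

emb = speaker_embedding(np.random.default_rng(1).standard_normal(16000))
mel = predict_mel("happy new year", emb)
wav = synthesizer(mel)
print(mel.shape, wav.shape)   # one frame per character, hop samples per frame
```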
Further, the simulation unit 203 includes an obtaining module, a synthesizing module, and a second converting module.
The obtaining module is used for obtaining a plurality of pieces of sound characteristic information corresponding to the video object if the audit result is legal data to be processed.
The synthesizing module is used for synthesizing the plurality of pieces of sound characteristic information corresponding to the video object through a preset non-autoregressive speech synthesis model architecture to obtain a second Mel spectrogram; the second Mel spectrogram is a Mel spectrogram obtained through the preset non-autoregressive speech synthesis model architecture.
The second converting module is used for converting the second Mel spectrogram into a second time-series sound waveform through a speech synthesizer, and obtaining the target voice based on the second time-series sound waveform.
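The non-autoregressive path of the obtaining, synthesizing and converting modules can be sketched as follows: per-unit sound characteristic information is expanded by a length regulator and all mel frames are produced in one parallel step, with no frame-by-frame recursion. The feature choices (phoneme, duration, pitch) and all shapes are illustrative assumptions, not the patent's concrete model architecture.

```python
import numpy as np

N_MELS = 80   # assumed number of mel bins

def synthesize_parallel(phoneme_feats: np.ndarray,
                        durations: np.ndarray,
                        pitch: np.ndarray) -> np.ndarray:
    """Combine several pieces of sound characteristic information into a
    second Mel spectrogram in a single parallel pass."""
    # Length regulator: repeat each phoneme's features by its predicted duration.
    expanded = np.repeat(phoneme_feats + pitch[:, None], durations, axis=0)
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((phoneme_feats.shape[1], N_MELS)) * 0.1
    return np.tanh(expanded @ proj)   # (sum(durations), N_MELS), no recursion

feats = np.random.default_rng(1).standard_normal((5, 32))  # 5 phonemes
durs = np.array([3, 2, 4, 1, 2])                           # frames per phoneme
pitch = np.zeros(5)
mel2 = synthesize_parallel(feats, durs, pitch)
print(mel2.shape)   # (12, 80): every frame computed at once
```

The design trade-off is the usual one: the autoregressive predictor conditions each frame on the previous one, while this parallel variant trades that dependency for much faster synthesis.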
Further, the processing device for video synthesis based on personalized speech also comprises a generating unit.
And the generating unit is used for generating illegal prompt information if the data to be processed is illegal data to be processed.
Further, the synthesis unit 204 is specifically configured to synthesize the target voice and the video corresponding to the video object according to a preset special effect, so as to obtain a personalized video.
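One way this final synthesis step could be realized in practice is by muxing the target voice with the recorded video and applying the preset special effects (caption text, background music) through a tool such as ffmpeg. The sketch below only constructs a hypothetical command line; the file names and filter settings are assumptions, and the command is not executed here.

```python
# Build (but do not run) an ffmpeg command that mixes the synthesized target
# voice with background music and overlays a caption onto the object's video.
def build_mux_command(video: str, voice: str, bgm: str,
                      caption: str, out: str) -> list[str]:
    return [
        "ffmpeg",
        "-i", video,    # video recorded by the video object
        "-i", voice,    # synthesized target voice
        "-i", bgm,      # background-music special effect
        "-filter_complex",
        # mix the two audio inputs; draw the caption text onto the video
        f"[1:a][2:a]amix=inputs=2[a];[0:v]drawtext=text='{caption}'[v]",
        "-map", "[v]", "-map", "[a]",
        out,
    ]

cmd = build_mux_command("object.mp4", "target_voice.wav", "bgm.mp3",
                        "Happy New Year", "personalized.mp4")
print(" ".join(cmd))
```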
In the embodiment of the application, the personalized voice does not need to be manually clipped and spliced word by word; instead, a deep learning method simulates the voice timbre of the video object to synthesize the personalized voice, and the video corresponding to the video object is then synthesized, which improves the effect, diversity and interest of personalized voice and video synthesis. In addition, an intelligent auditing technology checks whether the content input by the user is compliant, with no manual review required, which improves auditing efficiency.
An embodiment of the application further provides a storage medium, which includes stored instructions; when the instructions run, the device where the storage medium is located is controlled to execute the above processing method for video synthesis based on personalized voice.
An embodiment of the present application further provides an electronic device, whose schematic structural diagram is shown in fig. 3. It specifically includes a memory 301 and one or more instructions 302, where the one or more instructions 302 are stored in the memory 301 and are configured to be executed by one or more processors 303, so as to perform the above processing method for video synthesis based on personalized speech.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system-class embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A processing method for video synthesis based on personalized speech, the method comprising:
acquiring data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data;
performing intelligent audit on the data to be processed by an intelligent audit technology to obtain an audit result; the intelligent audit is used for detecting the validity of the data to be processed;
if the auditing result is legal data to be processed, simulating the voice timbre of the video object by a preset simulation rule to obtain a target voice; the target voice is a personalized voice which matches the voice timbre of the video object and is generated according to the legal data to be processed;
and synthesizing the target voice and the video corresponding to the video object to obtain the personalized video.
2. The method of claim 1, wherein the obtaining the pending data and the video object in the authorized state comprises:
and acquiring the data to be processed and the video object in the authorized state from a preset information source.
3. The method according to claim 1, wherein the performing an intelligent audit on the data to be processed by an intelligent audit technique to obtain an audit result includes:
auditing the data content in the data to be processed through an intelligent auditing technology;
when it is detected that the data content in the data to be processed is consistent with preset illegal data content, determining that the data to be processed is illegal data to be processed;
and when it is detected that the data content in the data to be processed is inconsistent with the preset illegal data content, determining that the data to be processed is legal data to be processed.
4. The method according to claim 3, wherein when the data to be processed includes personalized text data, if the audit result is legal data to be processed, simulating the voice timbre of the video object by a preset simulation rule to obtain a target voice comprises:
if the audit result is legal data to be processed, extracting sound characteristic information corresponding to the video object through a voiceprint recognition model, and embedding the sound characteristic information into a fixed-dimension vector; the fixed-dimension vector is used for characterizing the sound features of the video object;
predicting the personalized text data and the fixed-dimension vector through a preset recurrent sequence-to-sequence feature prediction network to obtain a first Mel spectrogram corresponding to the video object; the first Mel spectrogram is a Mel spectrogram obtained from the preset recurrent sequence-to-sequence feature prediction network;
and converting the first Mel spectrogram into a first time-series sound waveform through a speech synthesizer, and obtaining the target voice based on the first time-series sound waveform.
5. The method according to claim 3, wherein if the audit result is legal data to be processed, simulating the voice timbre of the video object by a preset simulation rule to obtain a target voice comprises:
if the audit result is legal data to be processed, obtaining a plurality of pieces of sound characteristic information corresponding to the video object;
synthesizing the plurality of pieces of sound characteristic information corresponding to the video object through a preset non-autoregressive speech synthesis model architecture to obtain a second Mel spectrogram; the second Mel spectrogram is obtained through the preset non-autoregressive speech synthesis model architecture;
and converting the second Mel spectrogram into a second time-series sound waveform through a speech synthesizer, and obtaining the target voice based on the second time-series sound waveform.
6. The method of claim 3, further comprising:
and if the data to be processed is illegal data to be processed, generating illegal prompt information.
7. The method of claim 1, wherein the synthesizing the target speech with the video corresponding to the video object to obtain a personalized video comprises:
and synthesizing the target voice and the video corresponding to the video object according to a preset special effect to obtain a personalized video.
8. A processing apparatus for video synthesis based on personalized speech, the apparatus comprising:
the acquisition unit is used for acquiring data to be processed and a video object in an authorized state; the data to be processed comprises personalized text data and/or personalized audio data;
the auditing unit is used for intelligently auditing the data to be processed through an intelligent auditing technology to obtain an auditing result; the intelligent audit is used for detecting the legality of the data to be processed;
the simulation unit is used for simulating the voice timbre of the video object through a preset simulation rule to obtain a target voice if the audit result is legal data to be processed; the target voice is a personalized voice which matches the voice timbre of the video object and is generated according to the legal data to be processed;
and the synthesis unit is used for synthesizing the target voice and the video corresponding to the video object to obtain the personalized video.
9. The apparatus according to claim 8, wherein the obtaining unit is specifically configured to:
and acquiring the data to be processed and the video object in an authorized state from a preset information source.
10. The apparatus according to claim 8, wherein the auditing unit includes:
the auditing module is used for auditing the data content in the data to be processed through an intelligent auditing technology;
the first determining module is used for determining that the data to be processed is illegal data to be processed when it is detected that the data content in the data to be processed is consistent with preset illegal data content;
and the second determining module is used for determining that the data to be processed is legal data to be processed when it is detected that the data content in the data to be processed is inconsistent with the preset illegal data content.
CN202210146223.3A 2022-02-17 2022-02-17 Processing method and device for video synthesis based on personalized voice Pending CN114519997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146223.3A CN114519997A (en) 2022-02-17 2022-02-17 Processing method and device for video synthesis based on personalized voice

Publications (1)

Publication Number Publication Date
CN114519997A true CN114519997A (en) 2022-05-20

Family

ID=81598188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146223.3A Pending CN114519997A (en) 2022-02-17 2022-02-17 Processing method and device for video synthesis based on personalized voice

Country Status (1)

Country Link
CN (1) CN114519997A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103117057A (en) * 2012-12-27 2013-05-22 安徽科大讯飞信息科技股份有限公司 Application method of special human voice synthesis technique in mobile phone cartoon dubbing
CN110674255A (en) * 2019-09-24 2020-01-10 湖南快乐阳光互动娱乐传媒有限公司 Text content auditing method and device
CN111369969A (en) * 2020-02-20 2020-07-03 湖南芒果听见科技有限公司 Method and terminal for editing and broadcasting news information
CN111739507A (en) * 2020-05-07 2020-10-02 广东康云科技有限公司 AI-based speech synthesis method, system, device and storage medium
WO2020248393A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Speech synthesis method and system, terminal device, and readable storage medium
US20210390945A1 (en) * 2020-06-12 2021-12-16 Baidu Usa Llc Text-driven video synthesis with phonetic dictionary


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination