CN110493613B - Video lip synchronization synthesis method and system - Google Patents

Video lip synchronization synthesis method and system

Info

Publication number
CN110493613B
CN110493613B
Authority
CN
China
Prior art keywords
lip
video
codes
cloud server
prototype
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910758080.XA
Other languages
Chinese (zh)
Other versions
CN110493613A (en)
Inventor
郭志扬
乔健
吴鹏程
陈起航
朱西锋
丁航
陆佳莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Aoxin Technology Co Ltd
Original Assignee
Jiangsu Aoxin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Aoxin Technology Co Ltd filed Critical Jiangsu Aoxin Technology Co Ltd
Priority to CN201910758080.XA priority Critical patent/CN110493613B/en
Publication of CN110493613A publication Critical patent/CN110493613A/en
Application granted granted Critical
Publication of CN110493613B publication Critical patent/CN110493613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242Synchronization processes, e.g. processing of PCR [Program Clock References]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a video lip-synchronization synthesis method and system, belonging to the technical field of lip synchronization. The method specifically comprises the following steps: a cloud server receives a pronunciation manuscript through a terminal device and splits the manuscript into sentences at punctuation marks; the cloud server arranges and combines the lip shapes of each split sentence and matches them against prototype videos; the successfully matched prototype videos of each sentence are spliced to form a synthesized video; the playing time of the synthesized video is calculated; and the cloud server sets the speech rate for the received pronunciation manuscript according to that time, so that the pronunciation duration equals the text playing duration. The invention assigns different codes to the different lip shapes formed when the characters of the pronunciation manuscript are pronounced, then selects and synthesizes the prototype videos corresponding to those lip shapes, so that the lip shapes stay consistent with the sounds while the character's picture is played with the voice, increasing the sense of realism.

Description

Video lip synchronization synthesis method and system
Technical Field
The invention belongs to the technical field of sound lip synchronization, and particularly relates to a video sound lip synchronization synthesis method and system.
Background
To strengthen communication with customers and prospective customers and provide them with better products and technical services, many merchants and organizations maintain their own customer-service and after-sales technical-service departments. The staff of these departments carry a heavy daily workload of online and offline customer communication, much of it repetitive question answering and guidance, and they cannot serve users online or on duty 24 hours a day, so a virtual real-person robot meets this need: a large number of real-person videos and answering voices are stored and played on a display screen to give corresponding feedback to customers' questions.
However, because the lip shape of the person in the video and the answering voice are synthesized in post-production, the lips are not synchronized with the voice: the words the customer hears do not match the lip movements, the effect of face-to-face communication with a real customer-service agent is lost, and the customer may become psychologically resistant to the service.
Disclosure of Invention
To solve the technical problems in the background art, the present invention provides a video lip-synchronization synthesis method and system that gives viewers a sense of realism.
The invention is realized by the following technical scheme: in a video lip-synchronization synthesis method, prototype video files of various lip shapes suitable for a virtual robot are stored in a cloud server;
the video lip synchronization synthesis method specifically comprises the following steps:
Step 1: the cloud server receives the pronunciation manuscript through the terminal device and splits the manuscript into sentences at punctuation marks;
Step 2: the cloud server arranges and combines the lip shapes of each split sentence and matches them against the prototype videos;
Step 3: splicing the successfully matched prototype videos of each sentence to form a synthesized video;
Step 4: calculating the playing time of the synthesized video formed in step 3;
Step 5: the cloud server sets the speech rate according to the time from step 4 so that the pronunciation duration equals the text playing duration, and sends the text to the voice gateway, which converts it into a sound file and sends the file back to the cloud server;
Step 6: combining the synthesized video from step 3 with the sound from step 5 to form the final synthesized video;
Step 7: playing the final synthesized video from step 6 on a specified terminal and exiting the system.
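The sentence splitting of step 1 can be sketched as follows. This is an illustrative reconstruction, not code from the patent; the function name and the exact punctuation set are assumptions.

```python
import re

# Illustrative sketch of step 1 (not from the patent): split the
# pronunciation manuscript into sentences at punctuation marks.
def split_manuscript(manuscript: str) -> list[str]:
    # Split at common Chinese and Western sentence punctuation
    # (the exact set used by the patent is not specified).
    parts = re.split(r"[。！？；，.!?;,]", manuscript)
    # Drop empty fragments produced by trailing punctuation.
    return [p.strip() for p in parts if p.strip()]
```

For example, `split_manuscript("你好。欢迎光临！")` yields `["你好", "欢迎光临"]`, one entry per sentence to be matched in step 2.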
In a further embodiment, step 2 specifically includes the following steps:
Step 2.1: convert each Chinese character in the sentence into pinyin and assign a lip code to each pronounced unit: for a consonant (initial), the code is 1 if the lips are slightly open and 2 if widely open; for a simple vowel (final), 3 if slightly open and 4 if widely open; for a compound vowel, 5 if slightly open and 6 if widely open; this yields the sentence's lip permutation code string;
Step 2.2: search the prototype video library for a prototype video whose lip permutation code is identical or close, where the lip code of the last word of the sentence must match exactly;
Step 2.3: if one is found, go to step 3;
Step 2.4: if the prototype video library contains no identical or close lip permutation code, split the sentence's code a limited number of times until every segment matches a prototype video with an identical or close code (the lip code of the last word of the sentence must still match exactly), splice those prototype videos into the sentence video, and go to step 3;
Step 2.5: if no prototype video with an identical or close code can be found even after the limited splitting, report that a prototype video supplementing this lip permutation code should be added, mark the match as failed, and exit.
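The six-way coding of step 2.1 can be sketched as a small function. The grouping into consonants, simple vowels, and compound vowels follows the description above; the function names and the tuple encoding of a sentence are assumptions for illustration.

```python
# Sketch of step 2.1's lip coding (illustrative; names are assumptions).
# Each pronounced unit is classified by kind -- "consonant" (initial),
# "vowel" (simple final), or "compound" (compound final) -- and by
# whether the lips are widely open, giving codes 1-6.
def lip_code(kind: str, wide: bool) -> int:
    base = {"consonant": 1, "vowel": 3, "compound": 5}[kind]
    return base + (1 if wide else 0)

def sentence_lip_code(units: list[tuple[str, bool]]) -> str:
    # Concatenate per-unit codes into the sentence's lip permutation code.
    return "".join(str(lip_code(kind, wide)) for kind, wide in units)
```

For instance, a sentence whose units are a slightly open consonant followed by a widely open vowel gets the permutation code "14", which is then looked up in the prototype video library (step 2.2).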
A video lip-synchronization synthesis system using the above method comprises: a robot terminal, used for receiving the customer's question voice and sending the synthesized video;
and a cloud server, used for receiving the question voice sent by the robot terminal through the Internet and feeding back a corresponding synthesized video to the robot terminal through the Internet according to that question voice, the robot terminal playing the synthesized video;
in a further embodiment, the cloud server comprises: the device comprises a processor, a recording unit, a touch display unit, a communication unit and a lip arrangement unit, wherein the processor is respectively connected with the recording unit, the touch display unit, the communication unit and the lip arrangement unit;
the recording unit is used for acquiring the question voice of the client; the touch display unit is used for customer operation and video playing; the communication unit is used for carrying out data transmission with the cloud server; the lip shape arrangement unit is used for corresponding to the arrangement combination of different lip shapes of each sentence of characters and endowing each prototype video file with different lip shape arrangement combination codes, and the lip shape arrangement codes comprise: when consonants are sounded, lip codes are set to 1 when lip is slightly opened, and are set to 2 when lip is greatly opened, when vowels are sounded, lip codes are set to 3 when lip is slightly opened, and are set to 4 when lip is greatly opened, and when vowels are sounded, lip codes are set to 5 when lip is slightly opened, and are set to 6 when lip is greatly opened.
In a further embodiment, the cloud server comprises:
a receiving and pushing module, used for receiving the data sent by the robot terminal and sending data to the robot terminal;
a voice conversion module, used for converting the question voice received from the cloud server through the Internet into question text and feeding the text back to the cloud server, and likewise for converting the pronunciation manuscript received from the cloud server through the Internet into the answer voice and feeding it back through the Internet;
a matching module, used for matching the question text against the question bank in the cloud server to find the corresponding answer voice or answer video;
and a storage module, used for storing the customer's question voice, the answer voice, the pronunciation manuscript, the synthesized video, and keywords.
The beneficial effects of the invention are: different codes are set for the different lip shapes formed when the characters of the pronunciation manuscript are pronounced, and the prototype videos corresponding to those lip shapes are selected and synthesized, so that the lip shapes stay consistent with the sound while the character's picture is played with the voice, improving realism.
Drawings
Fig. 1 is a flow chart of a video lip synchronization synthesizing method.
Fig. 2 is a block flow diagram of step 2 in fig. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Although the steps in the present invention are arranged by using reference numbers, the order of the steps is not limited, and the relative order of the steps can be adjusted unless the order of the steps is explicitly stated or other steps are required for the execution of a certain step. It is to be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
The applicant addresses the following problem in the existing service industry: because the lip shape of the person in the video and the answering voice are synthesized in post-production, the lips are not synchronized with the voice, the words the customer hears do not match the lip movements, the effect of face-to-face communication with real customer service is lost, and the customer may become psychologically resistant to the service.
Therefore, to solve the above technical problems, the applicant designed a real-person online help-desk service system, together with a video lip-synchronization synthesis method and system that improve its realism.
First, various lip-shaped prototype video files suitable for the virtual robot are stored in the cloud server.
As shown in fig. 1, the video lip-synchronization synthesis method specifically includes the following steps:
Step 1: the cloud server receives the pronunciation manuscript through the terminal device and splits the manuscript into sentences at punctuation marks;
Step 2: the cloud server arranges and combines the lip shapes of each split sentence and matches them against the prototype videos;
Step 3: splicing the successfully matched prototype videos of each sentence to form a synthesized video;
Step 4: calculating the playing time of the synthesized video formed in step 3;
Step 5: the cloud server sets the speech rate according to the time from step 4 so that the pronunciation duration equals the text playing duration, and sends the text to the voice gateway, which converts it into a sound file and sends the file back to the cloud server;
Step 6: combining the synthesized video from step 3 with the sound from step 5 to form the final synthesized video;
Step 7: playing the final synthesized video from step 6 on a specified terminal and exiting the system.
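Steps 4 and 5 amount to fitting the audio duration to the spliced video's playing time. A minimal sketch follows; the function name and the equal-time-per-character model are assumptions, since the patent does not specify how the rate is computed.

```python
def speech_rate_for(text: str, video_seconds: float) -> float:
    """Return a speaking rate, in characters per second, at which the
    voice gateway could render `text` so that the pronunciation duration
    equals the video playing duration (steps 4-5). Assumes every
    character takes equal time, which is a simplification."""
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return len(text) / video_seconds
```

For a four-character sentence whose spliced video plays for 2 seconds, the gateway would be asked to speak at 2 characters per second, so sound and picture end together.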
As shown in fig. 2, step 2 specifically includes the following steps:
Step 2.1: convert each Chinese character in the sentence into pinyin and assign a lip code to each pronounced unit: for a consonant (initial), the code is 1 if the lips are slightly open and 2 if widely open; for a simple vowel (final), 3 if slightly open and 4 if widely open; for a compound vowel, 5 if slightly open and 6 if widely open; this yields the sentence's lip permutation code string;
Step 2.2: search the prototype video library for a prototype video whose lip permutation code is identical or close, where the lip code of the last word of the sentence must match exactly;
Step 2.3: if one is found, go to step 3;
Step 2.4: if the prototype video library contains no identical or close lip permutation code, split the sentence's code a limited number of times until every segment matches a prototype video with an identical or close code (the lip code of the last word of the sentence must still match exactly), splice those prototype videos into the sentence video, and go to step 3;
Step 2.5: if no prototype video with an identical or close code can be found even after the limited splitting, report that a prototype video supplementing this lip permutation code should be added, mark the match as failed, and exit.
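The "limited splitting" of step 2.4 can be read as segmenting the sentence's lip permutation code so that every segment exists in the prototype library. A greedy longest-prefix sketch is below; the library contents and names are invented for illustration, and the exact-match requirement on the last word's code is simplified away.

```python
# Greedy sketch of step 2.4 (illustrative): split the sentence's lip
# permutation code into segments, each present in the prototype video
# library. Returns None when no segmentation exists, corresponding to
# the failure reported in step 2.5.
def split_and_match(code: str, library: set[str]):
    segments = []
    i = 0
    while i < len(code):
        for j in range(len(code), i, -1):  # try the longest prefix first
            if code[i:j] in library:
                segments.append(code[i:j])
                i = j
                break
        else:
            return None  # no prefix of the remainder is in the library
    return segments
```

Note that a greedy search can miss segmentations that backtracking would find; the patent's "limited splitting" leaves the search strategy open, so this is only one plausible reading.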
A video lip-synchronization synthesis system, comprising: a robot terminal, used for receiving the customer's question voice and sending the synthesized video;
and a cloud server, used for receiving the question voice sent by the robot terminal through the Internet and feeding back a corresponding synthesized video to the robot terminal through the Internet according to that question voice, the robot terminal playing the synthesized video;
In a further embodiment, the robot terminal comprises: a processor, a recording unit, a touch display unit, a communication unit, and a lip arrangement unit, the processor being connected to the recording unit, the touch display unit, the communication unit, and the lip arrangement unit respectively;
The recording unit is used for acquiring the customer's question voice; the touch display unit is used for customer operation and video playing; the communication unit is used for data transmission with the cloud server; the lip arrangement unit maps each sentence of text to its permutation and combination of lip shapes and assigns each prototype video file a lip permutation code, where the lip codes are: for consonants, 1 if the lips are slightly open and 2 if widely open; for simple vowels, 3 if slightly open and 4 if widely open; and for compound vowels, 5 if slightly open and 6 if widely open.
The cloud server comprises: a receiving and pushing module for receiving the data sent by the robot terminal and sending data to the robot terminal; a voice conversion module for converting the question voice received from the cloud server through the Internet into question text and feeding the text back to the cloud server, and likewise for converting the pronunciation manuscript received from the cloud server through the Internet into the answer voice and feeding it back through the Internet; a matching module for matching the question text against the question bank in the cloud server to find the corresponding answer voice or answer video; and a storage module for storing the customer's question voice, the answer voice, the pronunciation manuscript, the synthesized video, and keywords.
Because the video is synthesized to match the answer voice, the answer voice and the synthesized video play simultaneously when the virtual robot performs, keeping sound and picture consistent: the pronunciation of the audio and the lip shapes in the picture agree to a high degree, improving the customer's viewing comfort.
Furthermore, it should be understood that although this description refers to embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted only for clarity, and those skilled in the art should take the description as a whole, combining the embodiments as appropriate to form other embodiments they would understand.

Claims (3)

1. A video lip-synchronization synthesis method, characterized in that prototype video files of various lip shapes suitable for a virtual robot are stored in a cloud server;
the video lip synchronization synthesis method specifically comprises the following steps:
Step 1: the cloud server receives the pronunciation manuscript through the terminal device and splits the manuscript into sentences at punctuation marks;
Step 2: the cloud server arranges and combines the lip shapes of each split sentence and matches them against the prototype videos;
Step 3: splicing the successfully matched prototype videos of each sentence to form a synthesized video;
Step 4: calculating the playing time of the synthesized video formed in step 3;
Step 5: the cloud server sets the speech rate according to the time from step 4 so that the pronunciation duration equals the text playing duration, and sends the text to the voice gateway, which converts it into a sound file and sends the file back to the cloud server;
Step 6: combining the synthesized video from step 3 with the sound from step 5 to form the final synthesized video;
Step 7: playing the final synthesized video from step 6 on a specified terminal and exiting the system;
the step 2 specifically comprises the following steps:
step 2.1: converting each Chinese character in the sentence into pinyin, setting lip codes to be 1 when consonants pronounce without closing lips according to vowels of the pinyin, setting lip codes to be 2 when the consonants pronounce while closing lips, setting lip codes to be 3 when the consonants pronounce while closing lips, setting lip codes to be 4 when the lip codes are large, setting lip codes to be 5 when the consonants pronounce while closing lips according to vowels, and setting lip codes to be 6 when the lip codes are large, thereby obtaining a string of lip permutation codes of the sentence;
step 2.2: searching and acquiring a prototype video with identical or similar lip-shaped arrangement codes in a prototype video library, wherein the lip-shaped codes of the last word in a sentence are required to be identical;
step 2.3: if the finding is found, turning to the step 3;
step 2.4: if the lip shape arranged codes with close lip shapes do not exist in the prototype video library, the lip shape arranged codes are subjected to limited splitting until prototype videos with the same or close lip shapes are found in each segment after splitting, the lip shape codes of the last word of a sentence are required to be equal, the prototype videos are spliced into sentence videos, and the step 3 is carried out;
step 2.5: if the lip-shaped equivalent or similar prototype video cannot be found after the limited splitting, the report system adds the prototype video supplementing the lip-shaped arrangement code, the matching fails, and the report system exits.
2. A video lip-sync synthesizing system using a video lip-sync synthesizing method according to claim 1, comprising: the robot terminal is used for receiving the questioning voice of the client and sending the synthesized video;
the cloud server is used for receiving the question voice sent by the robot terminal through the Internet, feeding back a corresponding synthesized video to the robot terminal through the Internet according to the question voice, and playing the synthesized video by the robot terminal;
the cloud server comprises: the device comprises a processor, a recording unit, a touch display unit, a communication unit and a lip arrangement unit, wherein the processor is respectively connected with the recording unit, the touch display unit, the communication unit and the lip arrangement unit;
the recording unit is used for acquiring the question voice of the client; the touch display unit is used for customer operation and video playing; the communication unit is used for carrying out data transmission with the cloud server; the lip shape arrangement unit is used for corresponding to the arrangement combination of different lip shapes of each sentence of characters and endowing each prototype video file with different lip shape arrangement combination codes, and the lip shape arrangement codes comprise: when consonants are sounded, lip codes are set to 1 when lip is slightly opened, and are set to 2 when lip is greatly opened, when vowels are sounded, lip codes are set to 3 when lip is slightly opened, and are set to 4 when lip is greatly opened, and when vowels are sounded, lip codes are set to 5 when lip is slightly opened, and are set to 6 when lip is greatly opened.
3. The system of claim 2, wherein the cloud server comprises:
the receiving and pushing module is used for receiving the data sent by the robot terminal and sending the data to the robot terminal;
the voice conversion module is used for converting the question voice received from the cloud server through the Internet into question text and feeding the text back to the cloud server, and likewise for converting the pronunciation manuscript received from the cloud server through the Internet into the answer voice and feeding it back through the Internet;
the matching module is used for matching the question text against the question bank in the cloud server to find the corresponding answer voice or answer video;
and the storage module is used for storing the customer's question voice, the answer voice, the pronunciation manuscript, the synthesized video, and keywords.
CN201910758080.XA 2019-08-16 2019-08-16 Video lip synchronization synthesis method and system Active CN110493613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910758080.XA CN110493613B (en) 2019-08-16 2019-08-16 Video lip synchronization synthesis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910758080.XA CN110493613B (en) 2019-08-16 2019-08-16 Video lip synchronization synthesis method and system

Publications (2)

Publication Number Publication Date
CN110493613A CN110493613A (en) 2019-11-22
CN110493613B (en) 2020-05-19

Family

ID=68551356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910758080.XA Active CN110493613B (en) 2019-08-16 2019-08-16 Video lip synchronization synthesis method and system

Country Status (1)

Country Link
CN (1) CN110493613B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325817B (en) * 2020-02-04 2023-07-18 清华珠三角研究院 Virtual character scene video generation method, terminal equipment and medium
CN111225237B (en) 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN113178206B (en) * 2021-04-22 2022-05-31 内蒙古大学 AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN101796812A (en) * 2006-03-31 2010-08-04 莱切技术国际公司 Lip synchronization system and method
CN106791539A (en) * 2016-12-26 2017-05-31 国家新闻出版广电总局电影数字节目管理中心 A kind of storage of film digital program and extracting method
CN108010531A (en) * 2017-12-14 2018-05-08 南京美桥信息科技有限公司 A kind of visible intelligent inquiry method and system
CN108038206A (en) * 2017-12-14 2018-05-15 南京美桥信息科技有限公司 A kind of visible intelligent method of servicing and system
CN108090170A (en) * 2017-12-14 2018-05-29 南京美桥信息科技有限公司 A kind of intelligence inquiry method for recognizing semantics and visible intelligent interrogation system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100396091C (en) * 2006-04-03 2008-06-18 北京和声创景音频技术有限公司 Commandos dubbing system and dubbing making method thereof
CN100476877C (en) * 2006-11-10 2009-04-08 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
CN107786889A (en) * 2017-11-13 2018-03-09 北海威德电子科技有限公司 Can synchronous sign language interpreter DTV
CN109308731B (en) * 2018-08-24 2023-04-25 浙江大学 Speech driving lip-shaped synchronous face video synthesis algorithm of cascade convolution LSTM
CN109637518B (en) * 2018-11-07 2022-05-24 北京搜狗科技发展有限公司 Virtual anchor implementation method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101796812A (en) * 2006-03-31 2010-08-04 莱切技术国际公司 Lip synchronization system and method
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN106791539A (en) * 2016-12-26 2017-05-31 国家新闻出版广电总局电影数字节目管理中心 A kind of storage of film digital program and extracting method
CN108010531A (en) * 2017-12-14 2018-05-08 南京美桥信息科技有限公司 A kind of visible intelligent inquiry method and system
CN108038206A (en) * 2017-12-14 2018-05-15 南京美桥信息科技有限公司 A kind of visible intelligent method of servicing and system
CN108090170A (en) * 2017-12-14 2018-05-29 南京美桥信息科技有限公司 A kind of intelligence inquiry method for recognizing semantics and visible intelligent interrogation system

Also Published As

Publication number Publication date
CN110493613A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
Crystal The language revolution
CN110493613B (en) Video lip synchronization synthesis method and system
CN110033659B (en) Remote teaching interaction method, server, terminal and system
US7913155B2 (en) Synchronizing method and system
CN110405791B (en) Method and system for simulating and learning speech by robot
US20180203830A1 (en) Synchronized consumption modes for e-books
WO2018108013A1 (en) Medium displaying method and terminal
CN104735480B (en) Method for sending information and system between mobile terminal and TV
CN111866529A (en) Method and system for hybrid use of virtual real person during video live broadcast
US7613613B2 (en) Method and system for converting text to lip-synchronized speech in real time
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
CN109326151A (en) Implementation method, client and server based on semantics-driven virtual image
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN113850898A (en) Scene rendering method and device, storage medium and electronic equipment
CN112447073A (en) Explanation video generation method, explanation video display method and device
CN111160051B (en) Data processing method, device, electronic equipment and storage medium
US20160247500A1 (en) Content delivery system
Kadam et al. A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation
KR20100115003A (en) Method for generating talking heads from text and system thereof
KR101675049B1 (en) Global communication system
CN109902311A (en) A kind of synchronous English of video signal and multilingual translation system
CN108174123A (en) Data processing method, apparatus and system
US20240153397A1 (en) Virtual meeting coaching with content-based evaluation
US20240153398A1 (en) Virtual meeting coaching with dynamically extracted content
CN111580614A (en) Wearable intelligent device and sign language learning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant