CN110493613A

CN110493613A - Method and system for synthesizing video with audio-lip synchronization

Info

Publication number: CN110493613A
Application number: CN201910758080.XA
Authority: CN
Inventors: 郭志扬; 乔健; 吴鹏程; 陈起航; 朱西锋; 丁航; 陆佳莉
Original assignee: Jiangsu Aoxin Technology Co Ltd
Current assignee: Jiangsu Aoxin Technology Co Ltd
Priority date: 2019-08-16
Filing date: 2019-08-16
Publication date: 2019-11-22
Anticipated expiration: 2039-08-16
Also published as: CN110493613B

Abstract

The present invention discloses a method and system for synthesizing video with audio-lip synchronization, and belongs to the technical field of audio-lip synchronization. The method comprises the following steps: a cloud server receives a pronunciation manuscript from a terminal device and splits the manuscript into several sentences at punctuation marks; the cloud server arranges each resulting sentence as a sequence of lip shapes and matches the lip shapes against prototype videos; the successfully matched prototype videos are spliced to form a composite video; the playback duration of the composite video is calculated; and the cloud server sets the speech rate of the pronunciation manuscript according to that duration, so that the pronunciation duration equals the broadcast duration of the text. The invention assigns different codes to the different lip shapes formed when the words of the manuscript are pronounced, selects the prototype videos with the corresponding lip shapes, and synthesizes them, so that the lip shape of the on-screen figure stays consistent with the sound while the voice plays, which increases realism.

Description

Method and system for synthesizing video with audio-lip synchronization

Technical field

The present invention belongs to the technical field of audio-lip synchronization, and in particular relates to a method and system for synthesizing video with audio-lip synchronization.

Background

To strengthen communication with existing and prospective customers and to provide them with better products and technical services, many businesses and organizations have set up their own customer service and after-sales technical service departments. The staff of these departments carry a heavy daily workload of offline and online customer communication, answering repetitive, tedious questions and giving guidance, and they cannot be online or on duty to serve users 24 hours a day. Virtual live-person robots emerged in response: a large number of videos of a real person and answer voices are stored in a display device, and corresponding feedback is provided for each customer question.

However, because the lip shape of the person in the video and the answer voice are combined in post-production, the lip shape of the person in the video can fall out of sync with the voice. The words the customer hears do not match the lip shape the customer sees, the effect of talking face to face with a human agent is not achieved, and customers therefore instinctively reject this kind of service.

Summary of the invention

To solve the technical problem described in the background above, the present invention provides a realistic method and system for synthesizing video with audio-lip synchronization.

The present invention is achieved through the following technical solution: a method for synthesizing video with audio-lip synchronization, in which the cloud server stores prototype video files of the various lip shapes suitable for use by a virtual robot.

The method for synthesizing video with audio-lip synchronization specifically comprises the following steps:

Step 1: the cloud server receives a pronunciation manuscript from a terminal device and splits the manuscript into several sentences at punctuation marks;

Step 2: the cloud server arranges each resulting sentence as a sequence of lip shapes and matches the lip shapes against the prototype videos;

Step 3: the successfully matched prototype videos are spliced to form a composite video;

Step 4: the playback duration of the composite video formed in step 3 is calculated;

Step 5: the cloud server sets the speech rate for the pronunciation manuscript received from the terminal device according to the duration from step 4, so that the pronunciation duration equals the broadcast duration of the text, and sends the manuscript to a voice gateway; the voice gateway converts the text into a sound file and returns it to the cloud server;

Step 6: the composite video generated in step 3 and the sound generated in step 5 are combined to form the final composite video;

Step 7: the composite video generated in step 6 is played on the specified terminal, and the system exits.
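Step 1 above can be illustrated with a minimal Python sketch that splits a manuscript into sentences at punctuation marks. The function name `split_sentences` and the particular set of punctuation marks are assumptions made for this example; the patent does not state which marks are used.

```python
import re
from typing import List

# Hypothetical punctuation set: common Chinese and ASCII sentence marks.
_PUNCTUATION = r"[，。！？；：、,.!?;:]+"


def split_sentences(manuscript: str) -> List[str]:
    """Split a manuscript into non-empty sentences at punctuation marks."""
    parts = re.split(_PUNCTUATION, manuscript)
    # Drop empty fragments left by trailing or consecutive punctuation.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(split_sentences("您好，欢迎光临。请问需要什么帮助？"))
```

Each returned sentence would then be passed on to the lip-shape arrangement of step 2.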

In a further embodiment, step 2 specifically comprises the following steps:

Step 2.1: each Chinese character in the sentence is converted to pinyin, and a lip-shape code is assigned as follows. When the lips are not closed during vowel articulation, the code is 1 if the lips part slightly during consonant articulation and 2 if the lips open wide. When the lips are closed during vowel articulation, the code is 3 if the lips part slightly during consonant articulation and 4 if the lips open wide. When the lip is bitten during vowel articulation, the code is 5 if the lips part slightly during consonant articulation and 6 if the lips open wide. This yields a string of lip-shape arrangement codes for the sentence;

Step 2.2: the prototype video library is searched for a prototype video whose lip-shape arrangement code is identical or similar; the lip-shape code of the last character of the sentence must be equal;

Step 2.3: if one is found, go to step 3;

Step 2.4: if the prototype video library contains no prototype video with a similar lip-shape arrangement code, the arrangement code is split a limited number of times until every segment has a prototype video with an identical or similar lip shape (the lip-shape code of the last character of the sentence must be equal); these prototype videos are spliced into a complete sentence video, and the method goes to step 3;

Step 2.5: if no prototype video with an identical or similar lip shape can be found even after the limited splitting, the system is notified that a prototype video for this lip-shape arrangement code needs to be added; the match fails, the failure is reported, and the system exits.
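The lookup and limited splitting of steps 2.2 to 2.5 can be sketched as follows. This is only an illustration under stated assumptions: the library is modeled as a hypothetical dictionary from lip-shape arrangement code strings to video identifiers, only identical codes are matched (the patent does not define "similar"), and the limit on splitting is modeled as a hypothetical `max_parts` parameter. Because every segment is matched exactly, the code of the last character is equal automatically.

```python
from typing import Dict, List, Optional


def match_segments(
    codes: str, library: Dict[str, str], max_parts: int = 4
) -> Optional[List[str]]:
    """Cover `codes` with at most `max_parts` library entries.

    Returns the video ids in playback order, using as few segments as
    possible, or None when matching fails (step 2.5).
    """
    # best[i] holds the fewest-segment list of video ids covering codes[:i].
    best: List[Optional[List[str]]] = [None] * (len(codes) + 1)
    best[0] = []
    for end in range(1, len(codes) + 1):
        for start in range(end):
            prefix = best[start]
            piece = codes[start:end]
            if prefix is None or piece not in library:
                continue
            candidate = prefix + [library[piece]]
            current = best[end]
            if len(candidate) <= max_parts and (
                current is None or len(candidate) < len(current)
            ):
                best[end] = candidate
    return best[len(codes)]


if __name__ == "__main__":
    demo_library = {"1324": "v1", "13": "v2", "24": "v3", "6": "v4"}
    print(match_segments("1324", demo_library))   # whole-sentence match
    print(match_segments("13246", demo_library))  # needs one split
    print(match_segments("135", demo_library))    # matching fails
```

A whole-sentence match corresponds to step 2.3, a multi-segment result to step 2.4, and `None` to the failure report of step 2.5.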

A synthesis system using the method for synthesizing video with audio-lip synchronization described above comprises: a robot terminal, for receiving the customer's spoken question and sending the composite video;

a cloud server, for receiving over the internet the spoken question sent by the robot terminal and, according to that question, returning the corresponding composite video to the robot terminal over the internet; the robot terminal plays the composite video.

In a further embodiment, the cloud server comprises: a processor, a recording unit, a touch display unit, a communication unit, and a lip-shape arrangement unit; the processor is connected to the recording unit, the touch display unit, the communication unit, and the lip-shape arrangement unit respectively.

The recording unit captures the customer's spoken question. The touch display unit is operated by the customer and plays video. The communication unit exchanges data with the cloud server. The lip-shape arrangement unit determines the arrangement of lip shapes for each passage of text and assigns each prototype video file its own lip-shape arrangement code. The lip-shape codes are defined as follows: when the lips are not closed during vowel articulation, the code is 1 if the lips part slightly during consonant articulation and 2 if the lips open wide; when the lips are closed during vowel articulation, the code is 3 if the lips part slightly during consonant articulation and 4 if the lips open wide; when the lip is bitten during vowel articulation, the code is 5 if the lips part slightly during consonant articulation and 6 if the lips open wide.
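The six lip-shape codes amount to a lookup over two features: the state of the lips (not closed, closed, or bitten) and how far they open (slightly or wide). The sketch below encodes that table in Python. The names `Contact`, `Aperture`, `lip_code`, and `sentence_code` are hypothetical, and deciding which state and opening a given pinyin syllable has is deliberately left to the caller, since the patent gives no per-syllable table.

```python
from enum import Enum
from typing import Iterable, Tuple


class Contact(Enum):
    NOT_CLOSED = 0  # lips not closed
    CLOSED = 1      # lips closed
    BITTEN = 2      # lip bitten


class Aperture(Enum):
    SLIGHT = 0  # lips part slightly
    WIDE = 1    # lips open wide


def lip_code(contact: Contact, aperture: Aperture) -> int:
    """Map the two features to the codes 1 to 6 given in the text."""
    return 2 * contact.value + aperture.value + 1


def sentence_code(features: Iterable[Tuple[Contact, Aperture]]) -> str:
    """Join the per-character codes into one arrangement code string."""
    return "".join(str(lip_code(c, a)) for c, a in features)


if __name__ == "__main__":
    demo = [
        (Contact.NOT_CLOSED, Aperture.SLIGHT),
        (Contact.CLOSED, Aperture.WIDE),
        (Contact.BITTEN, Aperture.SLIGHT),
    ]
    print(sentence_code(demo))
```

The resulting string is the kind of key that could be compared against the arrangement codes assigned to the prototype video files.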

In a further embodiment, the cloud server comprises:

a receiving and pushing module, for receiving data sent by the robot terminal and sending data to the robot terminal;

a voice conversion module, for converting the spoken question received by the cloud server over the internet into question text and returning it to the cloud server, and for converting the pronunciation manuscript received on the cloud server over the internet into an answer voice and returning it to the cloud server over the internet;

a matching module, for matching the question text to the corresponding answer voice or answer video from the question bank in the cloud server;

a storage module, for storing the customer's spoken questions, answer voices, pronunciation manuscripts, composite videos, and keywords.
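One way to picture the matching module is a keyword lookup against the question bank. The sketch below is an assumption rather than the patented mechanism: the patent says only that question text is matched to an answer and that keywords are stored, so the dictionary layout, the file names, and the function name `match_answer` are all hypothetical.

```python
from typing import Dict, Optional


def match_answer(question_text: str, question_bank: Dict[str, str]) -> Optional[str]:
    """Return the answer for the first stored keyword found in the question.

    `question_bank` maps a keyword to an answer identifier, for example the
    file name of an answer video. Returns None when no keyword matches.
    """
    for keyword, answer in question_bank.items():
        if keyword in question_text:
            return answer
    return None


if __name__ == "__main__":
    demo_bank = {"退款": "answer_refund.mp4", "发票": "answer_invoice.mp4"}
    print(match_answer("请问怎么申请退款", demo_bank))
```

A production system would likely need a more robust text-matching strategy; plain substring search is used here only to keep the example self-contained.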

Beneficial effects of the present invention: different codes are assigned to the different lip shapes formed when the words of the manuscript are pronounced, and the prototype videos with the corresponding lip shapes are selected and synthesized, so that the lip shape of the on-screen figure stays consistent with the sound while the voice plays, which increases realism.

Brief description of the drawings

Fig. 1 is a flowchart of the method for synthesizing video with audio-lip synchronization.

Fig. 2 is a flowchart of step 2 in Fig. 1.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.

Although the steps of the present invention are labeled with numbers, the numbering does not fix their order. Unless the order of steps is explicitly stated, or the execution of a step depends on other steps, the relative order of the steps is adjustable. The term "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.

It is obvious to a person skilled in the art that the invention is not limited to the details of the exemplary embodiments above, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the description above, and all changes that fall within the meaning and range of equivalents of the claims are intended to be included in the invention. No reference sign in the claims should be construed as limiting the claim concerned.

The applicant addresses the following problem in the existing service industry: because the lip shape of the person in the video and the answer voice are combined in post-production, the lip shape of the person in the video can fall out of sync with the voice. The words the customer hears do not match the lip shape the customer sees, the effect of talking face to face with a human agent is not achieved, and customers therefore instinctively reject this kind of service.

To solve this technical problem, the applicant has designed a method and system for synthesizing video with audio-lip synchronization that improve the realism of a live-person online help robot service system.

First, the cloud server stores prototype video files of the various lip shapes suitable for use by a virtual robot.

As shown in Fig. 1, the method for synthesizing video with audio-lip synchronization specifically comprises the following steps:

Step 4: the playback duration of the composite video formed in step 3 is calculated;
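The duration computed in step 4 feeds the speech-rate setting of step 5, where the pronunciation duration must equal the playback duration. The following is a minimal sketch of that arithmetic, assuming the duration of the spliced video is the sum of its segment durations and that the speech rate is expressed in characters per second; neither detail is specified in the patent, and both function names are hypothetical.

```python
from typing import Sequence


def total_duration(segment_seconds: Sequence[float]) -> float:
    """Playback duration of the spliced video, assumed to be the sum of
    the durations of its prototype segments."""
    return float(sum(segment_seconds))


def speech_rate(char_count: int, video_seconds: float) -> float:
    """Characters per second needed for speech to last `video_seconds`."""
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return char_count / video_seconds


if __name__ == "__main__":
    duration = total_duration([1.5, 2.0, 0.5])
    print(duration, speech_rate(20, duration))
```

In this sketch a 20-character manuscript paired with a 4-second video would be spoken at 5 characters per second, a value that would then be passed to whatever text-to-speech service the voice gateway uses.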

As shown in Fig. 2, step 2 specifically comprises the following steps:

Step 2.3: if one is found, go to step 3;

A system for synthesizing video with audio-lip synchronization comprises: a robot terminal, for receiving the customer's spoken question and sending the composite video;

In a further embodiment, the cloud server comprises: a processor, a recording unit, a touch display unit, a communication unit, and a lip-shape arrangement unit; the processor is connected to the recording unit, the touch display unit, the communication unit, and the lip-shape arrangement unit respectively;

The cloud server comprises: a receiving and pushing module, for receiving data sent by the robot terminal and sending data to the robot terminal; a voice conversion module, for converting the spoken question received by the cloud server over the internet into question text and returning it to the cloud server, and for converting the pronunciation manuscript received on the cloud server over the internet into an answer voice and returning it to the cloud server over the internet; a matching module, for matching the question text to the corresponding answer voice or answer video from the question bank in the cloud server; and a storage module, for storing the customer's spoken questions, answer voices, pronunciation manuscripts, composite videos, and keywords.

The composite video is matched with the answer voice, so that during the virtual robot's presentation the answer voice and the composite video play simultaneously and audio and picture stay consistent. The pronunciation in the audio and the lip shape in the picture reach a high degree of realism, which makes viewing more comfortable for the customer.

In addition, it should be understood that although this specification is organized by embodiment, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for synthesizing video with audio-lip synchronization, characterized in that the cloud server stores prototype video files of the various lip shapes suitable for use by a virtual robot;

Step 2: the cloud server arranges each resulting sentence as a sequence of lip shapes and matches the lip shapes against the prototype videos;

Step 4: the playback duration of the composite video formed in step 3 is calculated;

2. The method for synthesizing video with audio-lip synchronization according to claim 1, characterized in that step 2 specifically comprises the following steps:

Step 2.3: if one is found, go to step 3;

3. A synthesis system using the method for synthesizing video with audio-lip synchronization according to any one of claims 1 to 2, characterized by comprising: a robot terminal, for receiving the customer's spoken question and sending the composite video;

4. The system for synthesizing video with audio-lip synchronization according to claim 3, characterized in that the cloud server comprises: a processor, a recording unit, a touch display unit, a communication unit, and a lip-shape arrangement unit; the processor is connected to the recording unit, the touch display unit, the communication unit, and the lip-shape arrangement unit respectively;

5. The system for synthesizing video with audio-lip synchronization according to claim 4, characterized in that the cloud server comprises: