CN109587543B - Audio synchronization method and apparatus and storage medium - Google Patents


Info

Publication number
CN109587543B
CN109587543B
Authority
CN
China
Prior art keywords
dubbing, audio, target, text, time length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811616135.5A
Other languages
Chinese (zh)
Other versions
CN109587543A (en)
Inventor
唐大闰
徐浩
吴明辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Miaozhen Information Technology Co Ltd
Original Assignee
Miaozhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miaozhen Information Technology Co Ltd filed Critical Miaozhen Information Technology Co Ltd
Priority to CN201811616135.5A priority Critical patent/CN109587543B/en
Publication of CN109587543A publication Critical patent/CN109587543A/en
Application granted granted Critical
Publication of CN109587543B publication Critical patent/CN109587543B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Television Signal Processing For Recording (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The invention discloses an audio synchronization method and apparatus and a storage medium. The method comprises: acquiring a dubbing file recorded for dubbing a target video file; dividing the dubbing audio in the dubbing file into a plurality of dubbing segments at unit intervals; grouping the dubbing segments into a plurality of dubbing sets according to target audio extracted from the target video file; comparing, for each dubbing set in turn, the dubbing playback duration used by the set with a target playback duration, where the target playback duration is the playback duration of the target audio segment in the target audio that corresponds to the set; and adjusting the dubbing segments in each set according to the comparison result so that the dubbing audio plays synchronously with the target audio. The invention thereby solves the technical problem that audio synchronization methods in the related art involve high operational complexity.

Description

Audio synchronization method and apparatus and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to an audio synchronization method and apparatus, and a storage medium.
Background
Today, more and more film and television productions adopt post-production dubbing to keep noise picked up on set during shooting out of the finished video; that is, dubbing actors record dedicated dubbing tracks for the characters in the production.
However, during recording, dubbing actors often need multiple attempts to keep their speaking rhythm consistent with the character on screen. In other words, synchronizing the dubbing audio with the character audio currently relies on the dubbing actor manually controlling the dubbing rhythm, which entails high operational complexity.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
Embodiments of the present invention provide an audio synchronization method and apparatus and a storage medium, so as to at least solve the technical problem that audio synchronization methods in the related art involve high operational complexity.
According to an aspect of an embodiment of the present invention, an audio synchronization method is provided, including: acquiring a dubbing file recorded for dubbing a target video file; dividing the dubbing audio in the dubbing file into a plurality of dubbing segments at unit intervals; grouping the plurality of dubbing segments into a plurality of dubbing sets according to the target audio extracted from the target video file; comparing, for each dubbing set in turn, the dubbing playback duration used by the set with a target playback duration, where the target playback duration is the playback duration of the target audio segment in the target audio that corresponds to the set; and adjusting the dubbing segments in the dubbing sets according to the comparison result so that the dubbing audio plays synchronously with the target audio.
As an optional example, grouping the plurality of dubbing segments into a plurality of dubbing sets according to the target audio extracted from the target video file includes: acquiring a first text set obtained by performing text conversion on the target audio, where each text segment contained in the first text set indicates the text information corresponding to one of the object audio segments obtained by dividing the target audio at a target time interval; and repeatedly executing the following steps until all text segments in the first text set have been traversed: acquiring the current text segment from the first text set; acquiring the target dubbing segments corresponding to the current text segment from the plurality of dubbing segments; assigning those target dubbing segments to a dubbing set; and taking the next text segment as the current text segment.
As an optional example, acquiring the first text set obtained by text conversion of the target audio includes: extracting the target audio from the target video file; dividing the target audio into a plurality of object audio segments at the target time interval; and performing text conversion on the plurality of object audio segments to obtain the first text set, where, when the playback duration of a key audio clip contained in one object audio segment extends into the next target time interval, the key audio clip is marked as the same audio clip in every object audio segment it spans.
As an optional example, adjusting the dubbing segments in the dubbing set according to the comparison result includes: stretching the dubbing playback duration used by the dubbing set to the target playback duration when the comparison result indicates that it is less than the target playback duration; and compressing the dubbing playback duration used by the dubbing set to the target playback duration when the comparison result indicates that it is greater than the target playback duration.
As an optional example, after adjusting the dubbing segments in the dubbing set according to the comparison result, the method further includes: reducing the frequency of the dubbing segments in the dubbing set when the dubbing playback duration used by the set has been stretched to the target playback duration; and increasing the frequency of the dubbing segments in the set when the dubbing playback duration used by the set has been compressed to the target playback duration.
As an optional example, dividing the dubbing audio in the dubbing file into a plurality of dubbing segments at unit intervals includes: performing text conversion on the dubbing audio to obtain a dubbing text; and dividing the dubbing audio into the plurality of dubbing segments in units of the words in the dubbing text, or in units of the phonemes of the words in the dubbing text.
According to another aspect of the embodiments of the present invention, there is also provided an audio synchronization apparatus, including: an acquisition unit, configured to acquire a dubbing file recorded for dubbing a target video file; a first dividing unit, configured to divide the dubbing audio in the dubbing file into a plurality of dubbing segments according to unit intervals; a second dividing unit, configured to divide the plurality of dubbing segments into a plurality of dubbing sets according to a target audio extracted from the target video file; a comparison unit, configured to compare a dubbing playing duration used by each dubbing set with a target playing duration in sequence, where the target playing duration is an audio playing duration used by a target audio clip corresponding to the dubbing set in the target audio; and the adjusting synchronization unit is used for adjusting the dubbing fragments in the dubbing set according to the comparison result so as to enable the dubbing audio and the target audio to be played synchronously.
As an optional example, the second dividing unit includes: an obtaining module, configured to obtain a first text set obtained by performing text conversion on the target audio, where each text segment included in the first text set is used to indicate text information corresponding to each object audio segment obtained by dividing the target audio according to a target time interval; the processing module is used for repeatedly executing the following steps until all text segments in the first text set are traversed: acquiring a current text fragment from the first text set; acquiring a target dubbing fragment corresponding to the current text fragment from the plurality of dubbing fragments; dividing the target dubbing fragment into a dubbing set; and acquiring the next text segment as the current text segment.
As an optional example, the obtaining module includes: the extraction submodule is used for extracting the target audio from the target video file; a dividing submodule, configured to divide the target audio into a plurality of object audio segments according to the target time interval; and a conversion sub-module, configured to perform text conversion on the multiple object audio clips to obtain the first text set, where, when a playing duration of a key audio clip included in the object audio clip reaches a next target time interval, the key audio clip is marked as a same audio clip in the object audio clip corresponding to the key audio clip.
As an optional example, the adjusting synchronization unit includes: a first adjusting module, configured to stretch the dubbing playing time duration used by the dubbing set to the target playing time duration when the comparison result indicates that the dubbing playing time duration used by the dubbing set is smaller than the target playing time duration; and a second adjusting module, configured to compress the dubbing playing time duration used by the dubbing set to the target playing time duration when the comparison result indicates that the dubbing playing time duration used by the dubbing set is greater than the target playing time duration.
As an optional example, the apparatus further includes: a third adjusting module, configured to, after adjusting the dubbing segments in the dubbing collection according to the comparison result, lower the frequencies of the dubbing segments in the dubbing collection when the dubbing playing time duration used by the dubbing collection is extended to the target playing time duration; a fourth adjusting module, configured to, after the dubbing segments in the dubbing collection are adjusted according to the comparison result, increase the frequency of the dubbing segments in the dubbing collection when the dubbing playing time duration used by the dubbing collection is compressed to the target playing time duration.
As an alternative example, the first division unit includes: the conversion module is used for performing text conversion on the dubbing audio to obtain a dubbing text; a dividing module, configured to divide the dubbing audio into the plurality of dubbing segments by taking a word in the dubbing text as a unit; or, the dubbing audio is divided into the plurality of dubbing fragments by using the phoneme of the character in the dubbing text as a unit.
According to a further aspect of the embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is configured to execute the above audio synchronization method when running.
In the embodiments of the invention, the dubbing audio is divided into a plurality of dubbing segments at unit intervals, and the segments are grouped into a plurality of dubbing sets according to the rhythm of the target audio in the target video file. The dubbing segments in each set are then adjusted according to the comparison between the playback duration used by the set and the target playback duration. This automates the adjustment of the dubbing segments, and thus the synchronization control of the dubbing audio and the target audio, simplifying the synchronization operation, improving synchronization efficiency, and solving the technical problem that audio synchronization methods in the related art involve high operational complexity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an alternative audio synchronization method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative audio synchronization method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative audio synchronizer according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another alternative audio synchronization apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, an audio synchronization method is provided. As an optional implementation, as shown in fig. 1, the audio synchronization method includes:
s102, acquiring a dubbing file recorded for dubbing a target video file;
s104, dividing dubbing audio in the dubbing file into a plurality of dubbing fragments according to unit intervals;
s106, dividing a plurality of dubbing fragments into a plurality of dubbing sets according to the target audio extracted from the target video file;
s108, sequentially comparing the dubbing playing time length used by each dubbing set with a target playing time length, wherein the target playing time length is the audio playing time length used by a target audio clip corresponding to the dubbing set in the target audio;
s110, adjusting dubbing segments in the dubbing set according to the comparison result so as to enable the dubbing audio and the target audio to be played synchronously.
Optionally, in this embodiment, the audio synchronization method may be, but is not limited to being, applied to scenarios in which a target video file is dubbed: the method performs synchronization control between the recorded dubbing audio and the target audio extracted from the target video file to be dubbed. The target video file may include, but is not limited to, movie and television works, documentaries, animations, and other video files whose characters require dubbing or voice-over. The above is merely an example, and this embodiment is not limited thereto.
It should be noted that, in this embodiment, after the dubbing file recorded for dubbing the target video file is acquired, the dubbing audio in the dubbing file is divided into a plurality of dubbing segments at unit intervals. The dubbing segments are then grouped into a plurality of dubbing sets according to the target audio extracted from the target video file, and the dubbing playback duration used by each dubbing set is compared with the target playback duration, i.e., the playback duration of the target audio segment in the target audio that corresponds to the set. The dubbing segments in each set are adjusted according to the comparison result, so that the dubbing audio and the target audio play synchronously and automatically. In other words, by dividing the dubbing audio into segments at unit intervals and grouping the segments into sets that follow the rhythm of the target audio, the comparison between each set's playback duration and its target playback duration drives an automatic adjustment of the set's segments. This automates the synchronization control of the dubbing audio and the target audio, simplifies the synchronization operation, improves synchronization efficiency, and overcomes the comparatively complex audio synchronization operations of the related art.
Optionally, in this embodiment, dividing the dubbing audio in the dubbing file into a plurality of dubbing segments at unit intervals includes: performing text conversion on the dubbing audio to obtain a dubbing text; a plurality of dubbing segments may then be obtained in, but not limited to, one of the following ways:
1) dividing the dubbing audio into a plurality of dubbing segments in units of the words in the dubbing text;
2) dividing the dubbing audio into a plurality of dubbing segments in units of the phonemes of the words in the dubbing text.
It should be noted that each word may be, but is not limited to being, split into multiple phonemes according to its initial and/or final. That is, in this embodiment, the dubbing segments may be obtained by dividing the dubbing audio at unit intervals by word, or by phoneme; the latter divides the dubbing audio into finer units, so that the division result can be assigned to dubbing sets more accurately, further improving the precision of the synchronization adjustment.
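As an illustration of the phoneme-level division, the following Python sketch splits a pinyin-romanized syllable into an initial and a final. The initial list and the splitting rule are simplified assumptions for illustration, not part of the patent:

```python
# Multi-letter initials (zh, ch, sh) are listed first so they are
# matched before their single-letter prefixes.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_phonemes(syllable: str):
    """Split a pinyin syllable into [initial, final]; a syllable with
    no initial is returned as a single phoneme."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]
```

For example, "hao" splits into the units "h" and "ao", giving two dubbing segments where word-level division would give one.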
Optionally, in this embodiment, the target audio may be, but is not limited to, subjected to text conversion to obtain a first text set, where each text segment included in the first text set is used to indicate text information corresponding to each object audio segment obtained by dividing the target audio according to the target time interval.
It should be noted that the target time interval may be, but is not limited to, the unit time interval; that is, the target audio is divided into a plurality of object audio segments at the unit time interval, and each object audio segment is then converted into a corresponding text segment. Each text segment contains the one or more words converted from its object audio segment. In addition, in this embodiment, when the playback duration of a key audio clip within one object audio segment extends into the next target time interval, the repeated entry may be retained, with a marking, in the text segments corresponding to all the object audio segments it spans. For example, when the pronunciation of a word in the target audio (i.e., the key audio clip) lasts longer than a unit time interval — say the interval is 1 ms and the pronunciation lasts 2 ms — the two text segments it spans are both marked to indicate that this is the pronunciation of a single word rather than a repeated pronunciation.
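The spanning-word marking described above can be illustrated with a small Python sketch (the data layout is hypothetical, not the patent's): each word timing is distributed into every fixed interval it covers and tagged with one shared index, so a word that crosses an interval boundary is recorded as a single pronunciation rather than a repeat:

```python
import math

def bin_words(words, interval=1.0):
    """Distribute (word, start, end) timings into fixed intervals.
    A word whose pronunciation spans a boundary appears in every
    interval it covers, tagged with one shared index."""
    bins = {}
    for idx, (word, start, end) in enumerate(words):
        first = int(start // interval)
        last = int(math.ceil(end / interval)) - 1
        for b in range(first, last + 1):
            bins.setdefault(b, []).append((word, idx))
    return bins
```

A word spoken from 0.5 s to 1.5 s lands in both interval 0 and interval 1 with the same index, which is the per-interval marking the text above describes.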
Optionally, in this embodiment, the dubbing segments obtained by dividing the dubbing audio may be aligned with the object audio segments obtained by dividing the target audio as follows, but not limited thereto: the target dubbing segments corresponding to each text segment are selected from the plurality of dubbing segments and assigned to a dubbing set, so that the text segments serve as the reference medium for aligning the dubbing segments in the dubbing audio with the object audio segments in the target audio. This automatically keeps the dubbing audio and the target audio playing at the same rhythm, without manually adjusting the dubbing actor's speaking speed.
Optionally, in this embodiment, after the alignment processing, in order to preserve the sound quality of the dubbing audio and avoid the distortion introduced by changing its playback duration, the method may further include, but is not limited to, adjusting the frequency of the dubbing segments in each dubbing set in accordance with the playback-duration adjustment, so that the aligned dubbing sets connect smoothly.
With the embodiments provided in this application, the dubbing audio is divided into a plurality of dubbing segments at unit intervals and grouped into a plurality of dubbing sets according to the rhythm of the target audio in the target video file. The dubbing segments in each set are adjusted using the comparison between the set's playback duration and the target playback duration, automating the synchronization control of the dubbing audio and the target audio, simplifying the synchronization operation, improving synchronization efficiency, and resolving the complexity of audio synchronization operations in the related art.
As an optional scheme, dividing a plurality of dubbing segments into a plurality of dubbing sets according to a target audio extracted from a target video file includes:
s1, obtaining a first text set obtained by text conversion of the target audio, wherein each text segment contained in the first text set is used for indicating text information corresponding to each object audio segment obtained by dividing the target audio according to the target time interval;
s2, repeatedly executing the following steps until all text segments in the first text set are traversed:
s21, acquiring current text segments from the first text set;
s22, obtaining a target dubbing fragment corresponding to the current text fragment from the plurality of dubbing fragments;
s23, dividing the target dubbing fragment into a dubbing set;
and S24, acquiring the next text segment as the current text segment.
It should be noted that, in this embodiment, the text segment may include, but is not limited to, one or more words converted from the object audio segment. For example, assuming that the target time interval for acquiring the text segment is 1 second(s), the text segment includes text information corresponding to an audio segment played by the target audio within 1 s. The above is merely an example, and this is not limited in this embodiment.
For example, assume the dubbing segments are divided in units of words, and that the text segment corresponding to the object audio segment played by the target audio in the 1st second is "hello". According to this text segment, the two per-word target dubbing segments "you" and "good" (the word-by-word rendering of the greeting) are retrieved from the word-divided dubbing segments and assigned to dubbing set A. Next, the text segment corresponding to the object audio segment played in the 2nd second is "do"; the matching target dubbing segment "do" is retrieved and assigned to dubbing set B. Proceeding by analogy yields all the dubbing sets corresponding to the dubbing audio.
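Under the assumption that each character of a text segment corresponds to exactly one word-divided dubbing segment, the grouping in this example can be sketched as follows (illustrative only, not the patent's implementation):

```python
def group_by_text(dubbing_segments, text_segments):
    """Walk the per-interval text segments of the target audio and pull
    the matching run of word-divided dubbing segments into one dubbing
    set per text segment (steps S1-S24)."""
    sets, pos = [], 0
    for text in text_segments:
        count = len(text)  # assumed: one dubbing segment per character
        sets.append(dubbing_segments[pos:pos + count])
        pos += count
    return sets
```

With dubbing segments ["a", "b", "c"] and text segments ["ab", "c"], the first set receives two segments and the second receives one, mirroring sets A and B above.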
Optionally, in this embodiment, the obtaining of the first text set obtained by text conversion of the target audio includes:
s11, extracting target audio from the target video file;
s12, dividing the target audio into a plurality of object audio fragments according to the target time interval;
and S13, performing text conversion on the plurality of object audio segments to obtain the first text set, where, when the playback duration of a key audio clip contained in an object audio segment is greater than the target time interval, the key audio clip is marked as the same audio clip in every object audio segment it spans.
It should be noted that, in the present embodiment, in the process of converting the target audio into the text, the characters converted from the object audio segments divided according to the target time interval may be distributed to corresponding positions on the time axis, so as to obtain the text segments corresponding to the target time interval.
For example, assume the target time interval is in seconds (s), the time axis is also in seconds, and the target audio lasts 1 minute. The target audio can then be divided into 60 object audio segments of 1 s each; text conversion of these segments yields 60 text segments, one per time slot on the time axis, which together form the first text set corresponding to the target audio.
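The per-interval division of the timeline can be sketched in a few lines (illustrative):

```python
def divide_audio(total_seconds: int, interval: int = 1):
    """Slice a timeline of `total_seconds` into (start, end) object
    audio segments, one per target time interval: 60 segments for a
    1-minute target audio at a 1 s interval."""
    return [(t, t + interval) for t in range(0, total_seconds, interval)]
```

Each (start, end) pair identifies the slice of target audio that is converted into one text segment and one time slot on the axis.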
With the embodiments provided in this application, by obtaining the first text set corresponding to the target audio and the plurality of dubbing sets corresponding to the dubbing audio, the text segments in the first text set are used to automatically align the dubbing sets with the target audio, achieving automatic audio synchronization control and simplifying the synchronization operation.
As an alternative, adjusting the dubbing segments in the dubbing set according to the comparison result comprises:
1) stretching the dubbing playback duration used by the dubbing set to the target playback duration when the comparison result indicates that the dubbing playback duration used by the set is less than the target playback duration;
2) compressing the dubbing playback duration used by the dubbing set to the target playback duration when the comparison result indicates that the dubbing playback duration used by the set is greater than the target playback duration.
Optionally, in this embodiment, after adjusting the dubbing segments in the dubbing set according to the comparison result, the method further includes:
1) reducing the frequency of dubbing segments in the dubbing set under the condition that the dubbing playing time length used by the dubbing set is stretched to the target playing time length;
2) and increasing the frequency of the dubbing segments in the dubbing set under the condition that the dubbing playing time length used by the dubbing set is compressed to the target playing time length.
It should be noted that, in this embodiment, the target playing time length may be, but is not limited to, a target time interval, and the target audio clip corresponding to the target playing time length may be, but is not limited to, an object audio clip corresponding to the target time interval. That is, whether the dubbing audio and the target audio are synchronous is determined by comparing the dubbing playing time length used by each dubbing set with the target time interval of the target audio. Further, when the comparison result indicates that the two are not synchronized, the two can be synchronized by adjusting the dubbing segments in the dubbing set.
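The stretch-or-compress rule above, together with the inverse frequency compensation, can be sketched as follows. This is a minimal sketch: the function name and the simple reciprocal relation between time scale and frequency scale are illustrative assumptions, not details given by the source.

```python
def synchronize_set(set_duration, target_duration):
    """Return (time_scale, frequency_scale) for one dubbing set.

    time_scale > 1 stretches the set's playing time to the target;
    time_scale < 1 compresses it. The frequency is compensated in the
    opposite direction: stretching lowers it, compressing raises it,
    to limit audible distortion (assumed reciprocal relation).
    """
    if set_duration == target_duration:
        return 1.0, 1.0  # already synchronous, no adjustment needed
    time_scale = target_duration / set_duration
    frequency_scale = 1.0 / time_scale
    return time_scale, frequency_scale
```

For example, a 0.5 s dubbing set with a 1 s target is stretched (time scale 2.0) and its frequency lowered (scale 0.5); a 2 s set is compressed (0.5) and its frequency raised (2.0).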
Optionally, in this embodiment, the adjusting dubbing segments in the dubbing set may include, but are not limited to: and adjusting the dubbing playing time length used by the dubbing set. The adjusting the dubbing playing time length may include, but is not limited to:
1) the total dubbing play time length is adjusted. That is, the total dubbing playback time period is subjected to the overall stretching or compressing process so as to be equal to the target playback time period.
2) Adjusting the playing time length of each dubbing segment in the dubbing set so that the total dubbing playing time length equals the target playing time length. That is, the playing duration of each word in the dubbing set may be adjusted separately, or the playing duration of each phoneme of each word may be adjusted; controlling synchronization by adjusting units of smaller magnitude improves the adjustment precision.
It should be noted that, since many pronunciations have long trailing sounds, in this embodiment method 2) may apply, but is not limited to applying, a different adjustment to each dubbing segment. For example, the first dubbing segment is not compressed, the second dubbing segment is stretched by 10%, and the third dubbing segment is compressed by 30%, so that the dubbing playing time length of the whole dubbing set is adjusted flexibly.
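The per-segment adjustment of method 2) can be sketched as below: each dubbing segment gets its own scale factor (the 1.0 / 1.1 / 0.7 factors mirror the example in the text), and a final uniform correction makes the set's total duration exactly match the target. The function and its two-pass design are illustrative assumptions.

```python
def scale_segments(durations, factors, target):
    """Per-segment scaling followed by a uniform fit to `target` seconds.

    `durations`: original per-segment playing times (seconds).
    `factors`: hand-chosen per-segment scale factors (e.g. to spare a
    segment with a long trailing sound from compression).
    """
    assert len(durations) == len(factors)
    shaped = [d * f for d, f in zip(durations, factors)]
    fit = target / sum(shaped)  # residual uniform stretch/compression
    return [d * fit for d in shaped]

# Three word segments totalling 1.9 s must fit a 1 s target interval.
segments = [0.4, 0.5, 1.0]
adjusted = scale_segments(segments, [1.0, 1.1, 0.7], target=1.0)
```

The first pass distributes the adjustment unevenly; the second pass guarantees the set-level duration constraint still holds.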
Further, in this embodiment, since compressing the playing time may cause distortion of the dubbing audio, frequency adjustment is also performed on the dubbing segments in a dubbing set whose dubbing playing time length has been adjusted. The frequency adjustment of the dubbing segments may include, but is not limited to: adjusting the frequency of the speech indicated by the dubbing segments. For example, in the case that the dubbing playing time length used by the dubbing set is stretched to the target playing time length, the frequency of the speech indicated by the dubbing segments in the set is lowered; in the case that the dubbing playing time length used by the dubbing set is compressed to the target playing time length, the frequency of the speech indicated by the dubbing segments in the set is raised.
Through the embodiments provided by the present application, by adjusting the dubbing sets, not only is audio synchronization control realized, but the linking process between the dubbing sets is also optimized, ensuring that the transitions sound natural and realistic.
The description is made with reference to the example shown in fig. 2:
preparation step (also referred to as step 0): preparing a speech required for dubbing recording;
step 1: a video recording module acquires a target video file through a camera, extracts the target audio from the target video file, and converts the target audio into text distributed on a time axis to obtain the first text set. For example, assume that the time axis is divided into several time grids in units of 1 second; the target audio within each second is converted into words, which are distributed into the time grid corresponding to that second to obtain a text segment. It should be noted that, if the time span of a certain key audio segment (e.g. a certain word) in a divided object audio segment exceeds 1 second, the word is placed into every occupied 1-second time grid and marked as the pronunciation of one and the same word, rather than multiple repeated pronunciations of the same word. If multiple words are converted within 1 second, they are all distributed into that 1-second grid. In other words, a text segment includes one or more words. It should be noted that the time-axis dividing unit is not limited in this embodiment, that is, the target time interval may be set to different values according to the actual scene.
Step 2: and recording the dubbing for the target video file by using an audio recording module through a microphone to obtain a dubbing file, and automatically converting the dubbing audio in the dubbing file into characters which are distributed on a time axis. Here, the text distribution mode of the dubbing audio may refer to the text distribution mode of the target video, but is not described herein again.
Step 3: a speech dubbing splitting module splits the dubbing audio in units of words to obtain a plurality of dubbing segments, e.g. one word per dubbing segment.
Step 4: a speech dubbing alignment module redistributes the plurality of dubbing segments split in step 3 on the time axis according to the distribution of each word on the time axis in step 1, so as to align them with the text segments of step 1. That is, the plurality of dubbing segments of step 3 are grouped according to the text segments to obtain a plurality of dubbing sets.
Step 5: a speech word linkage optimization module performs duration adjustment and sound-quality processing on the audio segments in the plurality of dubbing sets of step 4, and then performs optimized linking and combination.
For example, suppose the text segment corresponding to 1 s contains the two words "hello". If, in step 3, the duration of the two corresponding dubbing segments in a dubbing set is 2 s, the dubbing segments in that set are compressed; further, to preserve sound quality, the frequency of the pronunciation in the dubbing segments may be appropriately increased. Conversely, if the duration of the two dubbing segments is 0.5 s in step 3, the dubbing segments in the set are stretched and the frequency of the pronunciation is appropriately reduced.
In addition, a plurality of dubbing segments can be split according to the phoneme of the pronunciation of the text, and the adjustment mode can refer to the above example, which is not described herein again. It should be noted that the adjustment manner of each dubbing segment in the plurality of dubbing segments may be different, so as to ensure the flexibility of audio synchronization control.
Step 6: and obtaining the adjusted final dubbing audio frequency, and directly synthesizing the final dubbing audio frequency with the target video to generate a final dubbing video file.
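Steps 1 through 6 above can be sketched end to end as follows. All data structures and names here are hypothetical toy stand-ins: a real system would use speech recognition and audio DSP where this sketch uses word lists and durations. Each dubbing set is fitted to its 1-second time grid, with the frequency compensated inversely.

```python
def pipeline(text_set, dubbing_segments):
    """Plan the synchronization of dubbing audio against the target audio.

    text_set: per-second word lists from the target audio (step 1).
    dubbing_segments: (word, duration) pairs from the dubbing audio in
    recording order (steps 2-3). Returns, per second, the dubbing set
    (step 4) and its time/frequency scale factors (step 5).
    """
    plan, idx = [], 0
    for second, words in enumerate(text_set):
        group = dubbing_segments[idx: idx + len(words)]  # one set per grid
        idx += len(words)
        set_dur = sum(d for _, d in group)
        if set_dur:
            time_scale = 1.0 / set_dur   # target duration is one grid = 1 s
            freq_scale = set_dur         # inverse frequency compensation
        else:
            time_scale = freq_scale = 1.0
        plan.append((second, group, time_scale, freq_scale))
    return plan

text_set = [["you", "good"], ["do"]]                  # from target audio
dubbing = [("you", 0.8), ("good", 1.2), ("do", 0.5)]  # recorded dubbing
plan = pipeline(text_set, dubbing)
```

In this toy run the first set (2.0 s total) is compressed to 1 s with its frequency raised, while the second (0.5 s) is stretched with its frequency lowered, matching the worked example in step 5.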
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided an audio synchronization apparatus for implementing the audio synchronization method described above. As shown in fig. 3, the apparatus includes:
1) an obtaining unit 302, configured to obtain a dubbing file recorded by dubbing for a target video file;
2) a first dividing unit 304, configured to divide dubbing audio in the dubbing file into a plurality of dubbing segments at unit intervals;
3) a second dividing unit 306, configured to divide the multiple dubbing segments into multiple dubbing sets according to the target audio extracted from the target video file;
4) a comparing unit 308, configured to sequentially compare the dubbing playing duration used by each dubbing set with a target playing duration, where the target playing duration is an audio playing duration used by a target audio clip corresponding to the dubbing set in the target audio;
5) and an adjusting synchronization unit 310, configured to adjust the dubbing segments in the dubbing collection according to the comparison result, so that the dubbing audio and the target audio are played synchronously.
Optionally, in this embodiment, the audio synchronization apparatus may be applied, but is not limited to, in a scenario of dubbing a target video file: the apparatus performs synchronization control on the recorded dubbing audio and the target audio extracted from the target video file to be dubbed. The target video file may include, but is not limited to: movies and television works, documentaries, animations, and other video files whose characters need dubbing or which need voice-over narration. The above is merely an example, and is not limited in this embodiment.
It should be noted that, in this embodiment, after a dubbing file recorded for dubbing a target video file is acquired, the dubbing audio in the dubbing file is divided into a plurality of dubbing segments at unit intervals. The dubbing segments are divided into a plurality of dubbing sets according to the target audio extracted from the target video file, and the dubbing playing time length used by each dubbing set is compared with a target playing time length, where the target playing time length is the audio playing time length used by the target audio segment in the target audio corresponding to that dubbing set. The dubbing segments in the dubbing sets are then adjusted according to the comparison results, so that the dubbing audio and the target audio are automatically played synchronously. That is to say, the dubbing audio is divided into a plurality of dubbing segments at unit intervals, and the dubbing segments are grouped into a plurality of dubbing sets according to the rhythm of the target audio in the target video file; the comparison between each set's playing time length and the target playing time length is then used to adjust the dubbing segments automatically. This realizes automatic synchronization control of the dubbing audio and the target audio, simplifies the audio synchronization operation, improves audio synchronization efficiency, and solves the problem in the related art that the audio synchronization operation is relatively complex.
Optionally, in this embodiment, the first dividing unit includes: the conversion module is used for performing text conversion on the dubbing audio to obtain a dubbing text; a dividing module, configured to divide the dubbing audio into the plurality of dubbing segments by taking a word in the dubbing text as a unit; or, the dubbing audio is divided into the plurality of dubbing fragments by using the phoneme of the character in the dubbing text as a unit.
It should be noted that each word may be, but is not limited to being, split into multiple phonemes according to its initials and/or finals. That is to say, in this embodiment, the dubbing segments may be obtained by dividing the dubbing audio in units of words, or in units of the phonemes of the words; dividing the dubbing audio into finer unit segments allows the division result to be grouped into dubbing sets more accurately, further improving the accuracy of the synchronization adjustment.
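The phoneme-level division can be sketched as below. The initial/final table and the even split of a word's duration among its phonemes are illustrative assumptions; a real system would derive phoneme boundaries from the audio itself.

```python
# Hypothetical initial/final (shengmu/yunmu) table for a few pinyin words.
PHONEMES = {"ni": ("n", "i"), "hao": ("h", "ao")}

def split_into_phonemes(word, duration):
    """Split one word-level dubbing segment into phoneme-level segments,
    dividing the playing duration evenly among the phonemes. Words not
    in the table are kept as a single segment."""
    parts = PHONEMES.get(word, (word,))
    per = duration / len(parts)
    return [(p, per) for p in parts]
```

Adjusting at this finer granularity means, for instance, that only the long final of a trailing sound needs to be compressed while its initial is left untouched.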
Optionally, in this embodiment, the target audio may be, but is not limited to, subjected to text conversion to obtain a first text set, where each text segment included in the first text set is used to indicate text information corresponding to each object audio segment obtained by dividing the target audio according to the target time interval.
It should be noted that the target time interval may be, but is not limited to, a unit time interval; that is, the target audio is divided into a plurality of object audio segments according to the unit time interval, and each object audio segment is then converted into a corresponding text segment, each text segment comprising one or more words converted from the corresponding object audio segment. In addition, in this embodiment, when the playing duration of a key audio clip in one object audio clip extends into the next target time interval, the key audio clip may be, but is not limited to being, retained in all the text segments it spans. For example, when the pronunciation of a word in the target audio (i.e. the key audio segment) spans more than a unit time interval, e.g. the unit is 1 ms and the pronunciation lasts 2 ms, the two spanned text segments may be marked to indicate that this is the pronunciation of one and the same word, not a repeated pronunciation.
Optionally, in this embodiment, dubbing alignment may be, but is not limited to being, performed between the dubbing segments divided from the dubbing audio and the object audio segments divided from the target audio: a target dubbing segment corresponding to a text segment is selected from the plurality of dubbing segments and divided into a dubbing set. Using the text segments as the alignment reference medium, the dubbing segments of the dubbing audio are synchronously aligned with the object audio segments of the target audio. In this way, the dubbing audio and the target audio are automatically controlled to play at the same rhythm, without manually adjusting the dubbing actor's speaking speed.
Optionally, in this embodiment, after the dubbing alignment processing, in order to ensure the sound quality of the dubbing audio and avoid distortion caused by adjusting its playing time, the method may further include, but is not limited to: correspondingly adjusting the frequency of the dubbing segments in the dubbing sets according to the playing-time adjustment, so as to ensure smooth linking of the aligned dubbing sets.
According to the embodiments provided by the present application, the dubbing audio is divided into a plurality of dubbing segments at unit intervals, and the dubbing segments are grouped into a plurality of dubbing sets according to the rhythm of the target audio in the target video file. The comparison between the playing time length used by each dubbing set and the target playing time length is then used to adjust the dubbing segments in the sets automatically, realizing automatic synchronization control of the dubbing audio and the target audio, simplifying the audio synchronization operation, improving audio synchronization efficiency, and solving the problem in the related art that the audio synchronization operation is relatively complex.
As an alternative, as shown in fig. 4, the second dividing unit 306 includes:
1) an obtaining module 402, configured to obtain a first text set obtained by performing text conversion on a target audio, where each text segment included in the first text set is used to indicate text information corresponding to each object audio segment obtained by dividing the target audio according to a target time interval;
2) a processing module 404, configured to repeatedly perform the following steps until all text segments in the first text set are traversed:
s1, acquiring current text segments from the first text set;
s2, obtaining a target dubbing fragment corresponding to the current text fragment from the plurality of dubbing fragments;
s3, dividing the target dubbing fragment into a dubbing set;
and S4, acquiring the next text segment as the current text segment.
It should be noted that, in this embodiment, the text segment may include, but is not limited to, one or more words converted from the object audio segment. For example, assuming that the target time interval for acquiring the text segment is 1 second(s), the text segment includes text information corresponding to an audio segment played by the target audio within 1 s. The above is merely an example, and this is not limited in this embodiment.
For example, assume that the dubbing segments are divided in units of words. If the text segment corresponding to the object audio segment played by the target audio in the 1st second is "hello", the two target dubbing segments "you" and "good" are obtained, according to that text segment, from the plurality of word-divided dubbing segments, and are grouped into dubbing set A. Further, if the text segment corresponding to the object audio segment played in the 2nd second is "do", the target dubbing segment "do" is obtained from the word-divided dubbing segments according to that text segment and grouped into dubbing set B. By analogy, the plurality of dubbing sets corresponding to the dubbing audio are obtained.
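The traversal of steps S1 through S4 can be sketched as follows. The function and its greedy cursor over the recorded segments are illustrative assumptions; the sketch assumes the dubbing was recorded in the same word order as the target audio's text segments.

```python
def group_into_sets(text_segments, dubbing_segments):
    """Group word-level dubbing segments into one dubbing set per text segment.

    text_segments: list of word lists, one per target time interval (S1/S4).
    dubbing_segments: words of the dubbing audio in recording order.
    """
    sets, cursor = [], 0
    for words in text_segments:
        current = []
        for word in words:
            # S2: find the target dubbing segment matching this word
            if cursor < len(dubbing_segments) and dubbing_segments[cursor] == word:
                current.append(dubbing_segments[cursor])
                cursor += 1
        sets.append(current)  # S3: divide the matched segments into a set
    return sets

# "you"+"good" fall in the 1st-second grid (set A), "do" in the 2nd (set B).
sets = group_into_sets([["you", "good"], ["do"]], ["you", "good", "do"])
```

Each resulting set can then be compared against its one-interval target playing time length for the stretch/compress adjustment described later.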
Optionally, in this embodiment, the obtaining module 402 includes:
(1) the extraction submodule is used for extracting target audio from the target video file;
(2) the dividing submodule is used for dividing the target audio into a plurality of object audio fragments according to the target time interval;
(3) and the conversion sub-module is used for performing text conversion on the plurality of object audio clips to obtain a first text set, wherein the key audio clips are marked as the same audio clip in the object audio clips corresponding to the key audio clips under the condition that the playing duration of the key audio clips contained in the object audio clips is greater than the target time interval.
It should be noted that, in the present embodiment, in the process of converting the target audio into the text, the characters converted from the object audio segments divided according to the target time interval may be distributed to corresponding positions on the time axis, so as to obtain the text segments corresponding to the target time interval.
For example, assume that the target time interval is in seconds(s), the time axis is also in seconds(s), and the target audio time period is 1 minute. Then, the target audio may be divided into 60 object audio segments by taking s as a unit, and further, the object audio segments are respectively subjected to text conversion to obtain 60 text segments, which respectively correspond to 60 time lattices on the time axis, so as to obtain a first text set corresponding to the target audio.
According to the embodiment provided by the present application, by obtaining the first text set corresponding to the target audio and the plurality of dubbing sets corresponding to the dubbing audio, the text segments in the first text set are used to automatically align the dubbing sets with the target audio, thereby achieving automatic audio synchronization control and simplifying the audio synchronization operation.
As an alternative, the adjusting synchronization unit 310 includes:
1) the first adjusting module is used for stretching the dubbing playing time length used by the dubbing set to the target playing time length under the condition that the comparison result indicates that the dubbing playing time length used by the dubbing set is smaller than the target playing time length;
2) and the second adjusting module is used for compressing the dubbing playing time length used by the dubbing set to the target playing time length under the condition that the comparison result indicates that the dubbing playing time length used by the dubbing set is greater than the target playing time length.
Optionally, in this embodiment, the method further includes:
3) a third adjusting module, configured to, after adjusting the dubbing segments in the dubbing set according to the comparison result, lower the frequency of the dubbing segments in the dubbing set when the dubbing playing time duration used by the dubbing set is extended to the target playing time duration;
4) and the fourth adjusting module is used for increasing the frequency of the dubbing clips in the dubbing set under the condition that the dubbing playing time length used by the dubbing set is compressed to the target playing time length after the dubbing clips in the dubbing set are adjusted according to the comparison result.
It should be noted that, in this embodiment, the target playing time length may be, but is not limited to, a target time interval, and the target audio clip corresponding to the target playing time length may be, but is not limited to, an object audio clip corresponding to the target time interval. That is, whether the dubbing audio and the target audio are synchronous is determined by comparing the dubbing playing time length used by each dubbing set with the target time interval of the target audio. Further, when the comparison result indicates that the two are not synchronized, the two can be synchronized by adjusting the dubbing segments in the dubbing set.
Optionally, in this embodiment, the adjusting dubbing segments in the dubbing set may include, but are not limited to: and adjusting the dubbing playing time length used by the dubbing set. The adjusting the dubbing playing time length may include, but is not limited to:
1) the total dubbing play time length is adjusted. That is, the total dubbing playback time period is subjected to the overall stretching or compressing process so as to be equal to the target playback time period.
2) Adjusting the playing time length of each dubbing segment in the dubbing set so that the total dubbing playing time length equals the target playing time length. That is, the playing duration of each word in the dubbing set may be adjusted separately, or the playing duration of each phoneme of each word may be adjusted; controlling synchronization by adjusting units of smaller magnitude improves the adjustment precision.
It should be noted that, since many pronunciations have long trailing sounds, in this embodiment method 2) may apply, but is not limited to applying, a different adjustment to each dubbing segment. For example, the first dubbing segment is not compressed, the second dubbing segment is stretched by 10%, and the third dubbing segment is compressed by 30%, so that the dubbing playing time length of the whole dubbing set is adjusted flexibly.
Further, in this embodiment, since compressing the playing time may cause distortion of the dubbing audio, frequency adjustment is also performed on the dubbing segments in a dubbing set whose dubbing playing time length has been adjusted. The frequency adjustment of the dubbing segments may include, but is not limited to: adjusting the frequency of the speech indicated by the dubbing segments. For example, in the case that the dubbing playing time length used by the dubbing set is stretched to the target playing time length, the frequency of the speech indicated by the dubbing segments in the set is lowered; in the case that the dubbing playing time length used by the dubbing set is compressed to the target playing time length, the frequency of the speech indicated by the dubbing segments in the set is raised.
Through the embodiments provided by the present application, by adjusting the dubbing sets, not only is audio synchronization control realized, but the linking process between the dubbing sets is also optimized, ensuring that the transitions sound natural and realistic.
Each functional unit in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a dubbing file recorded for dubbing the target video file;
s2, dividing dubbing audio in the dubbing file into a plurality of dubbing fragments according to unit intervals;
s3, dividing a plurality of dubbing fragments into a plurality of dubbing sets according to the target audio extracted from the target video file;
s4, sequentially comparing the dubbing playing time length used by each dubbing set with a target playing time length, wherein the target playing time length is the audio playing time length used by a target audio clip corresponding to the dubbing set in the target audio;
s5, adjusting dubbing segments in the dubbing collection according to the comparison result so as to make the dubbing audio and the target audio play synchronously.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (11)

1. An audio synchronization method, comprising:
acquiring a dubbing file recorded for dubbing a target video file;
dividing dubbing audio in the dubbing file into a plurality of dubbing fragments according to unit intervals with characters or phonemes as units;
dividing the plurality of dubbing fragments into a plurality of dubbing sets according to the target audio extracted from the target video file;
sequentially comparing the dubbing playing time length used by each dubbing set with a target playing time length, wherein the target playing time length is the audio playing time length used by a target audio clip corresponding to the dubbing set in the target audio;
adjusting dubbing segments in the dubbing set according to the comparison result so as to enable the dubbing audio and the target audio to be played synchronously;
wherein the dividing the plurality of dubbing segments into a plurality of dubbing sets according to the target audio extracted from the target video file comprises:
acquiring a first text set obtained by performing text conversion on the target audio, wherein each text segment contained in the first text set is used for indicating text information corresponding to each object audio segment obtained by dividing the target audio according to a target time interval;
repeatedly executing the following steps until all text segments in the first text set are traversed:
acquiring a current text fragment from the first text set;
acquiring a target dubbing fragment corresponding to the current text fragment from the plurality of dubbing fragments;
dividing the target dubbing segments into a dubbing set;
and acquiring the next text segment as the current text segment.
2. The method of claim 1, wherein obtaining the first text set resulting from text conversion of the target audio comprises:
extracting the target audio from the target video file;
dividing the target audio into a plurality of object audio segments according to the target time interval;
and performing text conversion on the plurality of object audio clips to obtain the first text set, wherein the key audio clips are marked as the same audio clip in the object audio clips corresponding to the key audio clips when the playing duration of the key audio clips contained in the object audio clips reaches the next target time interval.
3. The method according to claim 1, wherein said adjusting the dubbing segments in the dubbing set according to the result of the comparison comprises:
stretching the dubbing playing time length used by the dubbing set to the target playing time length under the condition that the comparison result indicates that the dubbing playing time length used by the dubbing set is smaller than the target playing time length;
and compressing the dubbing playing time length used by the dubbing set to the target playing time length under the condition that the comparison result indicates that the dubbing playing time length used by the dubbing set is greater than the target playing time length.
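The stretching and compressing of claim 3 amount to resampling a dubbing set to a new length. A minimal sketch using linear interpolation is shown below; a production system would more likely use a pitch-preserving time-scale modification algorithm such as WSOLA, which this sketch deliberately does not attempt.

```python
def stretch(samples, factor):
    """Resample a mono clip to `factor` times its original length by
    linear interpolation (a naive stand-in for claim 3's
    stretching/compressing of a dubbing set)."""
    if not samples:
        return []
    n = max(1, round(len(samples) * factor))
    out = []
    for i in range(n):
        # position of output sample i in the input, spread over [0, len-1]
        pos = i * (len(samples) - 1) / (n - 1) if n > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

`factor > 1` stretches toward a longer target playing time length; `factor < 1` compresses toward a shorter one.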
4. The method according to claim 3, further comprising, after said adjusting the dubbing segments in the dubbing set according to the result of the comparison:
in the case that the dubbing playing time length used by the dubbing set is stretched to the target playing time length, lowering the frequency of the dubbing segments in the dubbing set;
and under the condition that the dubbing playing time length used by the dubbing set is compressed to the target playing time length, increasing the frequency of the dubbing segments in the dubbing set.
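The direction of claim 4's frequency adjustment matches the pitch change that naive resampling produces: stretching a clip to a longer duration scales every frequency component down by the same ratio, and compressing scales it up. A one-line helper for the implied scale factor (illustrative only; the claim does not specify how the frequency adjustment is computed):

```python
def frequency_after_stretch(f0, original_duration, new_duration):
    """Frequency scaling implied by naive resampling: a component at f0 Hz
    in a clip stretched from original_duration to new_duration ends up at
    f0 * original_duration / new_duration Hz."""
    return f0 * original_duration / new_duration
```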
5. The method according to any one of claims 1 to 4, wherein the dividing the dubbing audio in the dubbing file into a plurality of dubbing fragments in units of words or phonemes comprises:
performing text conversion on the dubbing audio to obtain a dubbing text;
dividing the dubbing audio into the plurality of dubbing fragments in units of the characters in the dubbing text; or, dividing the dubbing audio into the plurality of dubbing fragments in units of the phonemes of the words in the dubbing text.
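A deliberately naive sketch of the word-unit division in claim 5 is given below. It assigns each word of the dubbing text an equal share of the recording's duration; this uniform split is an assumption for illustration only, since a real implementation would take per-word timestamps from the speech recognizer that produced the dubbing text.

```python
def split_by_units(transcript_words, total_duration):
    """Divide dubbing audio into per-word fragments by assuming every
    word occupies an equal share of total_duration (hypothetical
    simplification of claim 5's word-unit division)."""
    if not transcript_words:
        return []
    unit = total_duration / len(transcript_words)
    # each fragment is (word, start_seconds, end_seconds)
    return [(w, i * unit, (i + 1) * unit) for i, w in enumerate(transcript_words)]
```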
6. An audio synchronization apparatus, comprising:
an acquisition unit, configured to acquire a dubbing file recorded for dubbing a target video file;
a first dividing unit, configured to divide the dubbing audio in the dubbing file into a plurality of dubbing fragments in units of words or phonemes;
the second dividing unit is used for dividing the dubbing fragments into a plurality of dubbing sets according to the target audio extracted from the target video file;
a comparison unit, configured to sequentially compare the dubbing playing time length used by each dubbing set with a target playing time length, where the target playing time length is the playing time length of the target audio clip in the target audio that corresponds to the dubbing set;
the adjusting synchronization unit is used for adjusting dubbing segments in the dubbing set according to the comparison result so as to enable the dubbing audio and the target audio to be played synchronously;
the second dividing unit includes:
an obtaining module, configured to obtain a first text set obtained by performing text conversion on the target audio, where each text segment included in the first text set is used to indicate text information corresponding to each object audio segment obtained by dividing the target audio according to a target time interval;
a processing module, configured to repeatedly perform the following steps until all text segments in the first text set are traversed:
acquiring a current text fragment from the first text set;
acquiring a target dubbing fragment corresponding to the current text fragment from the plurality of dubbing fragments;
dividing the target dubbing segments into a dubbing set;
and acquiring the next text segment as the current text segment.
7. The apparatus of claim 6, wherein the obtaining module comprises:
the extraction submodule is used for extracting the target audio from the target video file;
the dividing submodule is used for dividing the target audio into a plurality of object audio fragments according to the target time interval;
and the conversion sub-module is used for performing text conversion on the plurality of object audio clips to obtain the first text set, wherein, when the playing duration of a key audio clip contained in an object audio clip extends into the next target time interval, the key audio clip is marked as part of the same object audio clip.
8. The apparatus of claim 6, wherein the adjustment synchronization unit comprises:
a first adjusting module, configured to stretch the dubbing playing duration used by the dubbing set to the target playing duration when the comparison result indicates that the dubbing playing duration used by the dubbing set is smaller than the target playing duration;
and the second adjusting module is used for compressing the dubbing playing time length used by the dubbing set to the target playing time length under the condition that the comparison result indicates that the dubbing playing time length used by the dubbing set is greater than the target playing time length.
9. The apparatus of claim 8, further comprising:
a third adjusting module, configured to, after adjusting the dubbing segments in the dubbing set according to the comparison result, lower the frequencies of the dubbing segments in the dubbing set when the dubbing playing time duration used by the dubbing set is extended to the target playing time duration;
a fourth adjusting module, configured to, after adjusting the dubbing segments in the dubbing set according to the comparison result, increase the frequency of the dubbing segments in the dubbing set when the dubbing playing time duration used by the dubbing set is compressed to the target playing time duration.
10. The apparatus according to any one of claims 6 to 9, wherein the first dividing unit comprises:
the conversion module is used for performing text conversion on the dubbing audio to obtain a dubbing text;
the dividing module is used for dividing the dubbing audio into the plurality of dubbing fragments in units of the characters in the dubbing text; or, in units of the phonemes of the words in the dubbing text.
11. A computer-readable storage medium comprising a stored program, wherein the program, when executed by a computer device, performs the method of any one of claims 1 to 5.
CN201811616135.5A 2018-12-27 2018-12-27 Audio synchronization method and apparatus and storage medium Active CN109587543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811616135.5A CN109587543B (en) 2018-12-27 2018-12-27 Audio synchronization method and apparatus and storage medium


Publications (2)

Publication Number Publication Date
CN109587543A CN109587543A (en) 2019-04-05
CN109587543B true CN109587543B (en) 2021-04-02

Family

ID=65932138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811616135.5A Active CN109587543B (en) 2018-12-27 2018-12-27 Audio synchronization method and apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN109587543B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111741231B (en) 2020-07-23 2022-02-22 北京字节跳动网络技术有限公司 Video dubbing method, device, equipment and storage medium
CN112765397B (en) * 2021-01-29 2023-04-21 抖音视界有限公司 Audio conversion method, audio playing method and device

Citations (10)

Publication number Priority date Publication date Assignee Title
CN1239571A (en) * 1997-09-12 1999-12-22 日本放送协会 Sound processing method, sound processor, and recording/reproduction device
CN1428770A (en) * 2001-12-22 2003-07-09 Lg电子株式会社 Method for recording dubbing audio-frquency data on rewrite recording medium
CN101640057A (en) * 2009-05-31 2010-02-03 北京中星微电子有限公司 Audio and video matching method and device therefor
CN103428584A (en) * 2013-08-01 2013-12-04 珠海全志科技股份有限公司 Method and device for keeping synchronization of audio and video on multimedia playing platform
CN103442309A (en) * 2013-08-01 2013-12-11 珠海全志科技股份有限公司 Method and device for keeping audio and video synchronization by using speed conversion algorithm
CN105704538A (en) * 2016-03-17 2016-06-22 广东小天才科技有限公司 Method and system for generating audio and video subtitles
CN105827997A (en) * 2016-04-26 2016-08-03 厦门幻世网络科技有限公司 Method and device for dubbing audio and visual digital media
CN107181986A (en) * 2016-03-11 2017-09-19 百度在线网络技术(北京)有限公司 The matching process and device of video and captions
CN108269597A (en) * 2018-04-20 2018-07-10 杭州海泰电子有限公司 A kind of audio workstation management method and system
CN108337558A (en) * 2017-12-26 2018-07-27 努比亚技术有限公司 Audio and video clipping method and terminal

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8103511B2 (en) * 2008-05-28 2012-01-24 International Business Machines Corporation Multiple audio file processing method and system
WO2013190787A1 (en) * 2012-06-22 2013-12-27 ソニー株式会社 Reception device, and synchronous processing method therefor
US9324340B2 (en) * 2014-01-10 2016-04-26 Sony Corporation Methods and apparatuses for use in animating video content to correspond with audio content


Non-Patent Citations (1)

Title
A Brief Analysis of the Post-Production of Synchronous Sound Recording (浅析同期录音的后期制作); Tang Yeqi (唐业棋); Shiting (《视听》); 2015-09-30; full text *


Similar Documents

Publication Publication Date Title
CN105868397B (en) Song determination method and device
US6611803B1 (en) Method and apparatus for retrieving a video and audio scene using an index generated by speech recognition
CN105611404B (en) A kind of method and device automatically adjusting audio volume according to Video Applications scene
CN110675886B (en) Audio signal processing method, device, electronic equipment and storage medium
CN105448312B (en) Audio sync playback method, apparatus and system
CN110415723B (en) Method, device, server and computer readable storage medium for audio segmentation
CN109587543B (en) Audio synchronization method and apparatus and storage medium
KR20150057591A (en) Method and apparatus for controlling playing video
JP2006172432A (en) System and method for converting compact media format files to synchronized multimedia integration language
CN105898556A (en) Plug-in subtitle automatic synchronization method and device
CN106055659B (en) Lyric data matching method and equipment thereof
CN101106770A (en) A method for making shot animation with background music in mobile phone
CN110191368A (en) Video data acquiring and alignment schemes, device, electronic equipment and system
JP2019519869A (en) Audio fingerprinting based on audio energy characteristics
US20230290382A1 (en) Method and apparatus for matching music with video, computer device, and storage medium
CN109379633A (en) Video editing method, device, computer equipment and readable storage medium storing program for executing
CN110312161B (en) Video dubbing method and device and terminal equipment
CN110797001B (en) Method and device for generating voice audio of electronic book and readable storage medium
CN104253943B (en) Use the video capture method and apparatus of mobile terminal
JP2011166386A (en) Image display device, image display method, and image display program
CN111105816A (en) Man-machine interactive software screen recording method
CN113572977B (en) Video production method and device
CN114842858A (en) Audio processing method and device, electronic equipment and storage medium
KR20230093683A (en) Apparatus and method for editing an image data using an artificial intelligence automatically in the image editing apparatus
CN114339451A (en) Video editing method and device, computing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant