CN116612784A - Audio clipping method and device - Google Patents
- Publication number
- CN116612784A (application CN202310741308.0A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio segment
- sub
- segment
- coordinate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
Abstract
The embodiment of the invention provides an audio clipping method and device. First sub-audio segments in a first audio segment are sequentially matched with second sub-audio segments of the same order in a second audio segment, and a start coordinate of an audio segment to be clipped and a non-aligned point coordinate are determined respectively based on the first sub-audio segment and the second sub-audio segment that fail to match. A third sub-audio segment after the non-aligned point is then sequentially matched with one or more fourth sub-audio segments after the start coordinate, and a termination coordinate of the audio segment to be clipped is determined based on the fourth sub-audio segment that matches successfully. The first audio segment is located and clipped based on the start coordinate and the termination coordinate of the audio segment to be clipped in the first audio segment, which improves the efficiency of locating and clipping the differing audio content in the audio.
Description
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio clipping method and device.
Background
A prerequisite for dubbing a film or television work is that the work has an international sound track matching the video medium to be released online. The international sound refers to all sounds in the work except the dialogue in its original production language; it mainly comprises the music and sound effects of the work, and may also include dialogue in languages other than that of the country of production. For review reasons, the international sound provided by many film and television content providers is the original version, which does not match the approved version of the video medium: it contains extra international-sound audio clips corresponding to video segments that were deleted.
It is therefore necessary to clip the international sound provided by the content provider and cut out a number of audio clips from it, a process also called international sound repair. The traditional repair method is to manually compare the international sound, the approved video, and the mixed-track audio (audio extracted from the approved video medium) using audio processing software, completing the repair through visual comparison, audio segmentation and deletion, and re-checking against the video edit points; this is inefficient and costly.
Disclosure of Invention
The embodiment of the invention aims to provide an audio clipping method and device so as to improve the efficiency of locating and clipping the differing audio content in audio. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided an audio clipping method, the method comprising:
acquiring a first audio segment and a second audio segment; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video;
respectively extracting audio fingerprints of the first audio segment and the second audio segment to respectively obtain audio fingerprints of each audio frame in the first audio segment and the second audio segment;
sequentially determining, taking a first audio frame number as the segment length, one or more first sub-audio segments in the first audio segment and one or more second sub-audio segments in the second audio segment; sequentially matching first and second sub-audio segments of the same order until the ith first sub-audio segment and the ith second sub-audio segment fail to match; determining the start coordinate of an audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the ith first sub-audio segment, and determining the non-aligned point coordinate in the second audio segment based on the coordinates of the audio frames in the ith second sub-audio segment; wherein a successful match between a first sub-audio segment and a second sub-audio segment means that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets a preset condition;
determining, taking a second audio frame number as the segment length, a third sub-audio segment in the second audio segment located after the non-aligned point coordinate, and one or more fourth sub-audio segments in the first audio segment located after the start coordinate; sequentially matching the third sub-audio segment with the one or more fourth sub-audio segments until a fourth sub-audio segment that successfully matches the third sub-audio segment is determined, and determining the termination coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the successfully matched fourth sub-audio segment;
and clipping the first audio segment based on the start coordinate and the termination coordinate of the audio segment to be clipped in the first audio segment, to obtain a standard-version audio segment matching the non-complete version of video.
Optionally, whether the first sub-audio segment and the second sub-audio segment are successfully matched is determined based on the following manner:
performing cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
if yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the first sub-audio segment and the second sub-audio segment are not successfully matched.
Optionally, the step of determining the start coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the ith first sub-audio segment and determining the non-aligned point coordinate in the second audio segment based on the coordinates of the audio frames in the ith second sub-audio segment includes:
taking a third audio frame number as the shortening step, synchronously shortening the ith first sub-audio segment and the ith second sub-audio segment, and matching the shortened first sub-audio segment with the shortened second sub-audio segment until they are successfully matched; taking the coordinate of the last audio frame in the successfully matched, shortened ith first sub-audio segment as the start coordinate of the audio segment to be clipped in the first audio segment, and taking the coordinate of the last audio frame in the successfully matched, shortened ith second sub-audio segment as the non-aligned point coordinate in the second audio segment.
Optionally, the method further comprises:
determining the portion of the first audio segment after the termination coordinate as a new first audio segment, and the portion of the second audio segment after the non-aligned point coordinate as a new second audio segment; returning to the step of sequentially determining one or more first sub-audio segments in the first audio segment and one or more second sub-audio segments in the second audio segment with the first audio frame number as the segment length and sequentially matching first and second sub-audio segments of the same order, so as to obtain the start coordinate and termination coordinate of an audio segment to be clipped determined for the new first audio segment; and returning to the step of determining the first audio segment after the current termination coordinate as the new first audio segment and the second audio segment after the current non-aligned point coordinate as the new second audio segment, until the remaining first sub-audio segments and second sub-audio segments are all successfully matched.
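As a rough illustration only, this iterative re-application might be organised as in the following Python sketch. The helper find_cut_interval() is hypothetical (it stands in for the matching steps described above and is assumed to return the start coordinate, termination coordinate, and non-aligned point coordinate of one audio segment to be clipped, or None when the remaining segments all match); none of these names come from the patent itself.

```python
def locate_all_cut_intervals(first_fps, second_fps, find_cut_interval):
    """Repeatedly locate audio segments to be clipped: after each interval is
    found, the part of the first audio after the termination coordinate and
    the part of the second audio after the non-aligned point are treated as
    the new first/second audio segments, until everything matches."""
    intervals = []
    offset_first, offset_second = 0, 0
    while True:
        result = find_cut_interval(first_fps[offset_first:],
                                   second_fps[offset_second:])
        if result is None:                 # remaining segments all match
            break
        start, end, non_aligned = result   # coordinates relative to the remainder
        intervals.append((offset_first + start, offset_first + end))
        offset_first += end                # new first audio segment starts here
        offset_second += non_aligned       # new second audio segment starts here
    return intervals                       # absolute [start, end) frame intervals
```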
Optionally, the step of clipping the first audio segment based on the start coordinate and the end coordinate of the audio segment to be clipped in the first audio segment includes:
for each determined audio segment to be clipped, clipping the first audio segment according to the start coordinate and the termination coordinate corresponding to that audio segment to be clipped.
Optionally, the complete version of video is the version submitted for review, the non-complete version of video is the approved version obtained after the submitted version has been cut, the standard-version audio segment is the standard international sound corresponding to the complete version of video, and the non-standard-version audio segment is the international sound separated from the approved version of video.
In a second aspect of the present invention, there is also provided an audio clipping apparatus, including:
the acquisition module is used for acquiring the first audio segment and the second audio segment; the first audio segment is a standard-version audio segment corresponding to the complete version of video, and the second audio segment is a non-standard-version audio segment separated from the non-complete version of video;
the extraction module is used for extracting the audio fingerprints of the first audio segment and the second audio segment respectively to obtain the audio fingerprints of each audio frame in the first audio segment and the second audio segment respectively;
the first matching module is used for sequentially determining, taking a first audio frame number as the segment length, one or more first sub-audio segments in the first audio segment and one or more second sub-audio segments in the second audio segment, sequentially matching first and second sub-audio segments of the same order until the ith first sub-audio segment and the ith second sub-audio segment fail to match, determining the start coordinate of an audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the ith first sub-audio segment, and determining the non-aligned point coordinate in the second audio segment based on the coordinates of the audio frames in the ith second sub-audio segment; a successful match between a first sub-audio segment and a second sub-audio segment means that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets a preset condition;
a second matching module, configured to determine, taking a second audio frame number as the segment length, a third sub-audio segment in the second audio segment located after the non-aligned point coordinate and one or more fourth sub-audio segments in the first audio segment located after the start coordinate, match the third sub-audio segment with the one or more fourth sub-audio segments in sequence until a fourth sub-audio segment that successfully matches the third sub-audio segment is determined, and determine the termination coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the successfully matched fourth sub-audio segment;
and the clipping module is used for clipping the first audio segment based on the starting coordinate and the ending coordinate of the audio segment to be clipped in the first audio segment to acquire the standard audio segment matched with the incomplete version video.
Optionally, the first matching module includes:
the judging unit is used for carrying out cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
If yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the first sub-audio segment and the second sub-audio segment are not successfully matched.
In a third aspect of the present invention, there is provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory perform communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the audio clipping methods described above when executing the program stored in the memory.
In a fourth aspect of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements any of the above-described audio clipping methods.
In a fifth aspect of the invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above described audio clipping methods.
According to the audio clipping method provided by the embodiment of the invention, one or more first sub-audio segments are sequentially determined in the first audio segment, one or more second sub-audio segments are sequentially determined in the second audio segment, and first and second sub-audio segments of the same order are matched in turn; the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the first sub-audio segment that fails to match, and the non-aligned point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the second sub-audio segment that fails to match. A third sub-audio segment is then determined in the second audio segment after the non-aligned point coordinate, one or more fourth sub-audio segments are determined in the first audio segment after the start coordinate, the third sub-audio segment is matched in turn against the fourth sub-audio segments, the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the fourth sub-audio segment that matches successfully, and the first audio segment is clipped accordingly to obtain a standard-version audio segment. By applying the audio clipping method provided by the embodiment of the invention, the audio segment to be clipped is located by matching sub-audio segments of the first audio segment and the second audio segment, so the audio content that the first audio segment has but the second audio segment lacks can be located and clipped without manually comparing the two audio segments, yielding a standard-version audio segment adapted to the non-complete version of video. This improves the efficiency of locating and clipping the differing audio content, and, when applied to international sound repair, improves repair efficiency and reduces repair cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic flow chart of an audio clipping method according to an embodiment of the present invention;
FIG. 2 is an exemplary diagram of a first audio segment and a second audio segment provided by an embodiment of the present invention;
FIG. 3 is an exemplary diagram of a start coordinate of an audio segment to be cut in a first audio segment and a non-aligned point coordinate in a second audio segment according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of start and end coordinates, and non-aligned point coordinates of an audio segment to be cropped, provided by an embodiment of the present invention;
FIG. 5 is an exemplary diagram of a cross-correlation calculation process provided by an embodiment of the present invention;
fig. 6 is an exemplary diagram of a matching process of a first sub-audio segment and a second sub-audio segment provided by an embodiment of the present invention;
FIG. 7 is another exemplary diagram of a first audio segment and a second audio segment provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of an audio clipping method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an audio clipping apparatus according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
In order to solve the problem of low efficiency in repairing international sound by a manual method at present, an embodiment of the present invention provides an audio clipping method, and fig. 1 is a schematic flow diagram of the audio clipping method provided by the embodiment of the present invention, as shown in fig. 1, the method specifically includes the following steps:
step S101: acquiring a first audio segment and a second audio segment; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video.
Step S102: and respectively extracting the audio fingerprints of the first audio segment and the second audio segment to respectively obtain the audio fingerprints of each audio frame in the first audio segment and the second audio segment.
Step S103: taking a first audio frame number as the segment length, sequentially determining one or more first sub-audio segments in the first audio segment and one or more second sub-audio segments in the second audio segment; sequentially matching first and second sub-audio segments of the same order until the ith first sub-audio segment and the ith second sub-audio segment fail to match; determining the start coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the ith first sub-audio segment, and determining the non-aligned point coordinate in the second audio segment based on the coordinates of the audio frames in the ith second sub-audio segment. A successful match between a first sub-audio segment and a second sub-audio segment means that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets a preset condition.
Step S104: taking a second audio frame number as the segment length, determining a third sub-audio segment in the second audio segment located after the non-aligned point coordinate and one or more fourth sub-audio segments in the first audio segment located after the start coordinate; sequentially matching the third sub-audio segment with the one or more fourth sub-audio segments until a fourth sub-audio segment that successfully matches the third sub-audio segment is determined, and determining the termination coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the successfully matched fourth sub-audio segment.
Step S105: and cutting the first audio segment based on the starting coordinate and the ending coordinate of the audio segment to be cut in the first audio segment to obtain a standard audio segment matched with the non-complete version of video.
According to the audio clipping method provided by the embodiment of the invention, one or more first sub-audio segments are sequentially determined in the first audio segment, one or more second sub-audio segments are sequentially determined in the second audio segment, and first and second sub-audio segments of the same order are matched in turn; the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the first sub-audio segment that fails to match, and the non-aligned point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the second sub-audio segment that fails to match. A third sub-audio segment is then determined in the second audio segment after the non-aligned point coordinate, one or more fourth sub-audio segments are determined in the first audio segment after the start coordinate, the third sub-audio segment is matched in turn against the fourth sub-audio segments, the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the fourth sub-audio segment that matches successfully, and the first audio segment is clipped accordingly to obtain a standard-version audio segment. By applying this method, the audio segment to be clipped is located by matching sub-audio segments of the first audio segment and the second audio segment, so the audio content that the first audio segment has but the second audio segment lacks can be located and clipped without manually comparing the two audio segments, yielding a standard-version audio segment adapted to the non-complete version of video. This improves the efficiency of locating and clipping the differing audio content, and, when applied to international sound repair, improves repair efficiency and reduces repair cost.
The following describes the foregoing steps S101 to S105 in detail:
in step S101, the non-complete version of video is missing part of the video content compared with the complete version of video. The first audio segment is specifically an audio segment adapted to the complete version of video, and the second audio segment is an audio segment separated from the non-complete version of video by a separation algorithm. For example, the first audio segment may be the voice audio or background-sound audio adapted to the complete version of video, and the second audio segment may be the voice audio or background-sound audio separated from the non-complete version of video by a voice/background-sound separation algorithm.
Because the accuracy of the separation algorithm is limited, the second audio segment may contain some noise and its audio quality may not meet the requirements. In addition, the first audio segment contains audio content that the second audio segment lacks; this extra content corresponds to the video content that the complete version of video has but the non-complete version does not.
Fig. 2 is an exemplary diagram of a first audio segment and a second audio segment provided in an embodiment of the present invention. As shown in Fig. 2, the first audio segment consists of standard audio segments m, k, and n, and the second audio segment consists of non-standard audio segments m' and n'. Audio segments m' and m correspond to the same audio content but differ in audio quality, and likewise for audio segments n' and n; in addition, compared with the first audio segment, the second audio segment lacks the audio content corresponding to audio segment k.
In step S102, an audio fingerprint of each audio frame in the first audio segment and the second audio segment is extracted, where the audio fingerprint of each audio frame is specifically an array determined based on acoustic features of the audio frame.
Regarding the method for extracting the audio fingerprint, reference may be made to the content in the related art, and embodiments of the present invention are not limited thereto. As an example, the audio fingerprint extraction may be performed on the first audio segment and the second audio segment by Shazam (an audio fingerprint extraction algorithm).
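The patent does not fix a particular fingerprinting scheme; the following NumPy sketch shows one simple, assumed per-frame fingerprint (normalised spectral band energies) purely as an illustrative stand-in for a robust method such as Shazam-style fingerprinting. The frame length and band count are arbitrary choices, not values from the patent.

```python
import numpy as np

def frame_fingerprints(samples, sample_rate=16000, frame_ms=20, n_bands=8):
    """Split audio into fixed-length frames and compute a coarse fingerprint
    (normalised spectral band energies) for each frame; each fingerprint is
    an array derived from the acoustic features of that frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    fps = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        bands = np.array_split(spectrum, n_bands)    # coarse frequency bands
        energy = np.array([b.sum() for b in bands])
        fps.append(energy / (energy.sum() + 1e-9))   # normalise per frame
    return np.array(fps)                             # shape: (n_frames, n_bands)
```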
In step S103, taking a first audio frame number as the segment length, one or more first sub-audio segments are sequentially determined in the first audio segment and one or more second sub-audio segments are sequentially determined in the second audio segment. Specifically, the first audio segment and the second audio segment are segmented based on their audio frames, and each first sub-audio segment or second sub-audio segment consists of that number of audio frames. The first audio frame number may be set based on actual requirements.
As an example, if the first audio frame number is set to 10, the 1st to 10th audio frames in the first audio segment form the first first sub-audio segment, the 11th to 20th audio frames form the second first sub-audio segment, and so on; second sub-audio segments are determined in the second audio segment in the same way.
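A minimal sketch of this grouping, assuming per-frame fingerprints such as those produced above and a first audio frame number of 10:

```python
def split_into_subsegments(frame_fps, seg_len=10):
    """Group per-frame fingerprints into consecutive sub-audio segments of
    seg_len frames each: frames 1-10 form the first sub-audio segment,
    frames 11-20 the second, and so on. Trailing frames that do not fill a
    whole segment are ignored here for simplicity."""
    n_segs = len(frame_fps) // seg_len
    return [frame_fps[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]
```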
In the matching process, first sub-audio segments and second sub-audio segments of the same order need to be matched in sequence until the ith first sub-audio segment and the ith second sub-audio segment fail to match. That is, the first first sub-audio segment is matched with the first second sub-audio segment; if the matching succeeds, the second first sub-audio segment is matched with the second second sub-audio segment, and so on, until a first sub-audio segment and a second sub-audio segment that do not match successfully are found.
A successful match between a first sub-audio segment and a second sub-audio segment specifically means that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets a preset condition.
When calculating the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment, the audio fingerprints of each audio frame in the first sub-audio segment can be spliced to obtain one array, the audio fingerprints of each audio frame in the second sub-audio segment can be spliced to obtain another array, and the similarity between the two arrays is calculated. For a specific calculation manner of the similarity between audio fingerprints, reference may be made to the content in the related art.
The preset condition of the similarity used when the first sub-audio segment and the second sub-audio segment are successfully matched can be selected according to actual requirements, and the embodiment of the invention is not limited to this. For example, a similarity threshold may be preset, and if the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment is not less than the similarity threshold, the matching is considered to be successful.
After it is determined that the ith first sub-audio segment and the ith second sub-audio segment do not match successfully, the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the ith first sub-audio segment, and the non-aligned point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the ith second sub-audio segment.
The non-aligned point in the second audio segment can be understood as follows: the second sub-audio segments before this point correspond to, and can be successfully matched with, the first sub-audio segments of the same order in the first audio segment, while the second sub-audio segments after this point are no longer aligned with, and cannot be successfully matched with, the first sub-audio segments of the same order. For example, the junction between audio segment m' and audio segment n' in the second audio segment illustrated in Fig. 2 is a non-aligned point.
With respect to the coordinates of the audio frames: each audio frame in the first audio segment and the second audio segment typically has a fixed time length, so the start and end coordinates of each audio frame in the first and second audio segments can be considered known. As an example, if the time length of each audio frame is 20 ms, the start and end coordinates of the first audio frame in the first audio segment are 0 ms and 20 ms respectively, those of the second audio frame are 20 ms and 40 ms, and so on.
As an example, since the time length of each audio frame is fixed, the sequence number of the audio frame can also be used to characterize the time point in the audio, the starting coordinate of the audio piece to be clipped in the first audio piece and the non-aligned point coordinate in the second audio piece can also be characterized by the sequence number of the audio frame, for example, the starting coordinate of the audio piece to be clipped in the first audio piece is the 100 th frame in the first audio piece.
As an example, the starting coordinate of the audio piece to be clipped in the first audio piece and the non-aligned point coordinate in the second audio piece may also be characterized by the time of the audio frame, e.g. the starting coordinate of the audio piece to be clipped in the first audio piece is 100s in the first audio piece.
As described above, the first audio segment and the second audio segment each comprise a plurality of audio frames, and which audio frame's coordinate is used to determine the start coordinate of the audio segment to be clipped and the non-aligned point coordinate can be chosen based on actual requirements. As an example, the start coordinate of the audio segment to be clipped in the first audio segment may be determined based on the coordinate of the first audio frame in the ith first sub-audio segment, and the non-aligned point coordinate in the second audio segment based on the coordinate of the first audio frame in the ith second sub-audio segment. If the first audio frame number is set to 10, the first first sub-audio segment matches the first second sub-audio segment successfully, and the second first sub-audio segment fails to match the second second sub-audio segment, then the start coordinate of the audio segment to be clipped in the first audio segment is the 11th frame, and the non-aligned point coordinate in the second audio segment is likewise the 11th frame.
As an example, the first audio frame number may also be set to 1 frame, so that each first sub-audio segment and each second sub-audio segment is a single audio frame. In that case, each audio frame in the first audio segment is matched in turn with the audio frame of the same order in the second audio segment until the ith audio frame in the first audio segment fails to match the ith audio frame in the second audio segment; the start coordinate of the audio segment to be clipped in the first audio segment is then the ith audio frame, and the non-aligned point coordinate in the second audio segment is the ith audio frame.
The following description uses a specific example. Fig. 3 is an exemplary diagram of the start coordinate of the audio segment to be clipped in the first audio segment and the non-aligned point coordinate in the second audio segment provided by an embodiment of the present invention; the hatched portion in Fig. 3 is the audio content that the first audio segment has but the second audio segment lacks, and corresponds to audio segment k shown in Fig. 2.
As shown in Fig. 3, the first sub-audio segment A in the first audio segment is first matched with the second sub-audio segment A' in the second audio segment; if the matching succeeds, the first sub-audio segment B is matched with the second sub-audio segment B'; if that also succeeds, matching continues in this way. When a pair fails to match, here the first sub-audio segment C and the second sub-audio segment C', the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the first sub-audio segment C, namely x11 in the figure, and the non-aligned point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the second sub-audio segment C', namely x1 in the figure.
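The forward matching illustrated by Fig. 3 can be sketched as follows. Here segments_match is any predicate implementing the preset similarity condition (for instance the cross-correlation check sketched further below), and returning the first frame of the failing pair is only one of the coordinate choices the text allows; the shortening refinement described later is omitted.

```python
def find_start_and_nonaligned_point(first_segs, second_segs, seg_len,
                                    segments_match):
    """Walk the two lists of sub-audio segments in lock-step until the i-th
    pair fails to match; return the frame index of the first frame of that
    pair as the start coordinate (first audio) and the non-aligned point
    coordinate (second audio)."""
    for i, (a, b) in enumerate(zip(first_segs, second_segs)):
        if not segments_match(a, b):
            start_coord = i * seg_len        # start of audio to be clipped
            nonaligned_coord = i * seg_len   # non-aligned point
            return start_coord, nonaligned_coord
    return None, None                        # fully aligned, nothing to clip
```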
In step S104, after the start coordinate of the audio segment to be clipped in the first audio segment has been determined, its termination coordinate in the first audio segment is also determined, so that the audio segment to be clipped can be located.
After the start coordinate of the audio segment to be clipped in the first audio segment and the non-aligned point coordinate in the second audio segment have been determined in step S103, a third sub-audio segment can be determined in the second audio segment after the non-aligned point based on a second audio frame number; that is, the second audio frame number of audio frames located after the non-aligned point are taken as the third sub-audio segment. Similarly, one or more fourth sub-audio segments can be sequentially determined in the first audio segment after the start coordinate. For example, if the second audio frame number is set to 3 and the non-aligned point coordinate is the 11th frame, the 11th to 13th frames in the second audio segment form the third sub-audio segment; if the start coordinate is the 11th frame, the 11th to 13th frames in the first audio segment form the first fourth sub-audio segment, the 14th to 16th frames form the second fourth sub-audio segment, and so on.
On this basis, the third sub-audio segment is matched with the one or more fourth sub-audio segments in sequence until a fourth sub-audio segment that successfully matches the third sub-audio segment is determined, and the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in that fourth sub-audio segment. As an example, the termination coordinate may be determined based on the coordinate of the first audio frame in that fourth sub-audio segment.
The second audio frame number can be configured according to actual requirements. As an example, to improve accuracy of the positioning result, the second audio frame number may be set to a value smaller than the first audio frame number, for example, the second audio frame number may be 3 frames or 1 frame.
As an example, the number of second audio frames may be 1 frame, the first audio frame in the second audio segment after the non-aligned point is the third sub-audio segment, the first audio frame in the first audio segment after the start coordinate is the first fourth sub-audio segment, the second audio frame is the second fourth sub-audio segment, and so on.
The following description refers to Fig. 4, which is an exemplary diagram of the start and termination coordinates of the audio segment to be clipped and of the non-aligned point coordinate provided by an embodiment of the present invention. Fig. 4 corresponds to Fig. 2; the hatched portion in Fig. 4 is the audio content that the first audio segment has but the second audio segment lacks, corresponding to audio segment k shown in Fig. 2.
Fig. 4 shows the start coordinate x11 of the audio segment to be clipped in the first audio segment and the non-aligned point coordinate x1 in the second audio segment determined in step S103. Taking the second audio frame number as the segment length, a third sub-audio segment a' is determined in the second audio segment after the non-aligned point coordinate x1, and a first fourth sub-audio segment a is determined in the first audio segment after the start coordinate x11. The third sub-audio segment a' is matched with the fourth sub-audio segment a; if the matching fails, a' is matched with the fourth sub-audio segment b; if that also fails, a' is matched with the fourth sub-audio segment c, and if that matching succeeds, the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the fourth sub-audio segment c. For example, the termination coordinate x12 may be determined based on the coordinate of the first audio frame in the fourth sub-audio segment c.
In the example shown in fig. 4, if the number of the second audio frames is 1 frame, the third sub-audio segment a 'can be understood as the first audio frame in the audio segment n' in the second audio segment shown in fig. 2, and the fourth sub-audio segment a can be understood as the first audio frame in the audio segment k in the first audio segment.
Similarly, the fourth sub-audio segment c can be understood as the first audio frame in audio segment n of the first audio segment shown in Fig. 2, so the third sub-audio segment a' and the fourth sub-audio segment c can be matched successfully; once they match, the termination coordinate x12 of the audio segment to be clipped in the first audio segment can be determined based on the coordinates of the audio frames in the fourth sub-audio segment c, thereby locating the audio segment to be clipped.
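A sketch of this termination-coordinate search, assuming frame-index coordinates, per-frame fingerprints as above, and the same assumed segments_match predicate; probe_len plays the role of the second audio frame number.

```python
def find_end_coordinate(first_fps, second_fps, start_coord, nonaligned_coord,
                        probe_len, segments_match):
    """Take one probe segment (the third sub-audio segment) of probe_len
    frames after the non-aligned point in the second audio, and compare it
    against successive candidate segments (the fourth sub-audio segments)
    after the start coordinate in the first audio until one matches; the
    first frame of that candidate is used as the termination coordinate."""
    probe = second_fps[nonaligned_coord:nonaligned_coord + probe_len]
    pos = start_coord
    while pos + probe_len <= len(first_fps):
        candidate = first_fps[pos:pos + probe_len]
        if segments_match(candidate, probe):
            return pos                 # termination coordinate (kept frame)
        pos += probe_len               # next fourth sub-audio segment
    return None                        # no matching fourth sub-audio segment
```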
As with the matching of first and second sub-audio segments, the third sub-audio segment and a fourth sub-audio segment can be considered successfully matched when the similarity between the audio fingerprint corresponding to the third sub-audio segment and the audio fingerprint corresponding to the fourth sub-audio segment meets a preset condition. The embodiment of the present invention does not limit this preset condition. As an example, a similarity threshold for the third and fourth sub-audio segments may be set in advance; if the similarity between the audio fingerprint corresponding to the third sub-audio segment and that of a certain fourth sub-audio segment is not less than this threshold, the two may be considered successfully matched.
In the embodiment of the invention, if the third sub-audio segment and a fourth sub-audio segment correspond to different audio content, the similarity between their audio fingerprints is low and does not exceed the predetermined threshold, i.e. they cannot be matched successfully.
If the third sub-audio segment and a fourth sub-audio segment correspond to the same audio content and differ only in whether they are the standard version, the similarity between their audio fingerprints is high and exceeds the predetermined threshold, i.e. they can be matched successfully.
In the embodiment of the present invention, after the start coordinate and the termination coordinate of the audio segment to be clipped in the first audio segment have been obtained through steps S101 to S104, they are regarded as the start and termination coordinates of the audio content that the first audio segment has but the second audio segment lacks, i.e. of the differing audio segment between the two. In the example of Fig. 2, the located audio segment to be clipped is audio segment k, i.e. the portion to be cut out.
Thus, in step S105, by clipping the audio segment to be clipped out of the first audio segment, an audio segment that is adapted to the non-complete version of video and meets the audio quality requirement, i.e. a standard-version audio segment, can be obtained.
In step S105, a location interval of the audio segment to be clipped may be determined based on the start coordinate and the end coordinate of the audio segment to be clipped in the first audio segment, and clipping is performed on the first audio segment based on the location interval.
Specifically, the audio frames in the first audio segment between the start coordinate and the termination coordinate can be regarded as the content of the audio segment to be clipped.
As an example, when coordinates are represented by audio frame numbers, if the start coordinate of the audio segment to be clipped in the first audio segment is x11 and the termination coordinate is x12, the position interval of the audio segment to be clipped in the first audio segment is [x11, x12), and the standard-version audio segment adapted to the non-complete version of video is obtained by cutting the audio frames in [x11, x12) out of the first audio segment. The interval excludes its right endpoint because the termination coordinate x12 is determined from the coordinates of the audio frames in the fourth sub-audio segment that successfully matched the third sub-audio segment; since that fourth sub-audio segment matched, its audio frames do not belong to the content to be clipped.
For example, if the start coordinate of the audio segment to be clipped in the first audio segment is the 100th frame and the termination coordinate is the 200th frame, the position interval of the audio segment to be clipped is [100, 200), indicating that frames 100 to 199 of the first audio segment need to be cut out to obtain the standard-version audio segment adapted to the non-complete version of video.
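Once the half-open interval is known, clipping amounts to removing those frames from the first audio segment; the following sketch works on raw samples, assuming coordinates expressed as frame indices and a fixed frame length in samples. When several intervals have been located, removing them from the last to the first keeps the earlier coordinates valid.

```python
import numpy as np

def clip_interval(samples, start_frame, end_frame, frame_len):
    """Cut the half-open frame interval [start_frame, end_frame) out of the
    first audio segment; the frame at end_frame matched the third sub-audio
    segment and is therefore kept. frame_len converts frame indices to
    sample indices."""
    start = start_frame * frame_len
    end = end_frame * frame_len
    return np.concatenate([samples[:start], samples[end:]])
```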
According to the audio clipping method provided by the embodiment of the invention, one or more first sub-audio segments are sequentially determined in the first audio segment, one or more second sub-audio segments are sequentially determined in the second audio segment, and first and second sub-audio segments of the same order are matched in turn; the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the first sub-audio segment that fails to match, and the non-aligned point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the second sub-audio segment that fails to match. A third sub-audio segment is then determined in the second audio segment after the non-aligned point coordinate, one or more fourth sub-audio segments are determined in the first audio segment after the start coordinate, the third sub-audio segment is matched in turn against the fourth sub-audio segments, the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the fourth sub-audio segment that matches successfully, and the first audio segment is clipped accordingly to obtain a standard-version audio segment. By applying this method, the audio segment to be clipped is located by matching sub-audio segments of the first audio segment and the second audio segment, so the audio content that the first audio segment has but the second audio segment lacks can be located and clipped without manually comparing the two audio segments, yielding a standard-version audio segment adapted to the non-complete version of video. This improves the efficiency of locating and clipping the differing audio content, and, when applied to international sound repair, improves repair efficiency and reduces repair cost.
In one embodiment of the invention, the complete version of video is the version submitted for review, the non-complete version of video is the approved version obtained after the submitted version has been cut, the standard-version audio segment is the standard international sound corresponding to the complete version of video, and the non-standard-version audio segment is the international sound separated from the approved version of video.
In the embodiment of the invention, the standard-version international sound may be provided by the film or television content provider together with the original video, i.e. the version submitted for review. Voice/background-sound separation is performed on the audio corresponding to the approved-version video medium, and the separated background sound is the non-standard-version audio segment.
Therefore, by processing the standard international sound corresponding to the submitted version of video and the international sound separated from the approved version of video according to steps S101 to S104, the audio segment to be clipped, namely the position interval within the standard international sound of the international sound corresponding to the deleted video segments, can be determined, and the original standard international sound can be clipped based on that interval to obtain a standard international sound matching the approved version of video. The process of turning the original international sound provided by the content provider into an international sound adapted to the approved version of video is generally referred to as repairing the international sound.
Thus, the embodiment of the invention does not require manually comparing the audio corresponding to the submitted version of video with that of the approved version; it can automatically locate and clip the audio to be cut from the standard international sound corresponding to the submitted version of video, obtain a standard international sound matching the approved version of video, and complete the international sound repair, which improves repair efficiency and reduces repair cost.
In one embodiment of the present invention, it may be determined whether the first sub-audio piece and the second sub-audio piece are successfully matched based on the following manner:
performing cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
if yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the first sub-audio segment and the second sub-audio segment are not successfully matched.
As described above, each first sub-audio segment and each second sub-audio segment contains the first audio frame number of audio frames. If the first audio frame number is N, performing a cross-correlation operation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment yields a similarity sequence with 2N-1 terms. If the coordinate of the similarity peak in this sequence is the center coordinate of the sequence, i.e. the peak is the Nth term, the first and second sub-audio segments are considered successfully matched; otherwise the matching is unsuccessful.
For how the cross-correlation operation is performed on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment, reference may be made to the related art; the operation is briefly described below with a specific example.
Fig. 5 is an exemplary diagram of a cross-correlation calculation process provided in an embodiment of the present invention, for convenience of description, specifically taking the number of the first audio frames as 3 frames as an example, as shown in fig. 5, three audio frames in the first sub-audio segment are marked with 0,1,2, corresponding audio fingerprints are marked with x0, x1, x2, respectively, three audio frames in the second audio segment are marked with 3,4,5, and corresponding audio features are marked with x3, x4, x5, respectively. The cross-correlation operation is performed on the audio fingerprints corresponding to the first sub-audio segment and the audio fingerprints corresponding to the second sub-audio segment shown in fig. 5, which can be specifically understood as sliding the second sub-audio segment on the position where the first sub-audio segment is located by taking the first sub-audio segment as a reference, and for the audio frames sliding on the second sub-audio segment to the position of the first sub-audio segment, calculating the similarity between the audio fingerprints corresponding to the audio frames and the audio fingerprints corresponding to the first audio segment, and in order to achieve that the data length of the audio features is consistent when the similarity is calculated, zero padding can be performed on the position where the data is missing.
Referring to fig. 5 (a), the audio frame 5 in the second sub-audio segment is slid to the position of the audio frame 0; at this time the similarity between the audio fingerprints (x0, x1, x2) and (x5, 0, 0) can be calculated. (b)-(e) are handled in the same way, so that a total of 5 similarities can be calculated. If the similarities of (a)-(e) are 0.2, 0.2, 0.9, 0.3, 0.1, the coordinate of the similarity peak in the similarity sequence is the center coordinate of the similarity sequence, and the first sub-audio segment and the second sub-audio segment are considered to be successfully matched.
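To make the peak-centering criterion concrete, the sketch below reproduces a Fig. 5 style computation in Python. It is a minimal illustration rather than the patented implementation: it assumes each audio fingerprint is a single scalar per frame and uses a plain dot product as the per-offset similarity, and the function names are illustrative.

```python
import numpy as np

def similarity_sequence(fp_first, fp_second):
    """Slide the second sub-audio segment across the first one.

    For N frames per sub-segment this yields the 2N-1 similarity values
    described above; numpy's "full" correlation zero-pads the positions
    where the two segments do not overlap.
    """
    return np.correlate(np.asarray(fp_first, float),
                        np.asarray(fp_second, float), mode="full")

def is_match(fp_first, fp_second):
    """Match succeeds only if the similarity peak sits on the centre term."""
    sims = similarity_sequence(fp_first, fp_second)
    center = len(sims) // 2        # the N-th of the 2N-1 terms (0-based index N-1)
    return int(np.argmax(sims)) == center

# Toy check with 3 frames per sub-segment, mirroring Fig. 5:
print(is_match([0.1, 0.9, 0.2], [0.1, 0.9, 0.2]))  # True: peak at the centre
print(is_match([0.1, 0.9, 0.2], [0.9, 0.2, 0.0]))  # False: content is shifted
```

In practice each per-frame fingerprint is typically a vector rather than a scalar, so the per-offset similarity would be computed over the overlapping fingerprint vectors, but the peak-position test stays the same.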
As described above, the first sub-audio segment comes from the first audio segment, which is a standard version, while the second sub-audio segment comes from the second audio segment, which is a non-standard version separated from the video. Therefore, if matching is performed based only on the similarity between the audio fingerprints corresponding to the first sub-audio segment and the second sub-audio segment, the accuracy of the matching result may be limited. In the embodiment of the invention, the similarity sequence is obtained by performing the cross-correlation operation on the audio fingerprints corresponding to the first sub-audio segment and the second sub-audio segment, and whether the coordinate of the similarity peak in the similarity sequence is the center coordinate of the similarity sequence is judged, which improves the accuracy of the matching result.
In an embodiment of the present invention, the step of determining the start coordinate of the audio segment to be clipped in the first audio segment based on the coordinates of the audio frames in the i-th first sub-audio segment, and determining the non-alignment point coordinate in the second audio segment based on the coordinates of the audio frames in the i-th second sub-audio segment, may specifically include:
taking the number of third audio frames as the shortening step length, synchronously shortening the i-th first sub-audio segment and the i-th second sub-audio segment, and matching the shortened first sub-audio segment with the shortened second sub-audio segment, until the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched; taking the coordinate of the last audio frame in the successfully matched, shortened i-th first sub-audio segment as the start coordinate of the audio segment to be cut in the first audio segment, and taking the coordinate of the last audio frame in the successfully matched, shortened i-th second sub-audio segment as the non-alignment point coordinate in the second audio segment.
The number of third audio frames may be predetermined and is smaller than the number of first audio frames; for example, if the number of first audio frames is 10 frames, the number of third audio frames may be 3 frames or 1 frame.
In the embodiment of the invention, synchronously shortening the i-th first sub-audio segment and the i-th second sub-audio segment and matching the shortened first sub-audio segment with the shortened second sub-audio segment, until they are successfully matched, can be understood as follows: for a first sub-audio segment and a second sub-audio segment that were not successfully matched, both are synchronously shortened by the number of third audio frames, and the shortened first sub-audio segment and second sub-audio segment are matched; if the matching is still not successful, both are further shortened by the number of third audio frames and matched again, and so on, until the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched.
As an example, if the number of third audio frames is 2 frames, the 11th-20th audio frames in the first audio segment form the 2nd first sub-audio segment, the 11th-20th audio frames in the second audio segment form the 2nd second sub-audio segment, and the 2nd first sub-audio segment and the 2nd second sub-audio segment are not successfully matched,
then the 11th-18th audio frames in the first audio segment are taken as the shortened first sub-audio segment and the 11th-18th audio frames in the second audio segment are taken as the shortened second sub-audio segment, and the shortened first sub-audio segment is matched with the shortened second sub-audio segment; if the matching fails, the first sub-audio segment and the second sub-audio segment are shortened further, the 11th-16th audio frames in the first audio segment are matched with the 11th-16th audio frames in the second audio segment, and so on, until the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched.
After the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched, the coordinate of the last audio frame in the successfully matched, shortened i-th first sub-audio segment is taken as the start coordinate of the audio segment to be cut in the first audio segment, and the coordinate of the last audio frame in the successfully matched, shortened i-th second sub-audio segment is taken as the non-alignment point coordinate in the second audio segment.
As an example, if the shortened first sub-audio segment consists of the 11th-16th frames, the shortened second sub-audio segment consists of the 11th-16th frames, and the two are successfully matched, the start coordinate of the audio segment to be cut in the first audio segment is the 16th frame, and the non-alignment point coordinate in the second audio segment is the 16th frame.
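A rough sketch of this synchronous-shortening search is given below, assuming 0-based frame coordinates, per-frame scalar fingerprints, and the is_match() helper from the earlier cross-correlation sketch; the names, and the behaviour when no shortened pair matches at all, are assumptions made for illustration.

```python
def locate_split(fp1, fp2, start1, start2, seg_len, step):
    """Shorten the i-th first and second sub-audio segments in lockstep.

    fp1, fp2       : per-frame fingerprints of the first/second audio segments
    start1, start2 : coordinate of the first frame of the i-th sub-segments
    seg_len        : first number of audio frames (original sub-segment length)
    step           : third number of audio frames (the shortening step)
    Returns (start_coordinate, non_alignment_coordinate): the last-frame
    coordinates of the successfully matched, shortened sub-segments.
    """
    length = seg_len
    while length > 0:
        if is_match(fp1[start1:start1 + length], fp2[start2:start2 + length]):
            return start1 + length - 1, start2 + length - 1
        length -= step              # shorten both sub-segments synchronously
    # Assumed fallback: nothing matched, so the divergence starts right at the
    # boundary of the previous (successfully matched) sub-segment.
    return start1 - 1, start2 - 1
```

With the 1-based frame numbers of the example above (10-frame sub-segments starting at the 11th frame and a 2-frame step), the search would test the 11th-20th, 11th-18th and 11th-16th frames in turn and, on success, report the 16th frame as both coordinates.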
It should be noted that, in the embodiment of the present invention, if the start coordinate of the audio segment to be clipped in the first audio segment is denoted x11, then, since x11 is determined based on the coordinates of the audio frames in a first sub-audio segment that was successfully matched with a second sub-audio segment, the audio frames of that first sub-audio segment do not belong to the content of the audio segment to be clipped. Therefore, when the position interval of the audio segment to be clipped is subsequently determined in order to clip the first audio segment, the endpoint coordinates are not included in the position interval of the audio segment to be clipped. That is, if the termination coordinate of the audio segment to be cut in the first audio segment is x12, the position interval of the audio segment to be cut in the first audio segment is (x11, x12).
The following description refers to a specific example. Fig. 6 is an exemplary diagram of a matching process of first sub-audio segments and second sub-audio segments provided by an embodiment of the present invention; the shaded portion in fig. 6 is the audio content that the first audio segment has in excess of the second audio segment. Fig. 6 (a) shows the first sub-audio segments A, B, C and the second sub-audio segments A', B', C'. The first sub-audio segment A and the second sub-audio segment A' are successfully matched, and the first sub-audio segment B and the second sub-audio segment B' are successfully matched. If the first sub-audio segment C and the second sub-audio segment C' are not successfully matched, the first sub-audio segment C and the second sub-audio segment C' are synchronously shortened until the shortened segments are successfully matched. D shown in (b) is the first sub-audio segment obtained after shortening the first sub-audio segment C, and D' is the second sub-audio segment obtained after shortening the second sub-audio segment C'. If D and D' are successfully matched, the coordinate of the last audio frame in D is taken as the start coordinate x11 of the audio segment to be cut in the first audio segment, and the coordinate of the last audio frame in D' is taken as the non-alignment point coordinate in the second audio segment.
In practical application, when the first audio segment and the second audio segment are matched using the first number of audio frames as the segment length, the start position of the audio segment to be cut in the first audio segment may, with a certain probability, fall in the middle of a first sub-audio segment; in that case, when the matching of that first sub-audio segment and second sub-audio segment fails, it is difficult to accurately determine the start coordinate of the audio segment to be cut in the first audio segment and the non-alignment point coordinate in the second audio segment. In the embodiment of the invention, when a first sub-audio segment and a second sub-audio segment are not successfully matched, they are synchronously shortened until the shortened first sub-audio segment and second sub-audio segment are successfully matched; the coordinate of the last audio frame in the shortened first sub-audio segment is taken as the start coordinate of the audio segment to be cut in the first audio segment, and the coordinate of the last audio frame in the shortened second sub-audio segment is taken as the non-alignment point coordinate in the second audio segment, so that the accuracy of locating the audio segment to be cut is improved.
In one embodiment of the present invention, there may be a plurality of audio segments to be clipped distributed in the first audio segment. Fig. 7 is another exemplary diagram of a first audio segment and a second audio segment provided by an embodiment of the present invention; as shown in fig. 7, the first audio segment includes the standard-version audio segments m, n, o as well as the audio segments k and p to be clipped, while the second audio segment includes only the non-standard-version audio segments m', n', o'.
Therefore, in this embodiment, the audio clipping method provided by the embodiment of the present invention further includes:
determining the first audio segment after the termination coordinate as a new first audio segment, determining the second audio segment after the non-alignment point coordinate as a new second audio segment, and returning to the step of sequentially determining one or more first sub-audio segments in the first audio segment with the number of first audio frames as the segment length, sequentially determining one or more second sub-audio segments in the second audio segment, and sequentially matching the first sub-audio segments and the second sub-audio segments of the same order; after obtaining the start coordinate and the termination coordinate of the audio segment to be cut determined for the new first audio segment, returning to the step of determining the first audio segment after the current termination coordinate as a new first audio segment and determining the second audio segment after the current non-alignment point coordinate as a new second audio segment, until all first sub-audio segments and second sub-audio segments are successfully matched.
Specifically, after the start coordinate and the termination coordinate of one audio segment to be cut in the first audio segment are located, the subsequent audio content in the first audio segment and the second audio segment still needs to be matched, so as to locate other audio segments to be cut that may be included in the first audio segment. For the specific process of determining the first audio segment after the termination coordinate as a new first audio segment, determining the second audio segment after the non-alignment point coordinate as a new second audio segment, and matching the first sub-audio segments in the newly determined first audio segment with the second sub-audio segments in the newly determined second audio segment, reference may be made to the description of steps S103-S104. After the second audio segment to be cut included in the first audio segment has been located in this way, locating the subsequent audio segments to be cut proceeds similarly, until all the audio content in the first audio segment and the second audio segment has been matched.
In the following, the embodiment of the present invention is described by way of example with reference to fig. 7. A1 and B1 are the original first audio segment and second audio segment; by matching the first sub-audio segments in A1 with the second sub-audio segments in B1 according to steps S103-S104, the start coordinate x11 and the termination coordinate x12 of the first audio segment to be cut in the first audio segment can be determined, thereby locating the first audio segment to be cut, k. After k is located, the first audio segment A2 after the termination coordinate x12 is determined as the new first audio segment, the second audio segment B2 after the non-alignment point coordinate x1 is determined as the new second audio segment, and a new round of matching is performed on the first sub-audio segments in A2 and the second sub-audio segments in B2 according to steps S103-S104, so that the start coordinate x21 and the termination coordinate x22 of the second audio segment to be cut in the first audio segment are determined, thereby locating the second audio segment to be cut, p. After p is located, the first audio segment A3 after the termination coordinate x22 is determined as the new first audio segment, the second audio segment B3 after the non-alignment point coordinate x2 is determined as the new second audio segment, and a new round of matching is performed on the first sub-audio segments in A3 and the second sub-audio segments in B3; if the subsequent first audio segment also includes other audio segments to be cut, those audio segments are located similarly.
According to the embodiment of the invention, the start coordinates and the end coordinates of the plurality of audio segments to be cut in the first audio segment can be obtained, for example, the start coordinates of the ith audio segment to be cut in the first audio segment are xi1, and the end coordinates are xi2.
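The outer loop that walks through several audio segments to be cut can be sketched as follows. Only the looping structure is taken from the description above; find_cut_interval() is a hypothetical helper standing in for one full round of steps S103-S104 (first-stage matching, synchronous shortening, and the search for the termination coordinate), and the convention of resuming one frame after each located coordinate is an assumption.

```python
def locate_all_cuts(fp1, fp2, find_cut_interval):
    """Collect the (xi1, xi2) intervals of every audio segment to be cut."""
    cuts = []
    pos1, pos2 = 0, 0            # origins of the current "new" first/second segments
    while pos1 < len(fp1) and pos2 < len(fp2):
        found = find_cut_interval(fp1, fp2, pos1, pos2)
        if found is None:        # the remaining content matched completely
            break
        start, end, non_aligned = found          # xi1, xi2, non-alignment coordinate
        cuts.append((start, end))
        pos1 = end + 1           # new first segment: after the termination coordinate
        pos2 = non_aligned + 1   # new second segment: after the non-alignment coordinate
    return cuts
```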
In the embodiment of the invention, by determining the first audio segment after the termination coordinate as the new first audio segment and determining the second audio segment after the non-alignment point coordinate as the new second audio segment, a plurality of audio segments to be cut included in the first audio segment can be located, which gives the method better applicability.
In one embodiment of the present invention, the step of clipping the first audio segment based on the start coordinate and the end coordinate of the audio segment to be clipped in the first audio segment specifically includes:
for each determined audio segment to be cut, cutting the first audio segment according to the start coordinate and the termination coordinate corresponding to that audio segment to be cut.
In the embodiment of the invention, after the start coordinate and the termination coordinate corresponding to one audio segment to be cut are determined, they can first be recorded; after all the audio content in the first audio segment and the second audio segment has been matched, the first audio segment is cut based on the start coordinate and the termination coordinate corresponding to each audio segment to be cut, that is, the audio content between xi1 and xi2 is cut out.
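As a small illustration of this final step, the sketch below drops every frame that falls strictly inside one of the recorded (xi1, xi2) intervals, matching the open-interval convention noted earlier; frames stands for any per-frame sequence, and the names are illustrative.

```python
def clip_segments(frames, cut_intervals):
    """Remove the audio content between xi1 and xi2 for every located interval."""
    def keep(idx):
        return not any(start < idx < end for start, end in cut_intervals)
    return [frame for idx, frame in enumerate(frames) if keep(idx)]

# Frames 3..6 of a 10-frame segment lie inside the interval (2, 7) and are cut.
print(clip_segments(list(range(10)), [(2, 7)]))   # [0, 1, 2, 7, 8, 9]
```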
In the embodiment of the invention, the first audio segment after the termination coordinate is determined as the new first audio segment, the second audio segment after the non-alignment point coordinate is determined as the new second audio segment, and the first audio segment is cut based on the start coordinate and the termination coordinate of each located audio segment to be cut, so that the applicability of the audio clipping method is improved.
Fig. 8 is a schematic diagram of an audio clipping method according to an embodiment of the present invention, and the audio clipping method according to the embodiment of the present invention is further described below with reference to fig. 8.
As shown in fig. 8, background sound separation is performed on the approved version video medium to obtain the separated international sound MnE_2, i.e. the second audio segment, while the original international sound MnE_1 serves as the first audio segment. Audio fingerprint extraction is then performed on the original international sound and the separated international sound respectively, yielding the audio fingerprint sequence FP1 corresponding to the first audio segment and the audio fingerprint sequence FP2 corresponding to the second audio segment. The audio fingerprint FP1_sub_i corresponding to the i-th first sub-audio segment in the first audio segment is matched with the audio fingerprint FP2_sub_i corresponding to the i-th second sub-audio segment in the second audio segment; if the matching is successful, FP1_sub_i+1 and FP2_sub_i+1 are matched next; if the matching is unsuccessful, the FP1_sub_i corresponding to the shortened first sub-audio segment is matched with the FP2_sub_i corresponding to the shortened second sub-audio segment until the matching succeeds, and the start coordinate xi1 of the audio segment to be cut in the first audio segment and the non-alignment point coordinate in the second audio segment are obtained based on the coordinates of the audio frames in the successfully matched, shortened sub-audio segments. The third sub-audio segment after the non-alignment point coordinate in the second audio segment is then matched, in a sliding manner, with one or more fourth sub-audio segments after the start coordinate in the first audio segment, so as to obtain the termination coordinate xi2 of the audio segment to be cut in the first audio segment. After matching is completed, the original international sound can be clipped based on the obtained xi1 and xi2, and the clipped international sound MnE_out is obtained.
In the embodiment of the application, background sound separation is performed on the approved version video medium to obtain the separated international sound MnE_2, audio fingerprint extraction is performed on MnE_2 and the original international sound MnE_1 respectively, sliding matching is then performed between the audio fingerprint FP1_sub_i corresponding to the first sub-audio segment and the audio fingerprint FP2_sub_i corresponding to the second sub-audio segment, the start coordinate of the audio segment to be cut is determined based on the coordinates of the audio frames in the first sub-audio segment that was not successfully matched, the third sub-audio segment after the non-alignment point coordinate in the second audio segment is then matched with one or more fourth sub-audio segments after the start coordinate in the first audio segment to obtain the termination coordinate of the audio segment to be cut, and the original international sound is clipped based on the start coordinates and termination coordinates of the audio segments to be cut to obtain the international sound corresponding to the approved version video, so that the repair efficiency of the international sound can be improved and the repair cost of the international sound can be reduced.
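Read end to end, the Fig. 8 flow could be driven by a top-level routine like the one below. It is only an orienting sketch: separate_background(), extract_fingerprints() and locate_all_cuts() are hypothetical stand-ins for the background-sound separation, audio-fingerprint extraction and matching stages (their exact signatures are not specified by this document), and clip_segments() is the clipping sketch shown earlier.

```python
def repair_international_sound(approved_video_audio, original_mne,
                               separate_background, extract_fingerprints,
                               locate_all_cuts, clip_segments):
    """Pipeline sketch: original MnE_1 plus approved video audio -> clipped MnE_out."""
    mne_2 = separate_background(approved_video_audio)     # separated international sound
    fp1 = extract_fingerprints(original_mne)               # fingerprints of MnE_1 (FP1)
    fp2 = extract_fingerprints(mne_2)                       # fingerprints of MnE_2 (FP2)
    cut_intervals = locate_all_cuts(fp1, fp2)               # all (xi1, xi2) intervals
    return clip_segments(original_mne, cut_intervals)       # clipped MnE_out
```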
Corresponding to the above method embodiment, the embodiment of the present application further provides an audio clipping apparatus, as shown in fig. 9, including:
an acquisition module 901, configured to acquire a first audio segment and a second audio segment; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video;
The extracting module 902 is configured to extract audio fingerprints of the first audio segment and the second audio segment, respectively, to obtain audio fingerprints of each audio frame in the first audio segment and the second audio segment, respectively;
a first matching module 903, configured to sequentially determine one or more first sub-audio segments in the first audio segment by using the number of the first audio frames as a segment length, sequentially determine one or more second sub-audio segments in the second audio segment, and sequentially match the first sub-audio segment and the second sub-audio segment in the same order until the i-th first sub-audio segment and the i-th second sub-audio segment are not successfully matched, determine a start coordinate of the audio segment to be cut in the first audio segment based on a coordinate of the audio frame in the i-th first sub-audio segment, and determine a non-aligned point coordinate in the second audio segment based on a coordinate of the audio frame in the i-th second sub-audio segment; the matching of the first sub-audio segment and the second sub-audio segment successfully represents that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets the preset condition;
a second matching module 904, configured to determine a third sub-audio segment from the second audio segments located after the non-aligned point coordinates, determine one or more fourth sub-audio segments from the first audio segments located after the start coordinates, match the third sub-audio segment with the one or more fourth sub-audio segments in sequence until a fourth sub-audio segment successfully matched with the third sub-audio segment is determined, and determine a termination coordinate of the audio segment to be cut in the first audio segment based on coordinates of an audio frame in the fourth sub-audio segment successfully matched;
The clipping module 905 is configured to clip the first audio segment based on the start coordinate and the end coordinate of the audio segment to be clipped in the first audio segment, and obtain a standard audio segment adapted to the non-full version video.
According to the audio clipping device provided by the embodiment of the invention, one or more first sub-audio segments are sequentially determined in the first audio segment, one or more second sub-audio segments are sequentially determined in the second audio segment, and the first sub-audio segments and second sub-audio segments of the same order are matched in turn; the start coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the first sub-audio segment that was not successfully matched, and the non-alignment point coordinate in the second audio segment is determined based on the coordinates of the audio frames in the second sub-audio segment that was not successfully matched; a third sub-audio segment is determined in the second audio segment after the non-alignment point coordinate, one or more fourth sub-audio segments are determined in the first audio segment after the start coordinate, the third sub-audio segment is matched with the fourth sub-audio segments in turn, and the termination coordinate of the audio segment to be clipped in the first audio segment is determined based on the coordinates of the audio frames in the successfully matched fourth sub-audio segment; the first audio segment is then clipped based on the start coordinate and the termination coordinate of the audio segment to be clipped, and the standard audio segment adapted to the non-full version video is obtained. By applying the audio clipping device provided by the embodiment of the invention, the audio segments to be clipped are located by matching the sub-audio segments in the first audio segment and the second audio segment, so that the audio content that the first audio segment has in excess of the second audio segment can be located and clipped without manually comparing the first audio segment with the second audio segment, and the standard audio segment adapted to the non-full version video is obtained; the efficiency of locating and clipping the excess audio content can thus be improved, and, when applied to international sound repair, the repair efficiency of the international sound can be improved and the repair cost can be reduced.
In one embodiment of the present invention, the first matching module 903 includes:
the judging unit is used for carrying out cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
if yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the current first sub-audio piece and the current second sub-audio piece are not successfully matched.
In one embodiment of the present invention, the first matching module 903 includes:
the matching unit is used for synchronously shortening the ith first sub-audio segment and the ith second sub-audio segment by taking the number of third audio frames as a shortening step length, matching the shortened first sub-audio segment and the shortened second sub-audio segment until the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched, taking the coordinate of the last audio frame in the ith first sub-audio segment which is successfully matched as the initial coordinate of the audio segment to be cut in the first audio segment, and taking the coordinate of the last audio frame in the ith second sub-audio segment which is successfully matched as the non-alignment point coordinate in the second audio segment.
In one embodiment of the invention, the apparatus further comprises:
a return module, configured to determine a first audio segment after the termination coordinate as a new first audio segment, determine a second audio segment after the non-aligned point coordinate as a new second audio segment, return to the step of determining one or more first sub-audio segments in the first audio segment sequentially with the number of first audio frames as a segment length, determine one or more second sub-audio segments in the second audio segment sequentially, and match the first sub-audio segment and the second sub-audio segment in the same order sequentially, obtain a start coordinate and a termination coordinate of an audio segment to be cut determined for the new first audio segment, return to the step of determining the first audio segment after the current termination coordinate as the new first audio segment, and determine the second audio segment after the current non-aligned point coordinate as the new second audio segment until the first sub-audio segment and the second sub-audio segment are both matched successfully.
In one embodiment of the present invention, clipping module 905 is specifically configured to:
and cutting the first audio segment according to the start coordinate and the end coordinate corresponding to the audio segment to be cut aiming at each determined audio segment to be cut.
In one embodiment of the invention, the complete version video is a review version video, the incomplete version video is an approved version video obtained by performing pruning processing on the review version video, the standard version audio segment is the standard international sound corresponding to the complete version video, and the non-standard version audio segment is the international sound separated from the approved version video.
The embodiment of the invention also provides an electronic device, as shown in fig. 10, which comprises a processor 101, a communication interface 102, a memory 103 and a communication bus 104, wherein the processor 101, the communication interface 102 and the memory 103 complete communication with each other through the communication bus 104,
a memory 103 for storing a computer program;
the processor 101 is configured to execute a program stored in the memory 103, and implement the following steps:
acquiring a first audio segment and a second audio segment; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video.
And respectively extracting the audio fingerprints of the first audio segment and the second audio segment to respectively obtain the audio fingerprints of each audio frame in the first audio segment and the second audio segment.
Sequentially determining one or more first sub-audio segments in the first audio segment by taking the number of the first audio frames as the segment length, sequentially determining one or more second sub-audio segments in the second audio segment, sequentially matching the first sub-audio segments and the second sub-audio segments in the same sequence until the ith first sub-audio segment and the ith second sub-audio segment are not successfully matched, determining the initial coordinate of the audio segment to be cut in the first audio segment based on the coordinate of the audio frame in the ith first sub-audio segment, and determining the non-alignment point coordinate in the second audio segment based on the coordinate of the audio frame in the ith second sub-audio segment; the matching of the first sub-audio segment and the second sub-audio segment successfully represents that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets the preset condition.
And determining a third sub-audio piece in the second audio piece positioned behind the non-aligned point coordinates by taking the number of the second audio frames as the segmentation length, determining one or more fourth sub-audio pieces in the first audio piece positioned behind the initial coordinates, sequentially matching the third sub-audio piece with the one or more fourth sub-audio pieces until the fourth sub-audio piece successfully matched with the third sub-audio piece is determined, and determining the termination coordinate of the audio piece to be cut in the first audio piece based on the coordinates of the audio frames in the fourth sub-audio piece successfully matched.
And cutting the first audio segment based on the starting coordinate and the ending coordinate of the audio segment to be cut in the first audio segment to obtain a standard audio segment matched with the non-complete version of video.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the audio clipping method according to any of the above embodiments.
In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform the audio clipping method of any of the above embodiments is also provided.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, electronic devices, computer readable storage media and computer program product embodiments, the description is relatively simple as it is substantially similar to method embodiments, as relevant points are found in the partial description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (10)
1. An audio clipping method, comprising:
acquiring a first audio segment and a second audio segment; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video;
respectively extracting audio fingerprints of the first audio segment and the second audio segment to respectively obtain audio fingerprints of each audio frame in the first audio segment and the second audio segment;
sequentially determining one or more first sub-audio segments in the first audio segment by taking the number of the first audio frames as the segment length, sequentially determining one or more second sub-audio segments in the second audio segment, sequentially matching the first sub-audio segment and the second sub-audio segment in the same sequence until the ith first sub-audio segment and the ith second sub-audio segment are not successfully matched, determining the initial coordinate of the audio segment to be cut in the first audio segment based on the coordinate of the audio frame in the ith first sub-audio segment, and determining the non-alignment point coordinate in the second audio segment based on the coordinate of the audio frame in the ith second sub-audio segment; the matching of the first sub-audio segment and the second sub-audio segment successfully represents that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets the preset condition;
Determining a third sub-audio segment in the second audio segment located behind the non-aligned point coordinates by taking the second audio frame number as a segment length, determining one or more fourth sub-audio segments in the first audio segment located behind the initial coordinates, sequentially matching the third sub-audio segment with the one or more fourth sub-audio segments until a fourth sub-audio segment successfully matched with the third sub-audio segment is determined, and determining a termination coordinate of the audio segment to be cut in the first audio segment based on the coordinates of the audio frame in the fourth sub-audio segment successfully matched;
and cutting the first audio segment based on the starting coordinate and the ending coordinate of the audio segment to be cut in the first audio segment to obtain a standard audio segment matched with the incomplete version video.
2. The method of claim 1, wherein determining whether the first sub-audio segment and the second sub-audio segment match is successful is based on:
performing cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
If yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the current first sub-audio piece and the current second sub-audio piece are not successfully matched.
3. The method of claim 1, wherein the step of determining start coordinates of the audio segment to be clipped in the first audio segment based on coordinates of the audio frame in the i-th first sub-audio segment and determining non-aligned point coordinates in the second audio segment based on position coordinates of the i-th second sub-audio segment comprises:
and taking the number of third audio frames as a shortening step length, synchronously shortening the ith first sub-audio segment and the ith second sub-audio segment, and matching the shortened first sub-audio segment and the shortened second sub-audio segment until the shortened first sub-audio segment and the shortened second sub-audio segment are successfully matched, taking the coordinate of the last audio frame in the ith first sub-audio segment which is successfully matched as the initial coordinate of the audio segment to be cut in the first audio segment, and taking the coordinate of the last audio frame in the ith second sub-audio segment which is successfully matched as the non-alignment point coordinate in the second audio segment.
4. The method as recited in claim 1, further comprising:
and determining a first audio segment after the termination coordinate as a new first audio segment, determining a second audio segment after the non-aligned point coordinate as a new second audio segment, returning to the step of determining one or more first sub-audio segments in the first audio segment by using the number of the first audio frames as a segment length, determining one or more second sub-audio segments in the second audio segment in turn, and sequentially matching the first sub-audio segment and the second sub-audio segment in the same order, obtaining a start coordinate and a termination coordinate of the audio segment to be cut determined for the new first audio segment, returning to the step of determining the first audio segment after the current termination coordinate as the new first audio segment, and determining the second audio segment after the current non-aligned point coordinate as the new second audio segment until the first sub-audio segment and the second sub-audio segment are successfully matched.
5. The method of claim 4, wherein the step of clipping the first audio segment based on the start and end coordinates of the audio segment to be clipped in the first audio segment comprises:
And cutting the first audio segment according to the start coordinate and the end coordinate corresponding to the audio segment to be cut aiming at each determined audio segment to be cut.
6. The method of claim 1, wherein the complete version video is a review version video submitted for censoring, the non-complete version video is an approved version video obtained by performing censoring processing on the review version video, the standard version audio segment is a standard international sound corresponding to the complete version video, and the non-standard version audio segment is an international sound separated from the approved version video.
7. An audio clipping apparatus, comprising:
the acquisition module is used for acquiring the first audio piece and the second audio piece; the first audio segment is a standard audio segment corresponding to the complete version of video, and the second audio segment is a non-standard audio segment separated from the non-complete version of video;
the extraction module is used for extracting the audio fingerprints of the first audio segment and the second audio segment respectively to obtain the audio fingerprints of each audio frame in the first audio segment and the second audio segment respectively;
the first matching module is used for sequentially determining one or more first sub-audio segments in the first audio segment by taking the number of first audio frames as the segment length, sequentially determining one or more second sub-audio segments in the second audio segment, sequentially matching the first sub-audio segments and the second sub-audio segments in the same order until the ith first sub-audio segment and the ith second sub-audio segment are not successfully matched, determining the initial coordinate of an audio segment to be cut in the first audio segment based on the coordinate of an audio frame in the ith first sub-audio segment, and determining the non-aligned point coordinate in the second audio segment based on the coordinate of an audio frame in the ith second sub-audio segment; the matching of the first sub-audio segment and the second sub-audio segment successfully represents that the similarity between the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment meets the preset condition;
A second matching module, configured to determine a third sub-audio segment in the second audio segment located after the non-aligned point coordinate, determine one or more fourth sub-audio segments in the first audio segment located after the start coordinate, and match the third sub-audio segment with one or more fourth sub-audio segments in sequence until a fourth sub-audio segment successfully matched with the third sub-audio segment is determined, and determine a termination coordinate of the audio segment to be cut in the first audio segment based on coordinates of an audio frame in the fourth sub-audio segment successfully matched;
and the clipping module is used for clipping the first audio segment based on the starting coordinate and the ending coordinate of the audio segment to be clipped in the first audio segment to acquire the standard audio segment matched with the incomplete version video.
8. The apparatus of claim 7, wherein the first matching module comprises:
the judging unit is used for carrying out cross-correlation calculation on the audio fingerprint corresponding to the first sub-audio segment and the audio fingerprint corresponding to the second sub-audio segment to obtain a similarity sequence, and judging whether the coordinate of a similarity peak value in the similarity sequence is the center coordinate of the similarity sequence or not;
If yes, determining that the first sub-audio segment and the second sub-audio segment are successfully matched;
if not, determining that the current first sub-audio piece and the current second sub-audio piece are not successfully matched.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-6 when executing a program stored on a memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741308.0A CN116612784A (en) | 2023-06-21 | 2023-06-21 | Audio clipping method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741308.0A CN116612784A (en) | 2023-06-21 | 2023-06-21 | Audio clipping method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116612784A (en) | 2023-08-18
Family
ID=87676608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310741308.0A (CN116612784A, pending) | Audio clipping method and device | 2023-06-21 | 2023-06-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116612784A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||