WO2022129104A1

WO2022129104A1 - Method and system for automatically synchronizing video content and audio content

Info

Publication number: WO2022129104A1
Application number: PCT/EP2021/085781
Authority: WO
Inventors: Philippe Guillaud; Igal Cohen Hadria; André MANOUKIAN; Hervé GOURDIKIAN
Original assignee: Imuze France
Priority date: 2020-12-14
Filing date: 2021-12-14
Publication date: 2022-06-23
Also published as: FR3119063B1; FR3119063A1

Abstract

The invention relates to a method for synchronizing an audio sample with a sequence of moving images, or video sequence, the method comprising the steps of: - analysing (20) the video sequence to generate characteristic data, at least one of the characteristic data generated being representative of a tempo value calculated for the video sequence; - searching (21), through an audio file database, for one or more audio files containing an audio sample featuring one or more characteristics compatible with the characteristic data generated from the video sequence, at least one of the compatible characteristics being a tempo value of the audio file in question; - selecting (22), from among the audio samples found, that audio sample which has the best compatibility with the video sequence; - synchronizing (23) the selected audio sample with the video sequence; - generating (26) a video file containing the video sequence synchronized with the selected audio sample.

Description

Method and system for automatic synchronization of video content and audio content

The invention relates to the field of the production of sound effects for sequences of animated images, or videos.

With the development of online video hosting platforms, and the proliferation of portable devices capable of capturing high quality video (miniature digital cameras and mobile phones), the number of videos created and posted online, whether by professionals or amateurs, has experienced exponential growth for several years. The creation of a video very often requires the creation or adaptation of a soundtrack to appropriately accompany the images of the video. If you are not able to compose music to accompany a video, the most immediate solution is to use an existing musical title whose character is likely to best match the content of the video. However, finding a suitable musical title can be very difficult since the creator of the video will generally choose a title among those he knows, which will represent for most people a very limited contingent of titles, whereas today there are nearly 150 million existing musical titles.

The present invention is based on the observation that, among all the titles available in the catalogs of music industry companies, which represent more than 150 million titles, barely 0.1% of these titles are exploited and generate income for their authors or assigns. A considerable quantity of existing titles is therefore totally unused, and therefore unrecognized. However, most of these unused titles necessarily have qualities and characteristics that make them likely to be used for the sound design of a video. Thus, the less a title is used, the less it will be known and the less likely it will be used by a person doing video editing, whether that person is a video and/or sound editing professional or not.

In addition, for people who are not video and/or sound editing professionals, finding suitable music for the soundtrack of a video is not the only difficulty: it is then necessary to adapt the chosen title to the duration of the video, which can vary widely (whereas almost all contemporary musical titles are formatted so as not to reduce their chances of being broadcast by radio stations at prime time, their duration thus often being very close to 3 minutes), and finally to synchronize the music with the video.

The aim of the present invention is to propose a method and a system making it possible to automatically carry out the sound dressing of a sequence of animated images from a database containing audio files, the content of which may for example be musical titles, excerpts of musical titles, various sounds, etc. The present invention also aims to provide such a method and system making it possible to carry out the sound dressing of a video very quickly, typically in a few seconds.

To this end, the invention relates to a method for synchronizing an audio sample with a sequence of animated images, or video sequence, the method comprising the steps of:

- analyzing the video sequence to generate characteristic data, at least one of the characteristic data generated being representative of a tempo value calculated for the video sequence;

- searching, in a database of audio files, for one or more audio files containing an audio sample having one or more characteristics compatible with the characteristic data generated from the video sequence, at least one of the compatible characteristics being a tempo value of the audio sample considered;

- selecting, among the audio samples found, the audio sample which presents the best compatibility with the video sequence;

- synchronizing the selected audio sample with the video sequence;

- generating a video file containing the video sequence synchronized with the selected audio sample.

Thus, the method in accordance with the invention makes it possible to automatically associate a digital audio file containing an audio sample with a file containing a video, and to synchronize this audio sample with the video, so as to offer a user an appropriate sound design for this video. By detecting a tempo value of a video, comparable to the tempo value of a piece of music, the method according to the invention makes it possible to find compatible audio samples in a database very quickly (typically in a few seconds). The compatibility between the tempo value assigned to the video and the tempo value of the song used has a decisive effect on the compatibility between the visual aspect and the sound aspect of a video, as perceived by a person viewing the video. Thus, by favoring the tempo criterion, the method in accordance with the invention makes it possible to offer audio samples compatible with a video submitted by a user extremely rapidly.

In one embodiment, an audio sample is compatible if it has a tempo value equal or close to the tempo value of the video sequence or a tempo value equal to or close to a multiple or a sub-multiple of the tempo value of the video sequence.

In one embodiment, the calculated tempo value for the video sequence is determined by detecting characteristic events occurring during the video sequence, such as scene changes.

In one embodiment, the detection of a characteristic event such as a change of scene is carried out by chromatic analysis of each image of the video sequence, a change of scene being detected if a significant change in color is measured between two successive images.

In one embodiment, the step of selecting, from among the audio samples found, the audio sample which has the best compatibility with the video sequence includes a sub-step of calculating a compatibility score.

In one embodiment, the synchronization step includes a sub-step of modifying the duration of the audio sample to adapt it to the duration of the video sequence.

In one embodiment, the duration modification sub-step is performed by recombining one or more portions of the audio sample and/or one or more blocks of a portion of the audio sample.

In one embodiment, the recombination is carried out so that after modification of the duration, the recombined audio sample has a structure similar to that of the initial audio sample, and comprises for example an introductory part, followed by a central part and a final part.
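By way of illustration only, the recombination of blocks could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `recombine`, the representation of each part or block as a (name, duration) pair, and the rule of cycling through the central blocks in their original order are all hypothetical choices made for the example, not features stated in the description. The introductory and final parts are kept intact, and central blocks are kept (or repeated) as long as they fit within the target duration.

```python
def recombine(intro, central_blocks, outro, target_duration):
    """Build a recombined sample as a list of (name, duration) pairs.

    The introductory and final parts are always kept. Central blocks are
    taken in order, cycling back to the first one if the target is longer
    than the original, for as long as the next block still fits.
    """
    if not central_blocks or any(duration <= 0 for _, duration in central_blocks):
        raise ValueError("central part needs blocks with positive durations")
    budget = target_duration - intro[1] - outro[1]
    kept, used, index = [], 0.0, 0
    while True:
        block = central_blocks[index % len(central_blocks)]
        if used + block[1] > budget:
            break
        kept.append(block)
        used += block[1]
        index += 1
    if not kept:
        # Assumption: a result without any central block is rejected, so that
        # the introduction / central part / final part structure is preserved.
        raise ValueError("target duration too short to keep a central part")
    return [intro] + kept + [outro]


# Hypothetical durations in seconds, chosen only for the example.
central = [("E20", 20.0), ("E22", 20.0), ("E24", 20.0), ("E26", 20.0), ("E28", 20.0)]
shortened = recombine(("E1", 10.0), central, ("E3", 10.0), target_duration=85.0)
print([name for name, _ in shortened])
```

In this sketch the recombined sample is generally slightly shorter than the target; the remaining gap would have to be closed by another means, such as a local tempo adaptation.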

In one embodiment, the synchronization step includes a sub-step of adapting the duration of at least part of the audio sample, consisting in locally decreasing or increasing the tempo value.

In one embodiment, the adaptation sub-step includes modifying the tempo value of at least one block.

In one embodiment, the adaptation sub-step includes modifying the tempo value of at least one first block and of at least one block adjacent to the first block, and preferably includes at least modifying the tempo value of the block immediately preceding the first block and of the block immediately following the first block.

In one embodiment, the tempo values of neighboring blocks of blocks adjacent to the first block are also modified.

In one embodiment, the modifications of the tempo values of the first block, of the adjacent blocks and of the neighboring blocks are carried out in such a way as to obtain a continuous local variation of the tempo value.

In one embodiment, the analysis step comprises the generation of characteristic data relating to the light and/or characteristic data relating to the colors of the images of the video sequence.

In one embodiment, the analysis step comprises the generation of characteristic data relating to the speed of movement of objects appearing in the images of the video sequence.

The invention also relates to a method for generating a digital audio file containing an audio sample, comprising the steps of:

- calculating at least one tempo value of the audio sample;

- generating data relating to a rhythmic structure of the audio sample, by detecting the measures constituting the audio sample, and, for each detected measure, the number of beats it contains;

- generating data relating to a musical structure of the audio sample, by detecting one or more parts constituting the audio sample;

- generating a digital audio file containing the audio sample and the generated data.

In one embodiment, the calculation of the tempo value is carried out by implementing an iterative determination step.

In one embodiment, the tempo value of the audio sample is calculated with an accuracy of less than +/- 0.1 beats per minute, and preferably of the order of +/- 0.01 beats per minute.

In one embodiment, the detection of the measures constituting the audio sample is carried out by detection of a rhythmic pulse.

In one embodiment, the determination of the musical structure is carried out by identifying one or more parts among:

- an introductory part;

- a central part;

- a final part.

The invention also relates to a synchronization method as defined above, in which the audio files stored in the database have previously been generated according to an audio file generation method in accordance with that defined above.

The invention also relates to a computer program product comprising instructions which, when the program is executed by one or more processors, cause the processor or processors to implement the steps of the synchronization method as defined above and/or the steps of a method for generating audio files as defined above.

The invention also relates to a synchronization system for implementing a synchronization method as defined above and/or implementing a method for generating audio files as defined above, the system comprising a terminal configured to transfer a video file containing a video sequence to a server, the server comprising a database of digital audio files containing audio samples and a processor for synchronizing the video sequence transferred by the terminal with an audio sample contained in an audio file stored in the server database.

The present invention will be better understood on reading the following detailed description, made with reference to the accompanying drawings, in which:

[Fig. 1] Figure 1 is a diagram of a system configured to implement a method according to the invention.

[Fig. 2] Figure 2 represents the implementation steps of a method according to the invention.

[Fig. 3a] Figure 3a illustrates a first image extracted from a video during the analysis step carried out in accordance with the method according to the invention.

[Fig. 3b] Figure 3b illustrates a second image extracted from a video during the analysis step carried out in accordance with the method according to the invention.

[Fig. 3c] Figure 3c illustrates a third image extracted from a video during the analysis step carried out in accordance with the method according to the invention.

[Fig. 3d] Figure 3d illustrates a fourth image extracted from a video during the analysis step carried out in accordance with the method according to the invention.

[Fig. 3e] Figure 3e illustrates a fifth image extracted from a video during the analysis step carried out in accordance with the method according to the invention.

[Fig. 4] Figure 4 represents a curve of the tempo value calculated in accordance with the invention for an audio sample.

[Fig. 5] Figure 5 shows a correspondence table between different shades of color and associated musical keys.

[Fig. 6] Figure 6 is a diagram illustrating the implementation of the step of modifying the duration of an audio sample.

FIG. 1 represents a system 1 for synchronizing an audio sample with a video sequence, allowing the implementation of a synchronization method in accordance with the invention. The synchronization system 1 comprises a terminal 10, in particular a local terminal such as a computer or a portable device (mobile telephone, tablet, etc.). The terminal 10 comprises means for exchanging files and data with a server 12, for example through a network 14 such as the Internet in the case of a remote server. The server 12 comprises a processor 16 and a database 18 containing digital audio files, each audio file containing an audio sample (such as a piece of music) capable of being associated with a sequence of animated images, or video sequence.

Figure 2 shows steps for implementing the method according to the invention.

The method comprises a first step 20 of analyzing a video sequence V, for example a video sequence without any soundtrack, submitted by a user, for example by means of the terminal 10. During this analysis step 20, a step of generating characteristic data of the video sequence is implemented, at least one of these characteristic data being representative of a tempo value Tv. An example of the method for calculating this tempo value Tv is detailed below. In the example, this step is implemented by the processor 16 of the server 12 of FIG. 1.

The method then includes a search step 21 in a database of digital audio files, for example the database 18 of the server 12. This search step 21 aims to find at least one, and preferably several, audio samples compatible with the video sequence, that is to say audio samples likely to be appropriately associated with the video sequence submitted by the user. To this end, the database 18 is searched for audio samples having characteristics compatible with the characteristic data of the video sequence generated during the analysis step 20. An audio sample will be retained as being potentially associable with the video sequence if at least one of the compatible characteristics is a tempo value Ta assigned to the audio sample.

When several audio samples have been found, it is determined, during a selection step 22, which of these audio samples has the best compatibility with the video. Preferably, a value representative of a compatibility score is calculated for each audio sample. Thus, the audio sample that has the best compatibility with the video submitted by the user will be the audio sample E whose value representative of the compatibility score is the highest. Advantageously, the compatibility score corresponds to the sum of at least two sub-scores, each sub-score being representative of a degree of compatibility of the audio sample considered with the video relative to a given criterion. Preferably, weighting coefficients are assigned to each of the sub-scores, the sub-score corresponding to the tempo value being associated with the highest weighting coefficient.

When an audio sample E has been selected, it is associated with the video sequence, during a synchronization step 23. An example of implementation of the synchronization step 23 is described in detail below.

After the synchronization step 23, a step 26 of generating a video file containing the video sequence synchronized with the selected audio sample is implemented.

Figures 3a to 3e illustrate images extracted from a video sequence, or sequence of animated images. The video sequence from which these images are extracted lasts in the example 25 seconds.

FIG. 3a illustrates a first image extracted from the video sequence, this image appearing 5 seconds after the start of the video sequence. This first image is part of a subset of images forming, among the set of images constituting the video sequence, a first scene. In this scene, one can for example observe, in the foreground, a motor vehicle 30 moving on a road 32. In the background, one can observe the sun 34 backlighting the entire scene, as well as a line of ridges formed by mountain ranges 36. The backlighting creates a shaded area 38 at the rear of the vehicle 30.

Figures 3b to 3e illustrate four successive images extracted from the same video sequence. These four images are part of a second subset of images which forms, among the set of images constituting the video sequence, a second scene. In this second scene, the same vehicle 30 as that appearing in the first scene can be observed. In figures 3b to 3e, the vehicle is seen from above, traveling on a road 40. On one of the edges of the road, one can observe, in the foreground of the images, the progressive appearance of a geyser 42. The appearance of the geyser 42 coincides with the movement of the vehicle, in the sense that the geyser is at its peak when the vehicle 30 passes in front of it. The four images shown in Figures 3b to 3e appear respectively 9, 10, 11 and 12 seconds after the start of the video. The analysis of the video from which the images of FIGS. 3a to 3e are extracted, carried out during the analysis step 20 in the context of the implementation of the method that is the subject of the invention, is detailed below.

During the analysis step, a first characteristic datum generated relates to one or more characteristic events, such as scene changes. Thus, when a video sequence includes several scenes, this characteristic will be detected during the analysis of the video sequence. In the example of the figures, the distinction between the first scene (figure 3a) and the second scene (figures 3b, 3c, 3d, 3e) is detected. For example, the detection of a characteristic event, such as a change of scene, is carried out by comparing each image of the video sequence with the previous image. The comparison is performed on the basis of the color characteristics of the analyzed images, in order to detect any significant change between a given image and the following image. Thus, the frequency of occurrences of characteristic events such as scene changes is measured, which makes it possible to calculate a tempo value Tv of the video sequence, preferably expressed in beats per minute (bpm).
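Purely as an illustration, the derivation of a tempo value Tv from the times of the detected characteristic events could be sketched in Python as follows. The function names, the search range of 60 to 180 bpm, the step of 0.5 bpm and the choice of anchoring the beat grid on the first event are assumptions made for this sketch and are not specified in the description. Each candidate tempo is scored by the average distance, measured in fractions of a beat, between the events and the nearest beat of the grid; among equally good candidates the lowest tempo is retained.

```python
def grid_misfit(times, bpm, origin):
    """Average distance (in fractions of a beat) from each event to the nearest beat."""
    period = 60.0 / bpm
    total = 0.0
    for t in times:
        phase = ((t - origin) / period) % 1.0
        total += min(phase, 1.0 - phase)
    return total / len(times)


def estimate_video_tempo(change_times, bpm_min=60.0, bpm_max=180.0, step=0.5):
    """Return the candidate tempo (bpm) whose beat grid best matches the events."""
    if len(change_times) < 2:
        raise ValueError("at least two characteristic events are needed")
    origin = change_times[0]  # assumption: the first event falls on a beat
    best_bpm, best_misfit = None, float("inf")
    bpm = bpm_min
    while bpm <= bpm_max + 1e-9:
        misfit = grid_misfit(change_times, bpm, origin)
        # Strict improvement only, so the lowest tempo wins among ties
        # (a tempo and its multiples fit the same events equally well).
        if misfit < best_misfit - 1e-9:
            best_bpm, best_misfit = bpm, misfit
        bpm += step
    return best_bpm


# Hypothetical scene-change times in seconds.
print(estimate_video_tempo([2.0, 4.0, 7.0, 8.0, 12.0]))
```

This sketch does not distinguish downbeats from other beats; placing scene changes on downbeats would additionally require an assumption about the number of beats per measure.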

As mentioned above, a characteristic event such as a change of scene can advantageously be detected by comparing each image constituting the video with the preceding image. Since a video sequence comprises a significant number of images per second (typically 24 to 30 images per second), the change between two immediately successive images is normally small, unless a change of scene occurs. Preferably, a dominant color is determined for each image in each of several reference zones of the image, and a change in dominant color within one or more reference zones between one image and the next is detected. For each reference zone of each image, a color is determined as dominant if it corresponds, for example, to the majority color within the reference zone. When, from a given image to the following image, it is detected that the dominant color changes significantly in a large or majority proportion of the reference zones, it is determined that this change corresponds to a change of scene. Preferably, the set of reference zones completely covers each analyzed image. To this end, each image is subdivided into a plurality of squares, each square forming a reference zone of the image. In the example, each image is divided into squares of n pixels on a side, n being in particular less than 100, preferably less than 50 and for example equal to 16. When all the scene changes of the video have been detected, it is possible to calculate a tempo value Tv of the video. For this purpose, knowing the time value of each scene change, it is possible to determine a tempo grid which corresponds as closely as possible to the scene changes, i.e. to determine one or more tempo values of the video such that scene changes occur on a beat, preferably on a downbeat.
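The comparison of dominant colors per square reference zone described above could, for example, be sketched in Python as follows. This is an illustrative sketch only: images are represented as lists of rows of (r, g, b) tuples, a "significant" color change is approximated by a change of coarse color bin (parameter `levels`), and a scene change is declared when more than half of the zones change (parameter `min_changed`). These representations, names and thresholds are assumptions of the sketch; only the zone size of 16 pixels comes from the example given in the description.

```python
from collections import Counter


def quantize(pixel, levels=4):
    """Map an (r, g, b) pixel to a coarse color bin (assumption of this sketch)."""
    step = 256 // levels
    return tuple(channel // step for channel in pixel)


def dominant_colors(image, block=16, levels=4):
    """Return the majority color bin of each square reference zone, row by row."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            counts = Counter(
                quantize(image[y][x], levels)
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            )
            result.append(counts.most_common(1)[0][0])
    return result


def is_scene_change(previous, current, block=16, levels=4, min_changed=0.5):
    """True if the dominant color changes in more than `min_changed` of the zones."""
    before = dominant_colors(previous, block, levels)
    after = dominant_colors(current, block, levels)
    changed = sum(1 for a, b in zip(before, after) if a != b)
    return changed / len(before) > min_changed


def scene_change_times(frames, fps, **options):
    """Return the times (in seconds) of the frames that start a new scene."""
    return [
        index / fps
        for index in range(1, len(frames))
        if is_scene_change(frames[index - 1], frames[index], **options)
    ]
```

The list of times returned by `scene_change_times` is the kind of input from which a tempo grid for the video can then be determined.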

Advantageously, a second characteristic datum generated during the analysis step 20 relates to the colors present in the video. For example, the dominant color or colors are determined within one or more zones of each image constituting the video, and characteristic data of the video linked to these dominant colors is generated (for example, light or dark colors, cool or warm colors, etc.).

Advantageously, a third characteristic datum generated during the analysis step 20 relates to a musical key of the video. In the example, to assign a musical key to an analyzed video, a correspondence table between shades of color on the one hand and musical keys on the other hand is used. Figure 5 shows an example of such a correspondence table, in the form of a chromatic circle 5 in which each shade of color, corresponding to one of the sectors I to XII of the chromatic circle, is uniquely associated with a musical key. In the example, the associations are as follows:

- the color magenta (sector I) is associated with the key of C/Do;

- the color red (sector II) is associated with the key of G/Sol;

- the color orange (sector III) is associated with the key of D/Ré;

- the color orange-yellow (sector IV) is associated with the key of A/La;

- the color yellow (sector V) is associated with the key of E/Mi;

- the color yellow-green (sector VI) is associated with the key of B/Si;

- the color green (sector VII) is associated with the key of G flat/Sol bémol;

- the color blue-green (sector VIII) is associated with the key of D flat/Ré bémol;

- the color cyan (sector IX) is associated with the key of A flat/La bémol;

- the color blue-violet (sector X) is associated with the key of E flat/Mi bémol;

- the color purple (sector XI) is associated with the key of B flat/Si bémol;

- the color purple-red (sector XII) is associated with the key of F/Fa.
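It can be noted that the keys listed above follow the circle of fifths: each sector is a perfect fifth (7 semitones) above the previous one. As an illustrative sketch, the correspondence table could therefore be built in Python as follows. The names `PITCH_CLASSES`, `SECTOR_COLORS`, `key_for_sector` and `COLOR_TO_KEY` are hypothetical, and the use of this arithmetic rule is an observation made for the example rather than a rule stated in the description.

```python
# The twelve pitch classes in ascending semitone order, spelled with flats.
PITCH_CLASSES = [
    "C", "D flat", "D", "E flat", "E", "F",
    "G flat", "G", "A flat", "A", "B flat", "B",
]

# Shades of color for sectors I to XII, in the order given in the list above.
SECTOR_COLORS = [
    "magenta", "red", "orange", "orange-yellow", "yellow", "yellow-green",
    "green", "blue-green", "cyan", "blue-violet", "purple", "purple-red",
]


def key_for_sector(sector):
    """Return the key for a sector numbered 1 to 12 (7 semitones per sector)."""
    if not 1 <= sector <= 12:
        raise ValueError("sector must be between 1 and 12")
    return PITCH_CLASSES[(7 * (sector - 1)) % 12]


# Lookup table from shade of color to musical key.
COLOR_TO_KEY = {
    color: key_for_sector(index + 1) for index, color in enumerate(SECTOR_COLORS)
}

print(COLOR_TO_KEY["cyan"])
```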

Advantageously, a fourth characteristic datum generated during the analysis step 20 relates to the light of each image constituting the video. For example, in the case of the image shown in Figure 3a, the differences in brightness between sunlit and shaded areas are measured when analyzing the video. The result of this analysis may be used to determine one or more compatibility sub-scores, as detailed below.

Advantageously, a fifth characteristic datum generated during the analysis step 20 relates to the movement of objects within the images constituting the video, and in particular to the speed of movement of these objects. For example, in figures 3a to 3e, the analysis of the video makes it possible to detect the displacement of an object (such as the vehicle 30 or the geyser 42), and to determine the speed of this displacement. The result of this analysis can be used to calculate a Tv tempo value for the video, as detailed below. The result of this analysis can also be used to determine one or more compatibility subscores, as detailed below.

In order to guarantee the best performance of the system and of the method in accordance with the invention, it is preferable to have a database of audio files whose content has been analyzed beforehand, in order to generate characteristic data facilitating the subsequent association of the audio samples during the analysis step of a video. The purpose of the stage of preliminary analysis of an audio sample is to generate data characteristic of this audio sample, these characteristic data allowing later to determine if this sample can be associated with a video being analyzed.

A first characteristic datum generated is a tempo value Ta of the audio sample, preferably expressed in beats per minute (bpm). It is essential that the tempo value Ta calculated for an audio sample during the preliminary analysis be as accurate as possible. If the audio sample considered is a piece of music, the tempo value given by known algorithms (for example by music analysis software) will not be precise enough. Indeed, conventional algorithms analyze a piece of music to detect strong beats (by analyzing in particular the variations in energy produced). However, such an analysis is by nature imprecise because the energy peak generated by a musical instrument during the production of a musical note strongly depends on the timbre of the instrument. Thus, when several instruments are played together so as to produce a note simultaneously, the energy peaks produced by each of the instruments will not coincide if they are measured with a high level of precision (for example at the level of a millisecond). In order to calculate a tempo value which is sufficiently precise to allow the implementation of the method which is the subject of the invention, a most probable tempo value is determined by implementing an iterative determination step. For example, an initial value equal to the value given by a conventional algorithm is taken, and it is then checked whether this value corresponds, with the desired precision, to the tempo value of the audio sample, by detecting any shifts between this theoretical value and the beats detected in the audio sample.

Since the tempo value can vary during an audio sample, a tempo value grid is produced, such as that represented in FIG. 4. FIG. 4 shows the result of the analysis of the tempo of an audio sample carried out in accordance with the invention, in the form of a succession of points representing the tempo value calculated throughout the audio sample. It is thus observed that the tempo of the piece of music to which the audio sample corresponds presents a certain number of irregularities, these irregularities being visible due to the precision of measurement of the tempo value. Preferably, the tempo value is determined with a precision of less than +/-0.1 bpm and preferably of the order of +/-0.01 bpm.
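The iterative determination is not specified in further detail. One possible reading, given here purely as a hedged sketch, is to start from the initial tempo value, assign each detected beat to the nearest beat of the theoretical grid, re-estimate the beat period by least squares from the observed shifts, and repeat until the value stabilizes. The function name `refine_tempo`, the least-squares update and the stopping threshold are assumptions of this sketch, which also assumes a constant tempo over the analyzed window and an initial value close enough to the true tempo for the beat indices to be assigned correctly.

```python
def refine_tempo(beat_times, initial_bpm, max_iterations=20, tolerance_bpm=0.001):
    """Refine a tempo estimate (bpm) from detected beat times (seconds).

    Each pass assigns every detected beat to the nearest beat index of the
    current theoretical grid, then re-estimates the beat period by least
    squares so as to minimize the shifts between grid and detected beats.
    """
    if len(beat_times) < 2:
        raise ValueError("at least two detected beats are needed")
    origin = beat_times[0]  # assumption: the grid is anchored on the first beat
    offsets = [t - origin for t in beat_times[1:]]
    bpm = initial_bpm
    for _ in range(max_iterations):
        period = 60.0 / bpm
        indices = [round(offset / period) for offset in offsets]
        denominator = sum(k * k for k in indices)
        if denominator == 0:
            raise ValueError("initial tempo is too low for the detected beats")
        new_period = sum(k * o for k, o in zip(indices, offsets)) / denominator
        new_bpm = 60.0 / new_period
        if abs(new_bpm - bpm) < tolerance_bpm:
            return new_bpm
        bpm = new_bpm
    return bpm


# Hypothetical detected beats at a true tempo of 123.45 bpm.
beats = [0.5 + k * (60.0 / 123.45) for k in range(64)]
print(round(refine_tempo(beats, initial_bpm=123.0), 2))
```

To obtain a tempo value grid such as that of FIG. 4, a refinement of this kind could be applied to successive windows of the audio sample rather than to the sample as a whole.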

A second characteristic datum generated relates to the rhythmic structure of the audio sample, and more particularly to the structure of the measures within the audio sample. The generation of this characteristic datum makes it possible to know the number of beats constituting a measure (for example 2 beats, 3 beats, 4 beats, etc.). A musical analysis algorithm is used for this (for example an algorithm of the “MIR” type, for “music information retrieval”). The analysis is in particular based on the principle that a measure comprising more than one beat necessarily includes one or more strong beats and one or more weak beats, and that the first beat of a measure is necessarily a strong beat. The analysis aiming to detect the number of beats of a measure within a piece of music is complex, and known algorithms are generally only moderately reliable. In order to improve the reliability of the detection, this step is preferably implemented using at least three different algorithms. Thus, if the results provided by the algorithms differ, the value retained will be the one given by the majority of the algorithms. For example, if two of the algorithms used give an identical result and the third algorithm gives a different result, then the value retained will be that given by the first two algorithms. Of course, if the three algorithms give an identical result, it is this result which will be retained.
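The majority rule described above could be sketched in Python as follows. The function name `majority_beats_per_measure` is hypothetical, and the handling of the case where no value obtains a strict majority (here, returning `None`) is an assumption, since the description does not say what happens when all the algorithms disagree.

```python
from collections import Counter


def majority_beats_per_measure(estimates):
    """Return the value given by a strict majority of the algorithms, else None."""
    if not estimates:
        raise ValueError("at least one estimate is needed")
    value, votes = Counter(estimates).most_common(1)[0]
    if votes * 2 <= len(estimates):
        return None  # assumption: no strict majority means no value is retained
    return value


# Hypothetical results of three different detection algorithms.
print(majority_beats_per_measure([4, 4, 3]))
```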

A third characteristic datum generated relates to the identification of the structure of the audio sample, that is to say its temporal organization. When the audio sample is a piece of music, its structure can generally be broken down into different parts, including, for example: one or more introductory parts, one or more intermediate or central parts, and one or more final parts. For example, as shown in FIG. 6, an audio sample E corresponding to a musical title could consist of an introductory part E1, a central part E2 and a final part E3 or conclusion. The central part E2 may be composed of a certain number of elements E20, E22, E24, E26, E28, or blocks, corresponding to musical subsets such as: one or more refrains, one or more verses, one or more bridges, etc. All of the parts E1, E2, E3 and, where appropriate, the elements E20-E28 making up these parts are analyzed and recognized.

The purpose of this step is to precisely determine the structure of the analyzed sample E, in order to be able, during the subsequent association and synchronization of this sample E with a video sequence V, to recombine some of the parts E1, E2, E3 and/or some of the elements E22-E28 constituting certain parts of the audio sample E, in order to modify the duration ts. This recombination makes it possible to obtain a second audio sample, or recombined audio sample ER1, of duration t2 different from that of the starting audio sample E, but which retains an analogous musical structure with at least an introductory part E1, a central part E2 and a final part E3. If it is necessary to reduce the duration of the starting sample to adapt it to the duration of the video sequence, it is possible, as in the example of FIG. 6, to keep intact the introductory part E1 and the final part E3, and to reduce the duration of the central part E2 by keeping only some of the elements (or blocks) composing it, for example by keeping only one verse and one refrain from among a set of several verses and refrains. In the example, the recombined audio sample ER1 comprises, in the central part, only the elements E20, E22 and E24. If one wishes to increase the duration of the starting sample, one can for example keep intact the introductory part E1 and the final part E3, and increase the duration of the central part E2 by duplicating all or part of the elements E20-E28 constituting it, which amounts to repeating one or more verses and/or one or more refrains and/or one or more bridges.

As shown in FIG. 6, a step 25 of adapting the duration of the audio sample may also be provided, by locally modifying the tempo value Ta. In the example of FIG. 6, the duration of the recombined sample ER1 is modified to obtain a second recombined sample ER2, whose duration ti is adjusted to that of the video sequence, by reducing the duration of the element E24, that is, by increasing the tempo value within that element only. Advantageously, when the tempo value of a part or of an element of a part must be modified, the tempo value of one or more adjacent elements is also modified, so as to smooth the modification of the tempo value and make it progressive (in particular so that it is difficult or impossible for a listener to detect). For example, when one wishes to modify the tempo value of the element E24, one also modifies the tempo value of the immediately adjacent elements, i.e. in the example the elements E22 and E3.

Thus, if one wishes to increase the tempo value of an element (here element E24), the tempo value of the element immediately preceding it (here element E22) is also increased, preferably in a lesser proportion, as is the tempo value of the element immediately following it (here element E3). The tempo value thus increases from its original value starting in the preceding element E22, reaches a local maximum within the modified element E24, then decreases through the following element E3 to return to the original value. Conversely, if one wishes to decrease the tempo value of an element (here element E24), the tempo value of the element immediately preceding it (here element E22) and that of the element immediately following it (here element E3) are also decreased, preferably in a lesser proportion. If necessary, to obtain satisfactory smoothing and progressiveness of the modification of the tempo value, it is also possible to modify the tempo value of the elements neighboring the adjacent elements. For example, to modify the tempo value of element E22, one can modify the tempo values of elements E1, E20, E24 and E3. The variation in tempo will thus increase when approaching the main object of the modification (here element E22), and will decrease when moving away from this element. Advantageously, sub-step 25 will be implemented so that the tempo variation follows a continuous or quasi-continuous curve.
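The smoothed local tempo adaptation could be illustrated by the following Python sketch. It is a simplified model under stated assumptions: each element has a single tempo factor (a factor of 1.05 meaning a tempo 5% faster, so that the element lasts its original duration divided by 1.05), the main element receives the full change and its immediate neighbors half of it (the weights 1.0 and 0.5 are hypothetical), and the amount of change is found by bisection so that the total duration matches the target. The function name `smooth_tempo_factors` and the search bounds are likewise assumptions.

```python
def smooth_tempo_factors(durations, target_index, target_total, weights=(1.0, 0.5)):
    """Return one tempo factor per element so that the total duration hits the target.

    weights[d] is the share of the tempo change applied to elements located
    d positions away from the main element (weights[0] is the main element).
    """
    influence = [0.0] * len(durations)
    for distance, weight in enumerate(weights):
        for index in (target_index - distance, target_index + distance):
            if 0 <= index < len(durations):
                influence[index] = weight

    def total(change):
        # A faster tempo (factor > 1) shortens an element proportionally.
        return sum(d / (1.0 + w * change) for d, w in zip(durations, influence))

    low, high = -0.5, 1.0  # assumption: main element between -50% and +100% tempo
    if not total(high) <= target_total <= total(low):
        raise ValueError("target duration not reachable within the tempo bounds")
    for _ in range(60):  # bisection: total() decreases as the change increases
        middle = (low + high) / 2.0
        if total(middle) > target_total:
            low = middle
        else:
            high = middle
    change = (low + high) / 2.0
    return [1.0 + w * change for w in influence]


# Hypothetical durations (seconds) for E1, E20, E22, E24, E3; E24 is the main element.
durations = [10.0, 20.0, 20.0, 20.0, 10.0]
factors = smooth_tempo_factors(durations, target_index=3, target_total=78.0)
print([round(f, 3) for f in factors])
```

A single factor per element gives a stepwise tempo profile; a continuous or quasi-continuous curve would additionally require interpolating the tempo within each element.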

The possibilities of implementing the search step 21 for audio samples compatible with a video sequence are described below in more detail. During this step, the database 18 of audio files is searched for at least one, and preferably several, audio samples having characteristics compatible with the characteristic data generated for the video sequence, including the tempo value Tv. Preferably, all the audio samples whose tempo value Ta is compatible with the tempo value Tv calculated for the video are searched for first. A tempo value Ta of an audio sample is evaluated as compatible if it is a multiple or a sub-multiple of the tempo value Tv calculated for the video sequence, or a multiple or a sub-multiple of a value close to this calculated tempo value Tv. For example, if the tempo value Tv calculated for the analyzed video sequence is equal to 120 beats per minute (bpm), then an audio sample will present a compatible tempo value if this is equal, in particular, to 60, 120, 180 or 240 bpm. Preferably, audio samples will be searched for on the basis of an equal or close tempo value. For example, if the video has a tempo value determined as equal to 119 bpm, audio samples will be searched for that have a tempo value equal to or close to 119 bpm, for example between 117 and 121 bpm, or corresponding to a multiple or a sub-multiple of values between 117 and 121 bpm. Thus, audio samples are sought having a tempo value compatible with a tempo value located in a range of +/- 5% around the tempo value Tv of the video, and preferably in a range of +/- 3% around this value. To determine which audio sample tempo value to seek preferentially among the values corresponding to the multiples and sub-multiples of the tempo value Tv of the video, it is possible to use other characteristic data, such as the speed of the objects mentioned above.
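
The compatibility test on tempo values can be sketched, purely as an example, as follows. The function name, the default tolerance of 3% and the limit of 4 on the multiplying factor are assumptions made for the illustration. The sketch only considers integer multiples and sub-multiples; a ratio such as 3/2 (180 bpm for a video at 120 bpm, as in the example above) would require an extended list of factors.

```python
def is_tempo_compatible(ta, tv, tolerance=0.03, max_factor=4):
    """Return True if the audio tempo `ta` (bpm) is within `tolerance` of the
    video tempo `tv` (bpm), or of an integer multiple or sub-multiple of `tv`
    up to `max_factor` (hypothetical limit)."""
    for k in range(1, max_factor + 1):
        for candidate in (tv * k, tv / k):
            if abs(ta - candidate) <= tolerance * candidate:
                return True
    return False


# Video tempo of 119 bpm: 121 bpm is close enough, 59 bpm is close to half.
close_match = is_tempo_compatible(121, 119)
half_match = is_tempo_compatible(59, 119)
```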

To determine which audio samples have the best compatibility, a value representative of a compatibility score is preferably determined for each sample whose tempo value Ta is compatible. Preferably, the compatibility score corresponds to the sum of at least two compatibility sub-scores, each sub-score being representative of the compatibility of the audio sample considered with the video with respect to a given criterion. Preferably, weighting coefficients are assigned to each of the sub-scores, the sub-score corresponding to the tempo value being associated with the highest weighting coefficient.

In the example, the compatibility score is calculated based on the following sub-scores, ranked in order of preferred importance:

- a first sub-score, representing the compatibility of the audio sample with respect to the tempo value assigned to the video;

- a second sub-score, representative of the compatibility of the audio sample with respect to the colors present in the video;

- a third sub-score, representing the compatibility of the audio sample with respect to a musical key assigned to the video;

- a fourth sub-score, representative of the compatibility of the audio sample with respect to a musical genre assigned to the video (for example: classical, jazz, rock, etc.), the musical genre being determined for example based on the rhythm of scene changes and/or the speed of movement of objects in the video;

- a fifth sub-score, representative of the compatibility of the audio sample with respect to a type of atmosphere assigned to the video (for example: suspense, sad, funny, etc.), determined for example on the basis of the dominant colors present in the video (dark or light, cold or warm colors, etc.);

- a sixth sub-score, representative of the compatibility of the audio sample with respect to a musical sub-genre (for example baroque or romantic for classical music, cool jazz or be-bop jazz for jazz, etc.).

When several audio samples compatible with the video sequence have been found, the audio sample having the best compatibility is selected, as described above, that is to say the audio sample E having the best compatibility score.
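
The weighted compatibility score and the selection of the best sample can be sketched as follows. The numerical weights, the sub-score names and the assumption that each sub-score lies between 0 and 1 are hypothetical; the description only specifies that the tempo sub-score carries the highest weighting coefficient.

```python
# Hypothetical weights, in the order of importance listed above;
# the tempo sub-score carries the highest weight.
WEIGHTS = {
    "tempo": 0.35,
    "colors": 0.20,
    "key": 0.15,
    "genre": 0.12,
    "atmosphere": 0.10,
    "sub_genre": 0.08,
}


def compatibility_score(sub_scores):
    """Weighted sum of sub-scores, each assumed to lie in [0, 1].
    A missing sub-score counts as 0."""
    return sum(WEIGHTS[name] * sub_scores.get(name, 0.0) for name in WEIGHTS)


def select_best(candidates):
    """candidates: dict mapping a sample identifier to its sub-scores.
    Returns the identifier with the highest compatibility score."""
    return max(candidates, key=lambda name: compatibility_score(candidates[name]))


candidates = {
    "sample_a": {"tempo": 1.0, "colors": 0.2},
    "sample_b": {"tempo": 0.5, "colors": 1.0, "key": 1.0},
}
best = select_best(candidates)
```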

The synchronization step 23 is then implemented, during which the audio sample is synchronized with the video sequence. If necessary, during the synchronization step, the retained audio sample E can be modified, in particular to increase or decrease the duration of certain parts, respectively by locally decreasing or increasing the tempo value. This modification can be useful in order to match highlights of the audio sample very precisely to “highlights” of the video, such as a change of scene. If the duration of the audio sample must be adapted, for example because the duration of the video is significantly less than the duration of the audio sample selected as being the best candidate, a step 24 for modifying the duration of the audio sample is performed. As described above, this step can be performed by recombining parts and/or elements of parts of the audio sample, either to shorten or to lengthen the original audio sample.
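
As a non-limiting sketch of the recombination of step 24, the following example keeps the introductory and final parts intact and fills the remaining time with blocks of the central part, taken in order and repeated from the first block when the sample must be lengthened. This greedy strategy, the function name and the representation of a block as a (label, duration) pair are assumptions made for the illustration.

```python
def recombine(intro, central, outro, target):
    """Return the list of (label, duration) blocks of the recombined sample.

    intro, outro: (label, duration) pairs, kept intact.
    central: list of (label, duration) blocks of the central part.
    target: desired total duration in seconds.
    """
    if any(duration <= 0 for _, duration in central):
        raise ValueError("block durations must be positive")
    budget = target - intro[1] - outro[1]
    kept, used, i = [], 0.0, 0
    # Walk the central blocks in order, cycling back to the first block when
    # the sample must be lengthened, until the next block would not fit.
    while central and used + central[i % len(central)][1] <= budget:
        block = central[i % len(central)]
        kept.append(block)
        used += block[1]
        i += 1
    return [intro] + kept + [outro]


central = [("E20", 20), ("E22", 20), ("E24", 20), ("E26", 20), ("E28", 20)]
shorter = recombine(("E1", 10), central, ("E3", 10), target=85)
longer = recombine(("E1", 10), central, ("E3", 10), target=150)
```

In this sketch the recombined sample never exceeds the target duration; any remaining difference would be left to the local tempo adaptation of step 25.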

The method according to the invention comprises, after the synchronization step 23, a step 26 for generating a video file containing the video sequence synchronized with the audio sample E. The generated file can be transferred from the server 12 to the terminal 10, in order to be read and/or downloaded by the user. The method and the system in accordance with the invention are suitable for all types of audio and video file formats, and in particular for the following formats:

- video file formats: mp4, avi, mpeg, mov, m4v, mkv, wmv, webm, etc.;

- audio file formats: m4a, mp3, wav, flac, aiff, etc.

The implementation of the method according to the invention has been described above with a video sequence comprising no sound. Of course, if the initial video sequence is associated with a soundtrack, the synchronization step will be preceded by a preliminary step of removing the soundtrack.

Claims

1. Method of synchronizing an audio sample with a sequence of moving images, or video sequence, the method comprising the steps of:

- analyzing (20) the video sequence to generate characteristic data, at least one of the characteristic data generated being representative of a tempo value (Tv) calculated for the video sequence;

- searching (21), in a database of audio files, for one or more audio files containing an audio sample having one or more characteristics compatible with the characteristic data generated from the video sequence, at least one of the compatible characteristics being a tempo value (Ta) of the audio sample considered;

- selecting (22), from among the audio samples found, the audio sample which has the best compatibility with the video sequence;

- synchronizing (23) the selected audio sample with the video sequence;

- generating (26) a video file containing the video sequence synchronized with the selected audio sample.

2. Method according to the preceding claim, in which an audio sample is compatible if it has a tempo value (Ta) equal to or close to the tempo value (Tv) of the video sequence or a tempo value equal to or close to a multiple or sub-multiple of the tempo value of the video sequence.

3. Method according to one of the preceding claims, in which the tempo value (Tv) calculated for the video sequence is determined by detecting characteristic events occurring during the video sequence, such as scene changes.

4. Method according to the preceding claim, in which the detection of a characteristic event such as a change of scene is carried out by chromatic analysis of each image of the video sequence, a change of scene being detected if a significant change in color is measured between two successive images.

5. Method according to one of the preceding claims, in which the step of selecting, from among the audio samples found, the audio sample which has the best compatibility with the video sequence includes a sub-step for calculating a compatibility score.

6. Method according to one of the preceding claims, in which the synchronization step (23) comprises a sub-step of modifying the duration (24) of the audio sample (E) to adapt it to the duration of the video sequence.

7. Method according to the preceding claim, in which the duration modification sub-step (24) is carried out by recombining one or more parts (E1, E2, E3) of the audio sample (E) and/or one or more blocks (E20, E22, E24, E26, E28) of part of the audio sample.

8. Method according to the preceding claim, in which the recombination is carried out so that after modification of the duration, the recombined audio sample (ERI) has a structure similar to that of the initial audio sample (E), and comprises for example an introductory part (E1), followed by a central part (E2) and a final part (E3).

9. Method according to one of the preceding claims, in which the synchronization step (23) comprises a sub-step of adaptation (25) of the duration of at least a part (E2) of the audio sample, consisting of locally decreasing or increasing the tempo value (Ta).

10. Method according to the preceding claim, in which the adaptation sub-step (25) comprises modifying the tempo value of at least one block (E24).

11. Method according to the preceding claim, in which the adaptation sub-step (25) comprises the modification of the tempo value of at least one first block (E24) and of at least one block (E22, E3) adjacent to the first block (E24), and preferably includes at least the modification of the tempo value of the block (E22) immediately preceding the first block (E24) and of the block (E3) immediately following the first block.

12. Method according to the preceding claim, in which the tempo values of the neighboring blocks (E1, E20) of the adjacent blocks (E22, E3) to the first block (E24) are also modified.

13. Method according to one of claims 11 and 12, in which the modifications of the tempo values of the first block (E24) and of the adjacent blocks and of the neighboring blocks are carried out so as to obtain a continuous local variation of the tempo value (Ta).

14. Method according to one of the preceding claims, in which the analysis step (20) comprises the generation of characteristic data relating to the light and/or characteristic data relating to the colors of the images of the video sequence.

15. Method according to one of the preceding claims, in which the analysis step (20) comprises the generation of characteristic data relating to the speed of movement of objects (30) appearing in the images of the video sequence.

16. Method for generating a digital audio file containing an audio sample (E), comprising the steps of:

- calculating at least one tempo value (Ta) of the audio sample;

- generating data relating to a musical structure of the audio sample, by detecting one or more parts (E1, E2, E3) constituting the audio sample.

17. Method according to the preceding claim, in which the calculation of the tempo value (Ta) is carried out by implementing an iterative determination step.

18. Method according to one of claims 16 to 17, in which the tempo value (Ta) of the audio sample is calculated with a precision of less than +/- 0.1 beats per minute, and preferably of the order of +/- 0.01 beats per minute.

19. Method according to one of claims 16 to 18, in which the detection of the measures (bars) constituting the audio sample is carried out by detection of a rhythmic pulse.

20. Method according to one of claims 16 to 19, in which the determination of the musical structure is carried out by identifying one or more parts from among:

- an introductory part;

- a central part;

- a final part.

21. Synchronization method according to one of claims 1 to 15, in which the audio files stored in the database have previously been generated according to a method in accordance with one of claims 16 to 20.

22. Computer program product comprising instructions which, when the program is executed by one (or more) processor(s), cause it (them) to implement the steps of the synchronization method according to one of claims 1 to 15 or according to claim 21.

23. Synchronization system (1) for the implementation of a synchronization method according to one of claims 1 to 15 or according to claim 21, the synchronization system (1) comprising a terminal (10) configured to transfer a video file containing a video sequence to a server (12), the server (12) comprising a database (18) of digital audio files containing audio samples and a processor (16) for synchronizing the video sequence transferred by the terminal (10) with an audio sample contained in an audio file stored in the database (18) of the server (12).