CN115695899A - Video generation method, electronic device and medium thereof - Google Patents


Info

Publication number
CN115695899A
Authority
CN
China
Prior art keywords
audio
image
poster
matching degree
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110841263.5A
Other languages
Chinese (zh)
Inventor
徐高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Petal Cloud Technology Co Ltd
Original Assignee
Petal Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Petal Cloud Technology Co Ltd filed Critical Petal Cloud Technology Co Ltd
Priority to CN202110841263.5A priority Critical patent/CN115695899A/en
Priority to PCT/CN2022/106304 priority patent/WO2023001115A1/en
Publication of CN115695899A publication Critical patent/CN115695899A/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 - Processing of audio elementary streams
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 - End-user applications
    • H04N 21/472 - End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content

Abstract

The present application relates to the field of electronic technologies, and in particular to a video generation method, an electronic device, and a medium. In the method, an electronic device matches each poster image with suitable background music according to the image's content, so that the poster image and the background music agree in theme. The method further associates the motion range of each poster image's foreground object with the motion range of the matched audio interval, so that the audio interval finishes playing just as the foreground object finishes moving. A user watching the resulting beat-synced video of dynamic poster images therefore perceives the audio and visuals as linked, which improves the user experience.

Description

Video generation method, electronic device and medium thereof
Technical Field
The present application relates to the field of electronic technologies, and in particular to a video generation method, an electronic device, and a medium.
Background
At present, on major video application platforms, movie theaters, film distributors, and advertisers provide special-effect poster images, such as dynamic poster images and naked-eye 3D poster images, to attract users. Compared with a static poster image, whose form is fixed, such poster images are visually attractive and give users a good sense of immersion.
However, when a dynamic poster image is played, its background music is not linked to the image's theme or material, so users do not perceive the audio and visuals as connected. For example, some dynamic poster images with mild themes are paired with background music that has a strong tempo, and some dynamic poster images whose foreground objects move over a large range are paired with background music of only moderate tempo. Either mismatch feels uncomfortable and degrades the user experience.
Disclosure of Invention
In view of this, the present application provides a video generation method. In the method, the electronic device adds a theme tag to each poster image according to its theme content, and determines the motion range of the foreground object in each poster image using an image foreground segmentation algorithm and a depth estimation algorithm. Meanwhile, the device detects the time nodes at which the rhythm of the background music changes, determines the theme of each audio interval from the audio change rate value within that interval, adds an audio theme tag to the interval accordingly, and determines the interval's motion range from its duration and audio change rate value. Each audio interval is then associated with a dynamic poster whose theme tag is similar to the interval's and whose motion range is close to the interval's, and the dynamic posters are played in the playing order of the audio intervals to form a beat-synced video of dynamic poster images. This makes each dynamic poster image and the background music agree in theme; in addition, because the motion range of the foreground object is associated with the motion range of the audio interval, the audio interval finishes playing just as the foreground object finishes moving, so a user watching the beat-synced video perceives the audio and visuals as linked, which improves the user experience.
The following describes a video generation method of the present application.
In a first aspect, an embodiment of the present application provides a video generation method applicable to an electronic device. The method includes: dividing background music used for generating a video into N continuous audio intervals, where N is a positive integer; and matching N images to the corresponding audio intervals according to the matching degree between each audio interval's audio-rhythm change and the N images, to generate a video. During playback, when the video reaches an audio interval, the image matched to that interval is displayed.
It can be understood that, to implement the above beat-synced video of dynamic poster images, so that each dynamic poster image and its audio interval agree both in theme content and between the foreground object's motion range and the interval's motion range, the electronic device must first divide the background music into multiple continuous audio intervals according to the changes in its rhythm, and then match images to the corresponding audio intervals according to the matching degree between each interval's rhythm change and the images, thereby generating the beat-synced video. During playback, whenever a given audio interval is played, an image matching that interval's rhythm change is displayed.
Since the rhythm change of an audio interval can reflect not only the interval's content but also, to some extent, its motion range, in some embodiments the matching degree between each audio interval's rhythm change and an image includes a first matching degree between the rhythm change and the image's content, and/or a second matching degree between the rhythm change and the dynamic motion range of the image's foreground object. The first matching degree can be computed as the matching degree between a rhythm tag reflecting the interval's rhythm change and a content tag reflecting the image's content; the second matching degree can be computed as the matching degree between a motion range derived from the interval's rhythm change and playing duration, and the motion range of the image's foreground object.
Specifically, in some embodiments, the first matching degree may be calculated as follows: perform content recognition on the image using a content-recognition neural network model, add a content tag reflecting the image's content according to the recognized content, add a rhythm tag to each audio interval according to its rhythm change, and compute the matching degree between the content tag and the rhythm tag. In some embodiments, the second matching degree may be calculated as follows: compute the motion range of the audio interval from its rhythm change and duration, and compute the matching degree between that motion range and the dynamic motion range of the image's foreground object. Here, the rhythm tag reflecting the interval's rhythm change corresponds to the audio interval's theme tag in the embodiments below, and the content tag reflecting the image's content corresponds to the image's theme tag in the embodiments below.
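As a concrete illustration, the following Python sketch shows one way the two sub-degrees could be computed. The tag-similarity table and the ratio-based range metric are assumptions made for illustration only; the patent does not fix either metric.

```python
# Hypothetical tag-similarity table; keys are sorted (tag, tag) pairs.
TAG_SIMILARITY = {
    ("burning", "burning"): 1.0,
    ("burning", "urgent"): 0.6,
    ("comfortable", "warm"): 0.8,
}

def first_matching_degree(content_tag: str, rhythm_tag: str) -> float:
    """Matching degree between an image's content tag and an audio
    interval's rhythm tag."""
    pair = tuple(sorted((content_tag, rhythm_tag)))
    return TAG_SIMILARITY.get(pair, 0.0)

def second_matching_degree(audio_range: float, foreground_range: float) -> float:
    """Matching degree between the audio interval's motion range and the
    foreground object's motion range; closer ranges score higher."""
    if max(audio_range, foreground_range) == 0:
        return 1.0
    return min(audio_range, foreground_range) / max(audio_range, foreground_range)
```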
With reference to the first aspect and the foregoing possible implementations, in another possible implementation of the first aspect, the dynamic motion range of an image's foreground object may be calculated as follows: perform foreground segmentation on the image using a foreground-segmentation neural network model to obtain the image's foreground and background objects, and derive the foreground object's dynamic motion range from its position relative to the background. Concretely, the distance from the outermost contour of the foreground object to the image edge in each direction is the foreground object's motion range in that direction. For example, if the distance from the leftmost side of the foreground object to the left edge of the image is 2 pixels, from the rightmost side to the right edge is 3 pixels, from the uppermost side to the upper edge is 2 pixels, and from the lowermost side to the lower edge is 1 pixel, then the foreground object's dynamic motion range is: 2 pixels leftward, 3 pixels rightward, 2 pixels upward, and 1 pixel downward. Further description is provided in the detailed embodiments below and is not repeated here.
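The pixel-distance rule above is mechanical enough to state directly in code. A minimal NumPy sketch, assuming the foreground-segmentation model has already produced a boolean mask:

```python
import numpy as np

def foreground_motion_range(mask: np.ndarray) -> dict:
    """Per-direction motion range (in pixels) of the foreground object.

    mask: H x W boolean array, True where the (assumed) segmentation
    model labels a pixel as foreground.
    """
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    return {
        "left": int(xs.min()),           # leftmost contour -> left edge
        "right": int(w - 1 - xs.max()),  # rightmost contour -> right edge
        "up": int(ys.min()),             # topmost contour -> top edge
        "down": int(h - 1 - ys.max()),   # bottommost contour -> bottom edge
    }

# Example matching the text: a 5x6 image whose foreground occupies
# rows 2..3 of column 2.
mask = np.zeros((5, 6), dtype=bool)
mask[2:4, 2] = True
print(foreground_motion_range(mask))  # {'left': 2, 'right': 3, 'up': 2, 'down': 1}
```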
In conjunction with the above, in some embodiments, matching a suitable image to each audio interval according to its rhythm change may proceed as follows. 1) First allocate the N images to the N continuous audio intervals in a first order; then check whether the first and second matching degrees between each audio interval and its image satisfy the requirements; if they do, check whether the sum of the first and second matching degrees over all audio intervals and images satisfies the requirement; and if it does, match the N images to the N continuous audio intervals in the first order to generate the video.
Specifically, allocate the N images to the N continuous audio intervals in a first order, and compute the first and second matching degrees between each audio interval and its allocated image as described above. If all first matching degrees exceed a first matching-degree threshold and all second matching degrees exceed a second matching-degree threshold, compute the sum of all first and second matching degrees; if that sum exceeds a total matching-degree threshold, match the N images to the corresponding audio intervals in the first order. For example, if the overall style of the background music is "burning" and the overall style of the images is also "burning", the images and music are easy to match and the first threshold can be set higher; if the music's overall style is "urgent" while the images' is "burning", the two are harder to match and the first threshold can be set lower. The second threshold is set on the same principle and is not described again here. Note, however, that neither threshold should be too low: if the thresholds must be set very low before the N images can be matched to the N audio intervals, the background music and the N images are not suitable for generating a video together, and the background music should be replaced.
Further, it may happen that an audio interval's first and second matching degrees each satisfy their conditions while their sum does not. For example, suppose the first matching degree between audio interval 1 and image 1 is 1.1 and the second is also 1.1, with both thresholds set to 1; each degree satisfies its requirement, but their sum is only 2.2, indicating that audio interval 1 and image 1 are still not particularly well matched. To avoid this, in some embodiments it is further checked whether the sum of the first and second matching degrees of each audio interval and its image satisfies a condition. Specifically, when all first matching degrees exceed the first threshold and all second matching degrees exceed the second threshold, the sum of the first and second matching degrees between each audio interval and its allocated image is computed first, and only if every such per-interval sum exceeds a third matching-degree threshold is the overall sum of all first and second matching degrees computed. The third threshold corresponds to the first preset value in the detailed embodiments below; its setting is described there and not repeated here.
2) Since the method in 1) requires a large amount of computation, which is unfavorable for saving the electronic device's power, in some embodiments the device may, after allocating the N images to the N continuous audio intervals in a first order, directly check whether the sum of the first and second matching degrees over all audio intervals and images satisfies the requirement, and if so, match the N images to the N continuous audio intervals in the first order to generate the video. This saves power while still, to the greatest extent, generating a beat-synced video of dynamic poster images that meets the user's viewing needs.
Specifically, in some embodiments, matching the N images to the corresponding audio intervals according to each interval's rhythm change and the matching degrees includes: allocating the N images to the N continuous audio intervals in a first order and computing the first and second matching degrees between each audio interval and its allocated image; then computing the sum of all first and second matching degrees, and, if that sum exceeds the total matching-degree threshold, matching the N images to the corresponding audio intervals in the first order. The total matching-degree threshold corresponds to the second preset value below; its setting is described in the detailed embodiments and not repeated here.
Further, the total matching degree may satisfy the requirement even though some individual interval's degrees do not. For example, in the first order the sum of all first and second matching degrees may be 4, exceeding a total threshold of 3, while the sum for one particular interval and its image is very small (say 0.1); the generated video would still look and sound discordant at that interval. Therefore, in some embodiments, before the overall sum is computed, the sum of the first and second matching degrees between each audio interval and its allocated image is computed first; only if every per-interval sum exceeds the third matching-degree threshold is the overall sum computed, and only if the overall sum exceeds the total matching-degree threshold are the N images matched to the corresponding audio intervals in the first order to generate the video.
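Both matching strategies and their refinements reduce to a handful of threshold checks over the per-interval degree pairs. A sketch combining them follows; the function and all threshold values are illustrative assumptions, not the patent's prescribed procedure.

```python
def assignment_acceptable(pairs, t_total, t1=None, t2=None, t3=None):
    """pairs: (first_degree, second_degree) per audio interval for one
    candidate order. t1/t2 enable variant 1)'s per-pair checks and t3
    the per-interval-sum refinement; leaving them None gives the
    cheaper variant 2). All threshold values are configuration choices.
    """
    if t1 is not None and any(m1 <= t1 for m1, _ in pairs):
        return False
    if t2 is not None and any(m2 <= t2 for _, m2 in pairs):
        return False
    if t3 is not None and any(m1 + m2 <= t3 for m1, m2 in pairs):
        return False
    return sum(m1 + m2 for m1, m2 in pairs) > t_total

# The 1.1/1.1 example above: each degree clears its threshold of 1,
# but the per-interval sum 2.2 fails an (assumed) third threshold of 2.3.
pairs = [(1.1, 1.1), (1.3, 1.2)]
print(assignment_acceptable(pairs, t_total=4.0, t1=1.0, t2=1.0, t3=2.3))  # False
```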
With reference to the first aspect and the foregoing possible implementations, in another possible implementation of the first aspect, dividing the background music used for generating a video into N continuous audio intervals includes: dividing the background music into N continuous audio intervals according to its rhythm changes. It can be understood that each rhythm change of the background music corresponds to a time node, i.e., an audio node, and an audio interval lies between adjacent audio nodes. After the continuous audio intervals are obtained, a suitable image is matched to each interval according to the interval's style or theme content and its rhythm change, so as to generate the beat-synced video of dynamic poster images.
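A crude, self-contained sketch of this division is shown below. The patent does not specify the rhythm detector, so a short-time-energy jump is used as a stand-in; frame, hop, and k are illustrative parameters.

```python
import numpy as np

def split_on_tempo_changes(samples: np.ndarray, sr: int,
                           frame: int = 2048, hop: int = 1024,
                           k: float = 2.5):
    """Divide audio into continuous intervals at points where the
    short-time energy jumps sharply (a crude stand-in for the
    rhythm-change detection the patent leaves unspecified)."""
    energy = np.array([np.sum(samples[i:i + frame] ** 2)
                       for i in range(0, len(samples) - frame, hop)])
    jumps = np.abs(np.diff(energy))
    nodes = (np.nonzero(jumps > k * jumps.mean())[0] + 1) * hop  # audio nodes
    bounds = [0, *np.unique(nodes).tolist(), len(samples)]
    return list(zip(bounds[:-1], bounds[1:]))  # (start, end) sample pairs

# Two seconds of quiet tone followed by two seconds of loud tone: the
# detector should place a node near the loudness boundary (sample 32000).
sr = 16000
t = np.arange(4 * sr) / sr
wave = np.where(t < 2, 0.1, 0.9) * np.sin(2 * np.pi * 220 * t)
print(split_on_tempo_changes(wave, sr))
```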
In a second aspect, an embodiment of the present application further provides an electronic device, including a memory storing computer program instructions and a processor coupled to the memory. When the computer program instructions are executed by the processor, the electronic device is caused to implement the video generation method of any implementation of the first aspect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements the video generation method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer program product, which when run on an electronic device, causes the electronic device to execute the video generation method according to any one of the above first aspects.
It is understood that the beneficial effects of the second to fourth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario diagram of an example video generation method according to some embodiments;
FIG. 2 is a schematic illustration of an example dynamic poster display provided in some embodiments;
FIG. 3 is a schematic illustration of yet another example dynamic poster display provided by some embodiments;
FIG. 4 is a flow chart illustrating an exemplary method for generating a video according to some embodiments;
FIG. 5 is a schematic illustration of an example of calculating a foreground range of motion of a poster according to some embodiments;
FIG. 6 is a schematic diagram of an example of detecting time nodes of tempo changes in audio, according to some embodiments;
Fig. 7 is a schematic diagram of adjusting the play order of posters using the video generation method of the present application, according to some embodiments;
fig. 8 is a schematic diagram of a hardware structure of an example of a smart television according to some embodiments;
fig. 9 is a schematic diagram of a software structure of an example of a smart television according to some embodiments;
FIG. 10 is a flowchart illustrating an exemplary method for generating a video according to some embodiments;
fig. 11 is a schematic structural diagram of an example server according to some embodiments.
Detailed Description
Various aspects of the illustrative embodiments will be described using terms commonly employed by those skilled in the art. It is to be understood that the illustrative embodiments of the present application include, but are not limited to, a method of generating a video, and an electronic device, storage medium, and the like.
Fig. 1 provides an application scenario diagram of a dynamic poster according to an embodiment of the present application.
As shown in fig. 1, the display interface of the smart tv 100 is divided into a dynamic poster image display area 110 and a movie introduction area 120. The dynamic poster image display area 110 presents video information, such as recently released movies or TV series, to the user in the form of dynamic poster images, to attract the user to click and watch a video. The movie introduction area 120 lists various movies in the layout shown in fig. 1, so that the user can pick a video to watch from the introductions, in combination with his or her preferences. A dynamic poster image is a video in which the foreground object of a poster image moves: the electronic device performs depth estimation on the poster image to obtain a depth-estimation map and then, according to that map, converts the depth value and color (RGB) value of each pixel following a specified motion mode (motion range and duration). For example, if the foreground object moves to the left, the pixel values in the area the foreground object passes through are converted to values consistent with the foreground pixels' depth and color; when this conversion is rendered continuously, the user visually perceives the foreground object as moving left, i.e., a dynamic poster image is formed.
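A toy sketch of this per-pixel conversion for a single frame of leftward motion follows. It assumes the depth-estimation stage already yields a boolean foreground mask; the swept-through region keeps foreground-consistent values, matching the description above.

```python
import numpy as np

def shift_foreground_left(rgb: np.ndarray, fg_mask: np.ndarray,
                          shift: int) -> np.ndarray:
    """Render one frame of leftward foreground motion: every foreground
    pixel is redrawn `shift` columns to the left, so the region the
    object sweeps through takes on foreground-consistent colors.
    rgb is H x W x 3; fg_mask is an H x W boolean array (assumed to
    come from the depth-estimation / segmentation stage)."""
    out = rgb.copy()
    ys, xs = np.nonzero(fg_mask)
    keep = xs >= shift                 # discard pixels that leave the frame
    out[ys[keep], xs[keep] - shift] = rgb[ys[keep], xs[keep]]
    return out

# Rendering frames with shift = 1, 2, 3, ... in sequence produces the
# perceived continuous leftward motion of the foreground object.
```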
It should be understood that the dynamic poster image of the present application can be formed by the depth estimation described above, or by other methods, such as processing the poster image with view synthesis or lighting simulation (light field synthesis) to simulate changes in the poster image's content; the present application does not limit the manner in which the dynamic poster image is formed.
It should also be understood that the video generation method of the present application may be applied to various electronic devices, for example, a tablet computer, a mobile phone, a server, a portable game machine, a portable music player, a reader device, or other electronic devices capable of accessing a network. The present application does not limit the types of electronic devices to which the video generation method can be applied. For convenience of description, the following description will proceed with the example of the smart tv 100.
Background music for the dynamic poster images currently provided by major video application platforms is unrelated to the posters' themes and to the motion modes of their foreground objects, so users do not perceive the audio and visuals as linked. For example, consider the movie 1 and movie 2 dynamic poster images shown in fig. 2, both of which use audio interval 1 of background music 1 as their background music. Suppose the theme of movie 1 is "happy", the theme of movie 2 is "burning", and the theme or genre of audio interval 1 is "urgent": a user watching a video composed of these two dynamic poster images finds the background music and the images very discordant. Moreover, the foreground object 101 of the movie 1 poster moves gently (rotating around the center), and the foreground object 201 of the movie 2 poster also moves gently (along the right diagonal of the poster), while the rhythm of audio interval 1 is intense; this contrast again makes the picture and the background music inconsistent.
To improve how dynamic poster images are played and give users a better audio-visual experience, some embodiments of the application provide a video generation method. The method detects the audio nodes at which the audio rhythm changes in the background music and divides the music into several audio intervals; it divides a video into the same number of video segments; it matches different video segments to different audio intervals and adjusts the playing durations so that each audio interval and its video segment are equally long; and it then plays the video segments in the playing order of the audio intervals to generate the beat-synced video. For example, taking background music 1 and video 1: if 5 audio nodes are detected in background music 1, it is divided into 6 audio intervals, video 1 is likewise divided into 6 video segments, and each segment has the same playing duration as its matched interval. The correspondence between the audio intervals of background music 1 and the video segments of video 1 is shown in Table 1 below:
TABLE 1
[Table 1 is rendered as an image in the original; it pairs the six audio intervals of background music 1 with the six segments of video 1, each pair sharing the same playing duration.]
As can be seen from Table 1, although this approach associates the video with the background music to some extent (each audio interval corresponds to one video segment, and their playing durations are equal), it can only edit existing picture or video material, and guaranteeing equal durations makes it difficult to simultaneously keep the theme of each audio interval aligned with the theme of each video segment. For example, suppose the theme of video segment 1 is "burning" with a playing duration of 3 seconds, and the theme of video segment 2 is "urgent" with a playing duration of 7 seconds; audio interval 1 is originally 5 seconds with theme "burning", and audio interval 2 is originally 5 seconds with theme "urgent". To equalize durations, the playing duration of audio interval 1 (or of video segment 1) must be adjusted; in the end audio interval 1 becomes 3 seconds and audio interval 2 becomes 7 seconds. But then the theme of audio interval 2 becomes "burning" for its first 2 seconds and "urgent" for its last 5 seconds, so when the "urgent" video segment 2 plays over audio interval 2, the picture and the background music still clash during those first 2 seconds.
To solve the above technical problem, other embodiments of the present application provide a video generation method. Specifically, in this method, the smart tv 100 adds a theme tag to each of the M poster images according to the theme content each poster image expresses, and determines the motion range of the foreground object in each poster image using an image foreground segmentation algorithm, a depth estimation algorithm, and the like.
Meanwhile, the smart tv 100 detects N audio nodes at which the tempo changes in at least one piece of music X in the music library, divides music X into (N + 1) continuous audio intervals according to those nodes, determines the theme and the corresponding theme tag of each audio interval from the audio change rate value within it, and determines each interval's motion range from its playing duration and audio change rate value.
Then, the smart tv 100 computes the matching degree between each dynamic poster image and each audio interval of music X from the theme tags and motion ranges, obtaining M × N image-interval matching degrees; it computes the total matching degree of each combination of dynamic poster images with audio intervals; and it selects, from the multiple combinations, the one whose total matching degree is highest and above a preset value, yielding the beat-synced video of dynamic posters. For example, suppose the matching degrees between dynamic poster image 1 (hereinafter "image 1"), dynamic poster image 2 (hereinafter "image 2"), audio interval 1 of music 1 (hereinafter "interval 1"), and audio interval 2 of music 1 (hereinafter "interval 2") are as shown in Table 2 below:
TABLE 2
             Image 1    Image 2
Interval 1   1.25       1.20
Interval 2   1.22       1.24
As can be seen from Table 2, the matching degree between image 1 and interval 1 is 1.25, between image 2 and interval 1 is 1.20, between image 1 and interval 2 is 1.22, and between image 2 and interval 2 is 1.24. The total matching degree of each of the two combinations of the dynamic poster images with the audio intervals of music 1 is then as shown in Table 3 below:
TABLE 3
Combination                                               Total matching degree
Combination 1 {interval 1-image 1, interval 2-image 2}    2.49
Combination 2 {interval 1-image 2, interval 2-image 1}    2.42
As can be seen from Table 3, the total matching degree of combination 1 {interval 1-image 1, interval 2-image 2} is 2.49 and that of combination 2 {interval 1-image 2, interval 2-image 1} is 2.42; combination 1 is the highest and is assumed to exceed the preset value. The smart tv 100 therefore follows combination 1, playing dynamic poster image 1 during audio interval 1 of music 1 and dynamic poster image 2 during audio interval 2 of music 1, generating the beat-synced video of dynamic poster images. The calculation of the matching degree between a dynamic poster image and an audio interval, the combination of images with intervals, and so on are described in detail below and are not repeated here.
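A brute-force sketch of this selection, run on the Table 2 numbers; the preset value is an assumed configuration, since the patent leaves it open.

```python
from itertools import permutations

def best_combination(match, preset):
    """match[i][j]: matching degree of dynamic poster image j with audio
    interval i. Scores every one-to-one combination exhaustively and
    returns (assignment, total) for the highest total matching degree,
    or None if even the best total is below the preset value."""
    n = len(match)
    assignment, total = max(
        ((list(p), sum(match[i][p[i]] for i in range(n)))
         for p in permutations(range(n))),
        key=lambda cand: cand[1],
    )
    return (assignment, total) if total >= preset else None

# Table 2's matching degrees (rows: intervals 1-2, columns: images 1-2).
match = [[1.25, 1.20],
         [1.22, 1.24]]
result = best_combination(match, preset=2.0)
print(result[0], round(result[1], 2))  # [0, 1] 2.49 -> combination 1
```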
It can be understood that, in the above method, the total matching degrees of all combinations of the M dynamic poster images with the audio intervals of music X may all be smaller than the preset value; that is, music X does not satisfy the condition for generating a beat-synced video containing the M dynamic poster images. In this case, the smart tv 100 may replace music X with music Y and, in the manner above, redetermine the combination of music Y's audio intervals with the M dynamic poster images whose total matching degree is highest and above the preset value, obtaining a new beat-synced video. The way the smart tv 100 changes music is described in detail below and not repeated here.
It can also be understood that a new dynamic poster image may be added to the M dynamic poster images, or one of them may be replaced. In this case, the smart tv 100 redetermines, over all the new dynamic poster images, the combination of music X's audio intervals with those images whose total matching degree is highest and above the preset value, obtaining a new beat-synced video of dynamic poster images.
It should be noted that the number of dynamic poster images is generally the same as the number of audio intervals used. That is, given M dynamic poster images, the smart tv 100 randomly selects M consecutive audio intervals from the (N + 1) audio intervals of music X; if the total matching degrees of all combinations of those M intervals with the M images are below the preset value, the smart tv 100 re-acquires M intervals starting from the interval before or after any of the current M intervals, and repeats the above method until M intervals satisfying the condition are found; the audio formed by those M intervals serves as the background music for generating the beat-synced video, as sketched below. It should be understood that in some embodiments the number of dynamic poster images may also exceed the number of audio intervals, in which case the smart tv 100 redetermines new background music for the enlarged set of dynamic poster images to form their beat-synced video. The specific implementation details are described below and not repeated here.
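A tiny sketch of the window stepping; enumerating the windows left to right is just one reasonable reading of "the previous audio interval or the next audio interval".

```python
def candidate_windows(num_intervals: int, m: int):
    """Every run of m consecutive audio intervals out of
    num_intervals = N + 1; the embodiments step to an adjacent window
    whenever the current one fails the total-matching check."""
    return [list(range(start, start + m))
            for start in range(num_intervals - m + 1)]

print(candidate_windows(6, 4))
# [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
```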
A beat-synced video of dynamic poster images formed in this way not only makes the dynamic poster images and the background music agree in theme content but also associates the motion range of each poster's foreground object with the motion range of each audio interval, solving the problem of the foreground object's motion being uncoordinated with the background music, giving users audio-visual linkage, and improving the user experience. For example, suppose the theme tag of the movie 1 dynamic poster image shown in fig. 3 is "comfortable", the theme tag of the movie 2 dynamic poster image is "in mood", the theme tag of audio interval 1 of background music 1 is "urgent", and the theme tag of audio interval 2 is "moderate". The smart tv 100 associates the movie 1 dynamic poster image with the audio interval of background music 1 whose theme tag and motion range are the same as or similar to its own, does likewise for the movie 2 dynamic poster image, and then plays the two posters in the audio playing order (audio interval 1, then audio interval 2). In this way each dynamic poster image fits the audio interval associated with it, so the user not only experiences audio-visual linkage but may also become interested in the movies through the posters.
It is to be understood that, in the embodiments of the present application, the background music and poster images may be music and poster images that the smart tv 100 downloads from a cloud or other server and stores locally after connecting to the network; for example, after networking, the smart tv 100 downloads the poster images, music, and the like of each newly released film and television series from the major video application platforms and stores them locally. Alternatively, the smart tv 100 obtains the background music and poster images, according to the user's preferences, from a music library and an image library the user has stored on the smart tv 100.
To better understand the implementation of the video generation method, its details are described below with reference to figs. 4 to 11. It should be noted that the following method may be executed by a processor of the smart tv 100 or by an application installed on it, for example a video application or a gallery application; this application does not limit the executor. For convenience of description, the processor of the smart tv 100 is taken as the execution subject below.
It should also be understood that the method may be applied either by first determining M poster images and then selecting appropriate background music for them, or by first selecting N pieces of background music and then matching adapted poster images to each; the principle and implementation of the two are similar. The method of the present application is described below taking the selection of appropriate background music for M determined poster images as an example.
Specifically, fig. 4 is a schematic method flow diagram of an example of a video generation method according to some embodiments. As shown in fig. 4, the method 400 includes:
Step 402, obtaining M poster images and determining the theme label of each poster image.
It can be understood that, to match the M poster images with background music corresponding to their content and theme, the theme tag of each poster image must be determined from its content, and appropriate background music is then matched to all the poster images according to those tags. In some embodiments, the M poster images may be obtained by the smart tv 100 from poster images stored in a video application or another application installed on it; for example, the smart tv 100 obtains the M poster images from its gallery application. The M poster images may be poster images of videos that a video application pushes according to the user's preferences, or poster images of recently released videos; this application does not limit this.
Specifically, in some embodiments, the smart tv 100 may recognize the content of each poster image using an image recognition method, determine each poster image's theme from the recognized content, and add the corresponding theme tag to the image. For example, if the smart tv 100 recognizes that a poster image contains elements such as action and automobiles, the poster's theme is an action film and the tag added is "urgent"; if it recognizes elements such as tables and rooms, the poster's theme is a heart-warming film and the tag added is "warm". In some embodiments, the smart tv 100 may train an image recognition model in advance, use the trained model to recognize the content of the M poster images, and add a theme tag to each according to the recognized content, for example as sketched below. The training of the image recognition model and the way the smart tv 100 adds theme tags are described in detail below.
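A minimal sketch of this tagging rule; the recognizer is assumed to return a set of content elements, and the element-to-theme table is illustrative, built only from the two examples in the text.

```python
# Hypothetical element-to-theme rules derived from the examples above.
THEME_RULES = [
    ({"action", "car"}, "urgent"),   # action film
    ({"table", "room"}, "warm"),     # heart-warming film
]

def theme_tag(elements: set) -> str:
    """Map the content elements recognized in a poster image to a
    theme tag; the fallback tag is an assumption."""
    for required, tag in THEME_RULES:
        if required <= elements:     # all required elements recognized
            return tag
    return "neutral"

print(theme_tag({"action", "car", "city"}))  # 'urgent'
```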
Step 404, determining the motion range of the foreground object of each poster image.
It can be understood that if background music is matched to poster images only by theme tag, the matched music may have a theme close to or consistent with the posters' yet be too short or too long; trimming the music would then make the background music of the whole beat-synced video discontinuous, again creating an audio-visual mismatch.
Specifically, in some embodiments, the smart tv 100 performs foreground segmentation on each poster image's content to obtain its foreground object, and can then preliminarily determine the foreground object's motion range relative to the poster's background from the foreground object's position in the image: the distance from the outermost end of the foreground object's contour in each direction to the corresponding edge of the poster image (up, down, left, right) is the foreground object's motion range relative to the background in that direction. For example, if the distance from the leftmost end of the foreground object to the left edge is 2 pixels, from the rightmost end to the right edge is 3 pixels, from the topmost end to the upper edge is 2 pixels, and from the bottommost end to the lower edge is 1 pixel, the foreground object's motion range relative to the background is: 2 pixels leftward, 3 pixels rightward, 2 pixels upward, and 1 pixel downward. The details of how the smart tv 100 performs foreground segmentation and determines the foreground object's motion range are described below and not repeated here.
Step 406, acquiring music as background music to be matched, dividing the background music to be matched into audio intervals, and determining the theme label of each of its audio intervals.
It can be understood that, when matching appropriate background music to a specific poster image, it must first be determined whether the music's style or theme suits the poster, and only music whose theme suits the poster is used as its background music. Similarly, when selecting background music for several poster images at once, the music's theme must be determined to suit all or most of the posters before the resulting beat-synced video can achieve even preliminary audio-visual consistency. Therefore, after obtaining music as the background music to be matched, the music is divided into audio intervals, and the intervals are preliminarily matched to the poster images according to the intervals' theme tags.
Specifically, in some embodiments, the smart tv 100 may first obtain at least one piece of music from the music library as the background music to be matched, detect the n time nodes at which the music's tempo changes, divide the music into (n + 1) continuous audio intervals according to those nodes, determine each interval's theme tag from its audio changes, and add the tag to the interval. The specific details of detecting the audio nodes of the background music to be matched and tagging each audio interval are described below and not repeated here.
Step 408, determining the motion range of each audio interval in the background music to be matched.
As described in step 404, matching background music to posters by theme tag alone may produce music whose theme is close to the posters' but whose duration is too short or too long, causing an audio-visual mismatch; so when matching background music, whether the music suits the posters' motion ranges must also be considered. Therefore, in some embodiments of the present application, the motion range of each audio interval of the background music to be matched is calculated; then, on top of the preliminary theme-tag matching between posters and intervals, matching is performed between the posters' motion ranges and the intervals' motion ranges, yielding background music that suits the posters both in theme and in the motion range of their foreground objects.
Specifically, in some embodiments, the smart tv 100 may determine the motion range of the audio interval according to the audio interval duration (i.e., the audio interval playing duration) and the audio change rate value of the audio interval.
For example, the smart tv 100 calculates the motion range of a certain audio interval by equation (1):
range of motion for audio interval = audio rate of change value x audio interval duration (1)
Taking the motion range of the audio interval 2 with the audio change rate value of 7 and the audio interval duration of 6 as an example, the motion range =7 × 6=42 of the audio interval 2.
Suppose the interval durations and audio change rate values of some audio intervals of the music with ID number "000012" are: audio interval 1 lasts 3 seconds with an audio change rate value of 4 and theme label "burning"; audio interval 2 lasts 6 seconds with an audio change rate value of 7 and theme label "urgent"; audio interval 3 lasts 4 seconds with an audio change rate value of 1 and theme label "comfortable". The smart tv 100 then computes, using equation (1), the motion ranges of these audio intervals, shown with their theme labels in Table 4 below:
TABLE 4
Audio interval   Duration (s)   Audio change rate value   Theme label    Motion range
Interval 1       3              4                         burning        12
Interval 2       6              7                         urgent         42
Interval 3       4              1                         comfortable    4
As can be seen from Table 4, the motion range of audio interval 1 is 12, of audio interval 2 is 42, and of audio interval 3 is 4.
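Equation (1) applied to the Table 4 data, as a quick sanity check:

```python
intervals = [
    {"id": 1, "duration_s": 3, "rate": 4, "theme": "burning"},
    {"id": 2, "duration_s": 6, "rate": 7, "theme": "urgent"},
    {"id": 3, "duration_s": 4, "rate": 1, "theme": "comfortable"},
]
for iv in intervals:
    iv["motion_range"] = iv["rate"] * iv["duration_s"]   # equation (1)

print([iv["motion_range"] for iv in intervals])          # [12, 42, 4]
```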
Step 410, calculating the matching degree of each poster image with each audio interval of the background music to be matched.
It can be understood that when matching background music to one specific poster image, the poster only needs to match some audio interval of the music; but when matching background music to M poster images simultaneously, each poster image must be matched against each audio interval, and the most suitable combination is then determined. Therefore, the matching degree of each poster image with each audio interval is calculated, and the total matching degree of each combination of the M posters with the audio intervals is then determined from those pairwise matching degrees.
Specifically, in some embodiments, the smart television 100 first calculates the matching degree between each poster image and the theme label of each audio interval, then calculates the matching degree between each poster image and the motion range of each audio interval, and finally adds the matching degree between each poster image and the theme label of each audio interval to the matching degree between each image and the motion range of each audio interval to obtain the matching degree between each poster image and each audio interval.
For example, taking the matching degree between one poster image and one audio interval: the smart tv 100 first computes the matching degree between the poster's and the interval's theme labels, say 0.8, then the matching degree between the poster's and the interval's motion ranges, say 0.9, and adds the two to obtain the matching degree between the poster image and the audio interval: 0.8 + 0.9 = 1.7. Other implementation details of this step are described below and not repeated here.
Step 412, permuting and combining the M poster images with the audio intervals of the background music to be matched, and calculating the total matching degree of the poster images and the music in each combination.
It can be understood that there are multiple ways to combine the M poster images with the audio intervals, and in each combination every poster image has a matching degree with its corresponding interval. To balance the matching degrees across all posters, the smart tv 100 computes the total matching degree between the posters and the candidate music in each combination; the total matching degree indicates, to a certain extent, the best way to pair the posters with the intervals. For example, continuing with images 1 and 2 and audio intervals 1 and 2 from Table 2, assume the matching degrees are now as shown in Table 5 below:
TABLE 5
             Image 1    Image 2
Interval 1   1.30       1.20
Interval 2   1.25       1.10
As can be seen from Table 5, the matching degree between image 1 and interval 1 is 1.30, between image 2 and interval 1 is 1.20, between image 1 and interval 2 is 1.25, and between image 2 and interval 2 is 1.10. The total matching degree of each of the two combinations of the images with the audio intervals of music 1 is then as shown in Table 6 below:
TABLE 6

Combination 1 {interval 1 - image 1, interval 2 - image 2}: 1.30 + 1.10 = 2.40
Combination 2 {interval 1 - image 2, interval 2 - image 1}: 1.20 + 1.25 = 2.45
As can be seen from Table 6, the total matching degree of combination 1 {interval 1 - image 1, interval 2 - image 2} is 2.40, and that of combination 2 {interval 1 - image 2, interval 2 - image 1} is 2.45. In combination 1, although image 1 matches interval 1 very well (1.30), image 2 matches interval 2 poorly (1.10). In combination 2, although no single pairing is as strong as image 1 with interval 1 in combination 1, the pairwise matching degrees are close to each other and the total matching degree is higher than that of combination 1. Clearly, using combination 2 takes the matching between poster images and audio intervals into account more comprehensively.
It is understood that if only the total matching degree were considered, a combination might satisfy the condition even though some individual poster image hardly matches its audio interval. For example, continuing with the poster images and intervals above, assume the total matching degree of combination 1 {interval 1 - image 1, interval 2 - image 2} is 2.40 and that of combination 2 {interval 1 - image 2, interval 2 - image 1} is 2.30. The total of combination 1 is greater than that of combination 2, but within combination 1 the matching degree of interval 1 and image 1 is only 0.10 while that of interval 2 and image 2 is 2.30; the pairing of interval 1 and image 1 is clearly far too weak for combination 1 to be used. To avoid this, in some embodiments, the smart television 100 calculates the total matching degree of a combination only when the matching degree of every poster image with its audio interval in that combination is greater than a first preset value; that is, a combination is considered only when every individual pairing satisfies the condition.
Specifically, in some embodiments, the smart television 100 permutes and combines the M poster images with the audio intervals, and calculates the number of permutation combinations of poster images and audio intervals using the following formula (2):
A(i, j) = j! / (j − i)!    (2)
where i represents the number of poster images, j represents the number of audio intervals, and A(i, j) represents the number of permutation combinations of the i poster images and the j audio intervals. The smart television 100 then calculates the total matching degree between the poster images and the background music to be matched in each combination, and then executes step 414. The implementation details of step 412 are described below.
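By way of illustration only, the following Python sketch evaluates formula (2); the function name is ours, not the patent's:

```python
from math import factorial

def num_arrangements(i: int, j: int) -> int:
    # Formula (2): each of the i poster images is assigned a distinct one
    # of the j audio intervals, so A(i, j) = j! / (j - i)!.
    return factorial(j) // factorial(j - i)

print(num_arrangements(3, 3))  # 6, as in the 3-image, 3-interval example below
```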
Step 414, determining the highest total matching degree among the above combinations, and judging whether it is greater than or equal to a second preset value.
It can be understood that the combination with the highest total matching degree determined in step 412 is not necessarily satisfactory; for example, the highest total matching degree may be only 0.21. The smart television 100 therefore compares the highest total matching degree determined in step 412 with a second preset value. Only when the highest total matching degree is greater than or equal to the second preset value is the corresponding combination used to form the dynamic poster image click video; if even the highest total matching degree is less than the second preset value, the background music to be matched is not suitable and the smart television 100 replaces it. Specifically, in some embodiments, the smart television 100 determines the combination with the highest total matching degree among the multiple combinations and compares that value with the second preset value: if it is greater than or equal to the second preset value, step 416 is executed, i.e., a click video of the dynamic poster images is generated according to that combination; if it is less than the second preset value, step 406 is executed, i.e., new background music to be matched is obtained.
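As a minimal sketch of steps 412 and 414 combined (the dictionary-based interface and the example preset values are our assumptions, not the patent's):

```python
from itertools import permutations

def best_combination(match, m, first_preset, second_preset):
    """match[(image, interval)] is the pairwise matching degree; images and
    intervals are both indexed 0..m-1. Returns (assignment, total) for the
    admissible combination with the highest total matching degree, or None
    if none reaches the second preset value (replace the background music)."""
    best = None
    for intervals in permutations(range(m)):
        pairs = list(zip(range(m), intervals))        # image -> interval
        if any(match[p] < first_preset for p in pairs):
            continue                                  # step 412: filter weak pairings
        total = sum(match[p] for p in pairs)
        if best is None or total > best[1]:
            best = (pairs, total)
    return best if best and best[1] >= second_preset else None

# Table 5's 2 x 2 example; the second preset value 2.0 is an assumed figure.
match = {(0, 0): 1.30, (0, 1): 1.25, (1, 0): 1.20, (1, 1): 1.10}
print(best_combination(match, 2, first_preset=1.15, second_preset=2.0))
# -> ([(0, 1), (1, 0)], 2.45), i.e. combination 2 from Table 6
```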
Step 416, generating a click video of the dynamic poster images according to the combination with the highest total matching degree.
The smart television 100 generates a click video of the dynamic poster images from the M poster images according to the combination with the highest total matching degree and the playing order of the audio intervals.
Step 418, playing the click video of the dynamic poster images.
After generating the dynamic poster image click video, the smart television 100 plays it on the video application home page shown in fig. 1. In some embodiments, the click video of the dynamic poster images may be played while the user opens the smart television 100 and enters its main interface, i.e., as the startup video of the smart television 100; or it may be played on the main interface after the user opens the smart television 100 and enters the main interface; or it may be played on the main page of a video application installed on the smart television 100 after the user opens that application, which is not limited in this application.
The general flow of the video generation method of the present application is described above, and details of implementation of each step in the method 400 are described below with reference to the drawings.
In some embodiments, corresponding to step 404, taking poster image 1, poster image 2, and poster image 3 shown in fig. 5 as an example, the method for determining the motion range of poster image 1, poster image 2, and poster image 3 by smart tv 100 includes:
The method by which the smart television 100 performs foreground segmentation on a poster image and determines the motion range of its foreground object is as follows: after performing foreground segmentation on poster image 1, poster image 2, and poster image 3 shown in fig. 5, the smart television 100 obtains a binary image of each poster image, in which the white area is the foreground object and the black area is the background. Thus the foreground object of poster image 1 is P1 and its background object is P1', the foreground object of poster image 2 is P2 and its background object is P2', and the foreground object of poster image 3 is P3 and its background object is P3'. The motion ranges of the foreground objects P1, P2, and P3 relative to their respective background objects, as initially determined by the smart television 100, can be as shown in Table 7 below:
TABLE 7

ID number   Theme label     Lateral range (x_l, x_r)   Longitudinal range (y_t, y_d)   Motion range
1           "in mood"       4, 3                       1, 0                            7p
2           "comfortable"   …                          …                               1p
3           "urgent"        …                          …                               7p
In Table 7 above, "ID number" is the ID number of the poster image; the left-right (lateral) and up-down (longitudinal) ranges describe how far the foreground object of the poster image can move relative to the background object in each direction. Here x_l denotes how far the foreground object can move to the left relative to the background object, x_r how far to the right, y_t how far upward, and y_d how far downward. The motion range is the larger of the lateral and longitudinal ranges of the foreground object relative to the background object; for example, the lateral range of poster image 1 is x_l + x_r = 4 + 3 = 7 pixels (p), and its longitudinal range is y_t + y_d = 1 + 0 = 1 pixel. Since the lateral range of poster image 1 is greater than its longitudinal range, the motion range of poster image 1 is 7p. The theme label represents the theme of the poster image.
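As a one-line sketch of the motion range rule just described (names ours):

```python
def motion_range(x_l: float, x_r: float, y_t: float, y_d: float) -> float:
    # The motion range is the larger of the lateral range (x_l + x_r)
    # and the longitudinal range (y_t + y_d).
    return max(x_l + x_r, y_t + y_d)

print(motion_range(4, 3, 1, 0))  # poster image 1: max(7, 1) = 7 pixels
```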
Optionally, the ways in which the foreground object of a poster image can move relative to the background object are not limited to moving up, down, left, and right: the foreground object may also rotate relative to the background object, as in the dynamic poster image of movie 1 in fig. 2, or move diagonally upward or downward relative to the background object, as in the dynamic poster image of movie 2 in fig. 2; this application places no limitation on the manner of movement.
With reference to fig. 5 and Table 7 above: the theme label of poster image 1 is "in mood" and the motion range of its foreground object P1 is 7p; the theme label of poster image 2 is "comfortable" and the motion range of P2 is 1p; the theme label of poster image 3 is "urgent" and the motion range of P3 is 7p.
Then, the smart television 100 performs depth estimation on the poster image to obtain a depth estimation map of the poster image. The depth estimation map is generally a grayscale image: the darker a pixel (the smaller its gray value), the smaller the depth value it represents; the lighter a pixel (the larger its gray value), the larger the depth value it represents, with pure white representing the largest depth value. After obtaining the depth estimation map of the poster image, the smart television 100 converts the poster image into a dynamic poster image using the depth estimation described above combined with the motion range of the foreground object.
Because the motion range of some poster images' foreground objects is large, and some pixels of the background object have values similar to those of the foreground object, when the foreground object moves through its large motion range it can drag along those similar background pixels, leaving a vacancy in the background object. If there is no other poster image content around that vacancy to use as a reference, then when the smart television 100 fills in the vacancy's color, it forms a region that does not suit the rest of the poster image and degrades the user's visual experience. For example, take poster image 1 shown in fig. 5: the part of the background object P1' similar to the foreground object P1 is P1''. Since the lateral motion range of P1 is 7 pixels (4 movable to the left and 3 to the right), when P1 moves 3 pixels to the right together with P1'', a vacancy P1''' appears in the background object. Because there is no other image content on the left side of poster image 1 to reference, after the smart television 100 fills in P1''', the resulting color and content do not particularly match the rest of poster image 1, which affects the user's visual experience. It is understood that the 7, 4, and 3 pixels above are only exemplary and do not limit the motion range of a foreground object in this application. In some embodiments, when the poster image is relatively large, for example 1920 × 1080 pixels, the motion range may grow adaptively, e.g., to 50 or 100 pixels, which is not limited in this application.
Therefore, in some embodiments, the smart television 100 obtains the depth value of each pixel in the poster image from its gray value, determines the part of the background object whose gray values are close to those of the foreground object, and then further adjusts the motion range of the foreground object according to the position of that part, so that the foreground object can move as much as possible without harming the visual effect of the dynamic poster image presented to the viewer. For example, if the background part whose gray values are close to the foreground object's is centered in the middle of the poster image, then, since there is image content around the middle of the image that can be referenced, the smart television 100 may leave the motion range of the foreground object unadjusted or adjust it only slightly.
Specifically, continuing with poster image 1 shown in fig. 5, the rightward motion range can be adjusted from 3p to 1p, at which point the region P1''' narrows from 3p to 1p. The foreground object P1 can still move laterally by 1p relative to the background object P1', and a 1-pixel color fill does not cause the user noticeable visual discomfort. Optionally, in some embodiments, the smart television 100 may adjust the motion range of the foreground object in each direction according to the length, width, and color of the poster image, and the adjustment may be smaller or larger than 1 pixel. For example, if poster image 1 is a black-and-white poster, the rightward motion range of P1 might be adjusted from 3 pixels to 2 pixels, or not adjusted at all; whereas if poster image 1 is a poster in a pure color other than black, white, or gray, such as pure green, even a slight color difference may cause visual discomfort, and the rightward motion range of P1 might be adjusted from 3 pixels to 0.1 pixel. It should be understood that this application does not limit the adjustment manner or the adjustment amount of the poster image motion range.
In some embodiments, after the smart television 100 adjusts the motion ranges of the poster images' foreground objects, they are as shown in Table 8 below:
TABLE 8

ID number   Theme label     Motion range (after adjustment)
1           "in mood"       5p
2           "comfortable"   1p
3           "urgent"        7p
With reference to fig. 5 and Table 8 above: the theme label of poster image 1 is "in mood" and the motion range of its foreground object P1 has been adjusted from 7p to 5p; the theme label of poster image 2 is "comfortable" and the motion range of P2 is 1p; the theme label of poster image 3 is "urgent" and the motion range of P3 is 7p.
In some embodiments, corresponding to step 406, the method for detecting the audio nodes of the background music to be matched, dividing it into audio intervals, and setting a theme label for each audio interval includes:
the smart tv 100 randomly acquires at least one piece of music from a music library installed on the smart tv 100, and assuming that the smart tv 100 acquires one piece of music from the music library, the ID number of the piece of music is "000012". Optionally, the smart television 100 may further select at least one piece of music as the background music to be selected according to the preference of the user. For example, the smart tv 100 may arrange the music that the user likes to listen to at ordinary times according to the number of times the user listens to the songs by using a big data statistics method, and then the smart tv 100 selects one or more pieces of music from the music as the background music to be selected according to the number of times each piece of music is played. Specifically, assume that 10 pieces of music are listened to by the user at ordinary times, and the ID numbers thereof are: { "00003", "00004", "00005", "00006", "00007", "00008", "00009", "000010", "000011", "000012" }, which are sorted according to the playing times: { "000012", "00006", "00003", "00004", "000010", "000011", "00009", "00008", "00005", "00007" }, so the smart tv 100 preferentially selects music with ID number "000012" as the background music to be selected. It is understood that when the playing sequence of the ID numbers "000012" and "00006" are consistent, the smart television 100 will select the ID numbers "000012" and "00006" as the background music to be selected, participate in the following calculation of the matching degree with the dynamic poster image, and finally determine one of the ID numbers as the background music.
In other embodiments, the smart television 100 may also add a style label to each piece of music in the music library in advance, and then, according to the most common theme among the M poster images, preferentially select from the library music whose style is consistent with or similar to that theme as candidate background music. For example, suppose there are 4 pieces of music in the library with style labels music 1 (pop), music 2 (rock), music 3 (heavy metal), and music 4 (jazz), and the most common theme label among the M poster images is "urgent". The smart television 100 then preferentially selects music suited to the theme "urgent"; for example, movies or series with an "urgent" theme are typically scored with rock or heavy metal, so the smart television 100 takes music 2 ("rock") and music 3 ("heavy metal") as candidate background music to participate in the matching degree calculation with the dynamic poster images below, and determines one of them as the background music.
Then, the smart television 100 detects, according to the audio rhythm distribution of the whole piece of music, the n time nodes at which the audio rhythm changes, i.e., n audio nodes, and divides the music into (n + 1) continuous audio intervals according to these n audio nodes, where the duration of an audio interval is the playing duration of the audio within it.
In some embodiments, after obtaining the (n + 1) audio intervals of a certain piece of candidate background music, for example the (n + 1) audio intervals of the candidate background music with ID number "000012", the smart television 100 randomly selects from them as many audio intervals as there are poster images, to serve as the audio intervals to be matched in the matching degree calculation below, so as to determine the audio intervals that satisfy the conditions. For example, for the poster images shown in Table 8 above, the smart television 100 randomly selects 3 of the (n + 1) intervals of the candidate background music with ID number "000012" as the audio intervals to be matched, which then participate in the matching degree calculation with poster images 1, 2, and 3. In other embodiments, the smart television 100 may also select the required number of audio intervals in order starting from the first audio interval of the candidate music; or select them at fixed spacings; or select them in order starting from an arbitrary audio interval of the candidate music, which is not limited in this application. Taking the selection of as many audio intervals as poster images 1, 2, and 3 as an example, fig. 6 is a schematic diagram of audio intervals provided by some embodiments. As shown in fig. 6, the smart television 100 can detect the audio nodes of the music with ID number "000012" through an audio detection algorithm: A, B, …, E, F, …, J, K. Audio interval 1 lies between audio nodes A and B, audio interval 2 between audio nodes E and F, and audio interval 3 between audio nodes J and K. The manner in which the smart television 100 detects the audio nodes of the background music with a node detection algorithm is described below and is not elaborated here.
After the audio intervals of the background music with ID number "000012" are determined, the smart television 100 determines the audio change rate value of each audio interval from the variance of the audio amplitude within the interval: each moment in an audio interval has a corresponding audio amplitude, and the smart television 100 obtains the interval's audio change rate value by calculating the variance of all the audio amplitudes within the interval.
Then, the theme of each audio interval is determined from its audio change rate value, and a theme label is added to each audio interval. In some embodiments, each audio change rate value corresponds to a theme label; the theme label categories may be obtained by the smart television 100 from big-data statistics over common movie poster categories. For example, common movie theme labels include "urgent" … "burning", "warming", "light" … "relaxing", and so on. Since the audio change rate value indicates how intense the rhythm of an audio interval is (a higher value means a stronger rhythm), the smart television 100 can assign the aforementioned theme labels {"urgent" … "burning", "warming", "light" … "relaxing"} in order of audio change rate value from largest to smallest. For example, the audio change rate values and the theme labels that the smart television 100 adds to the audio intervals may be as shown in Table 9 below:
TABLE 9

Background music ID number: 000012

Audio interval     Interval duration   Audio change rate value   Theme label
Audio interval 1   3 seconds           4                         "burning"
Audio interval 2   7 seconds           7                         "urgent"
Audio interval 3   4 seconds           1                         "comfortable"
Here, the background music ID number identifies the currently detected background music; since the audio intervals are divided according to changes in the audio rhythm, they reflect the distribution of the audio nodes; and the interval duration is the playing duration of the audio interval.
Referring to fig. 6 and Table 9: the duration of audio interval 1 is 3 seconds, its audio change rate value is 4, and its theme label is "burning"; the duration of audio interval 2 is 7 seconds, its audio change rate value is 7, and its theme label is "urgent"; the duration of audio interval 3 is 4 seconds, its audio change rate value is 1, and its theme label is "comfortable".
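As a sketch of how the change rate values and labels above might be computed, assuming each interval's amplitudes are available as an array (the label list and helper names are our assumptions):

```python
import numpy as np

# Theme labels ordered from strongest rhythm to gentlest, following the
# categories quoted above; the exact list is an assumption.
THEME_LABELS = ["urgent", "burning", "warming", "light", "relaxing"]

def label_intervals(amplitudes_per_interval):
    # Audio change rate value = variance of the amplitudes in the interval.
    rates = [float(np.var(a)) for a in amplitudes_per_interval]
    # Intervals with larger change rate values get "stronger" labels.
    order = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    labels = [None] * len(rates)
    for rank, idx in enumerate(order):
        labels[idx] = THEME_LABELS[min(rank, len(THEME_LABELS) - 1)]
    return rates, labels
```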
In some embodiments, corresponding to the step 410, the method for calculating the matching degree of each poster image and each audio interval to be matched by the smart television 100 includes:
Continuing with the poster images shown in Table 8 and the audio intervals shown in Table 9: in some embodiments, the smart television 100 calculates the theme label matching degree between each poster image and each audio interval from the theme labels of the 3 poster images and the theme labels of audio intervals 1, 2, and 3 of the music with ID number "000012". In some embodiments, the smart television 100 may calculate the matching degree between a poster image's theme label and an audio interval's theme label using a matching degree formula; optionally, this may be a Euclidean distance formula, a Manhattan distance formula, a Minkowski distance formula, or a Pearson correlation coefficient formula. For example, in some embodiments, the smart television 100 calculates the matching degree of the poster image theme label and the audio interval theme label using the following Euclidean distance formula (3):
d_n(p, q) = sqrt( Σ_{i=1..n} (p_i − q_i)² )    (3)
where d_n(p, q) represents the matching degree of the poster image theme label and the audio interval theme label, p represents the theme label of a poster image, n represents the number of feature values of a theme label (of either a poster image or an audio interval), p_i represents each feature value of the poster image's theme label, q represents the theme label of an audio interval, and q_i represents each feature value of the audio interval's theme label. It can be understood that if a theme label is expressed as a word, the word has characteristics that can be represented by text features, for example the word's string length, its code, and the like; this application does not limit the number of features of a theme label.
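As a minimal sketch of formula (3), assuming each theme label has already been encoded as a small numeric feature vector (the encoding shown is hypothetical):

```python
import math

def tag_matching_degree(p_feats, q_feats):
    # Formula (3): Euclidean distance between the feature vectors of the
    # poster image theme label (p) and the audio interval theme label (q).
    # As written this is a distance, so a smaller value means the labels
    # are more similar.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(p_feats, q_feats)))

# Hypothetical 2-feature encodings (e.g. string length and a category code).
print(tag_matching_degree([6.0, 2.0], [6.0, 3.0]))  # 1.0
```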
The motion range of an audio interval may not be suited to the motion range of a poster image's foreground object. As described above, the motion range of an audio interval is related to the interval's duration and audio change rate, while the motion range of a poster image's foreground object is the maximum distance the foreground object can move laterally or longitudinally. If only the theme label match between poster image and audio interval were computed, and a poster image were associated with an audio interval whenever the theme label matching degree exceeded a preset value, it could happen that the foreground object finishes moving through its maximum motion range before the audio interval finishes playing, or that the audio interval finishes playing before the foreground object finishes moving. It can be appreciated that such inconsistency between vision and audio would also harm the user experience.
Therefore, in some embodiments, the smart television 100 calculates not only the matching degree between the poster image's theme label and the audio interval's theme label, but also the matching degree between the poster image's motion range and the audio interval's motion range. In some embodiments, the smart television 100 may calculate the matching degree of the two motion ranges using formula (4):
motion range matching degree = poster image foreground object motion range / audio interval motion range    (4)
Take the motion range of poster image 1's foreground object and the motion range of audio interval 2 of the candidate background music as an example. As can be seen from Table 8 above, the motion range of poster image 1's foreground object is 5; as can be seen from Table 4 above, the motion range of audio interval 2 is 42. The matching degree of the two motion ranges is then 5/42 ≈ 0.12.
Furthermore, comparing the motion ranges of candidate background music audio intervals with those of poster image foreground objects shows that the latter are generally single-digit values while the former are generally two-digit values. The two differ too much in magnitude: comparing them directly with formula (4) always yields a matching degree that is too small to be useful in subsequent calculations.
In some embodiments, the smart television 100 may therefore multiply the foreground object motion range of a poster image by a preset coefficient, so that it is of the same order of magnitude as the motion range of the candidate background music's audio interval. Continuing with poster image 1's foreground object and audio interval 2 of the candidate background music: the smart television 100 multiplies the foreground object motion range of poster image 1 by a preset coefficient of 8, giving a motion range of 5 × 8 = 40, and then calculates the matching degree of poster image 1's foreground object and audio interval 2 using formula (5):
motion range matching degree = 1 − |poster image foreground object motion range (after scaling) − audio interval motion range| / poster image foreground object motion range (after scaling)    (5)
The matching degree of poster image 1's foreground object motion range and audio interval 2's motion range is then 1 − |40 − 42| / 40 = 1 − 0.05 = 0.95.
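As a sketch of formulas (4)/(5) with the preset coefficient applied (the coefficient 8 follows the example above and is not fixed):

```python
def motion_matching_degree(image_range, interval_range, scale=8.0):
    # Formula (5): scale the poster image's motion range into the same
    # order of magnitude as the audio interval's, then compare.
    scaled = image_range * scale
    return 1.0 - abs(scaled - interval_range) / scaled

print(motion_matching_degree(5, 42))  # 1 - |40 - 42| / 40 = 0.95
```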
Finally, the smart television 100 combines the matching degree of the foreground object's motion range with the audio interval's motion range and the matching degree of the poster image's theme label with the audio interval's theme label to obtain the matching degree of the poster image and the audio interval. In some embodiments, the smart television 100 may calculate it with formula (6):
the matching degree of the poster image and the audio interval = the matching degree of the poster image theme label and the audio interval theme label + the matching degree of the poster image foreground motion range and the background music interval motion range (6)
For example, the smart television 100 adds the matching degree between the motion range of poster image 1's foreground object and the motion range of audio interval 2 (0.95) to the matching degree between poster image 1's theme label and audio interval 2's theme label (0.3), obtaining a matching degree of poster image 1 and audio interval 2 of 0.95 + 0.3 = 1.25.
In other embodiments, the smart television 100 may also assign weights to the motion range matching degree and the theme label matching degree. For example, whether a poster image matches an audio interval's theme label may matter more than whether the motion ranges match, so the theme label matching degree can be weighted more heavily: say the smart television 100 weights the theme label matching degree at 80% and the motion range matching degree at 20%, and then calculates the matching degree of the poster image and the audio interval using formula (7):
degree of matching between poster image and audio interval =80% × degree of matching between poster image subject label and audio interval subject label +20% × degree of matching between poster image foreground motion range and audio interval motion range (7)
For example, the smart television 100 multiplies the matching degree between poster image 1's theme label and audio interval 2's theme label (0.3) by 80%, multiplies the matching degree between poster image 1's foreground object motion range and audio interval 2's motion range (0.95) by 20%, and adds them, obtaining a matching degree of poster image 1 and audio interval 2 of 80% × 0.3 + 20% × 0.95 = 0.43. It should be understood that this application does not limit the way the smart television 100 combines the motion range matching degree and the theme label matching degree into the matching degree of the poster image and the audio interval.
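A minimal sketch of formulas (6) and (7) together (names and default weights ours):

```python
def pair_matching_degree(tag_match, motion_match, w_tag=0.8, w_motion=0.2):
    # Formula (7); with w_tag = w_motion = 1 this reduces to the plain
    # sum of formula (6).
    return w_tag * tag_match + w_motion * motion_match

print(pair_matching_degree(0.3, 0.95))            # 0.8*0.3 + 0.2*0.95 = 0.43
print(pair_matching_degree(0.3, 0.95, 1.0, 1.0))  # formula (6): 1.25
```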
In some embodiments, corresponding to step 412, the manner in which the smart television 100 combines each poster image with each audio interval, and calculates the total matching degree of the poster images and the background music to be matched in each combination, is as follows:
Continuing with the poster images shown in Table 8 and the audio intervals shown in Table 9: when the smart television 100 permutes and combines the 3 poster images and the 3 audio intervals of the music with ID number "000012", i = 3 and j = 3 in formula (2) above, so A(3, 3) = 6; there are therefore 6 combination schemes among poster images 1, 2, 3 and audio intervals 1, 2, 3. Assume the matching degrees of poster images 1, 2, 3 with audio intervals 1, 2, 3 of the music with ID number "000012", calculated in step 410, are as shown in Table 10 below:
TABLE 10

                   Poster image 1   Poster image 2   Poster image 3
Audio interval 1        1.26             1.23             1.10
Audio interval 2        1.15             1.16             1.81
Audio interval 3        1.26             1.35             1.38
The total matching degrees between the poster images and the music with ID number "000012" in each of the 6 permutation combinations of poster images and audio intervals, as calculated by the smart television 100, are shown in Table 11 below (each total is the sum of the three pairwise matching degrees from Table 10):

TABLE 11

Combination 1 {interval 1 - image 1, interval 2 - image 2, interval 3 - image 3}: 1.26 + 1.16 + 1.38 = 3.80
Combination 2 {interval 1 - image 1, interval 2 - image 3, interval 3 - image 2}: 1.26 + 1.81 + 1.35 = 4.42
Combination 3 {interval 1 - image 2, interval 2 - image 1, interval 3 - image 3}: 1.23 + 1.15 + 1.38 = 3.76
Combination 4 {interval 1 - image 2, interval 2 - image 3, interval 3 - image 1}: 1.23 + 1.81 + 1.26 = 4.30
Combination 5 {interval 1 - image 3, interval 2 - image 1, interval 3 - image 2}: 1.10 + 1.15 + 1.35 = 3.60
Combination 6 {interval 1 - image 3, interval 2 - image 2, interval 3 - image 1}: 1.10 + 1.16 + 1.26 = 3.52

As can be seen from Table 11, the total matching degree of combination 2 {audio interval 1 - poster image 1, audio interval 2 - poster image 3, audio interval 3 - poster image 2} is the highest, at 4.42.
In some embodiments, the smart television 100 may first judge, for each combination, whether the matching degree of every poster image with its audio interval is greater than or equal to the first preset value, and only then calculate the total matching degree of that combination. For example, continuing with Tables 10 and 11 and assuming the first preset value is 1.15: the pairing of audio interval 1 with poster image 3 has a matching degree of 1.10, which is below 1.15, so the smart television 100 skips combinations 5 and 6, which contain that pairing, and directly calculates the total matching degrees of the remaining combinations. The first preset value is set statistically by developers using big data, and may for example be the average matching degree of poster images and audio intervals across combinations. In this way, combinations in which some poster image matches its audio interval too poorly can be excluded.
Continuing with Tables 10 and 11: in some embodiments, corresponding to step 414, determining the highest total matching degree among the combinations and judging whether it is greater than or equal to the second preset value proceeds as follows. The smart television 100 determines that combination 2 {audio interval 1 - poster image 1, audio interval 2 - poster image 3, audio interval 3 - poster image 2} has the highest total matching degree, 4.42, so it compares the total matching degree of combination 2 with the second preset value. The second preset value is likewise set statistically by developers using big data.
Assuming the second preset value is 4, the total matching degree of combination 2 exceeds it, and the smart television 100 generates the dynamic poster image click video according to combination 2. That is, the smart television 100 takes audio intervals 1 to 3 cut from the music with ID number "000012" as the background music of poster images 1, 2, 3, and plays the poster images in the order poster image 1, poster image 3, poster image 2.
Assuming instead that the second preset value is 5, the total matching degree of combination 2 falls below it, and the smart television 100 reselects three audio intervals from the other audio intervals of the music with ID number "000012" and repeats the methods of steps 410 to 412 until audio intervals satisfying the above conditions are determined. For example, as shown in fig. 7, the smart television 100 reselects three audio intervals, taking audio interval X (the interval preceding audio interval 1), audio interval 1, and audio interval 2 as the new audio intervals to be matched; it calculates the total matching degrees of all combinations of audio intervals X, 1, 2 with poster images 1, 2, 3, and determines from them the combination whose total matching degree is highest and exceeds the second preset value of 5, to form the dynamic poster click video. In some embodiments, the smart television 100 stores in its memory the foreground object, background object, theme label, and motion range of each poster image obtained in steps 402 to 410, the audio intervals of each piece of background music together with their theme labels and change rate values, and the matching degree of each poster image with each audio interval. When calculating the total matching degrees of combinations involving a new audio interval, the smart television 100 can then read previously computed poster image/audio interval matching degrees directly from memory, improving calculation efficiency. For example, when determining the highest matching degree among the combinations of audio intervals X, 1, 2 and poster images 1, 2, 3, the smart television 100 only needs to compute the matching degrees of audio interval X with poster images 1, 2, 3; it does not need to recompute those of audio intervals 1 and 2 with poster images 1, 2, 3.
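The caching just described can be sketched as a simple memoization layer (the interface is ours; `compute` stands in for the formula (6)/(7) calculation above):

```python
class MatchCache:
    # Memoizes poster image / audio interval matching degrees so that when
    # new audio intervals are tried, only the new pairings are computed.
    def __init__(self, compute):
        self._compute = compute
        self._cache = {}

    def match(self, image_id, interval_id):
        key = (image_id, interval_id)
        if key not in self._cache:
            self._cache[key] = self._compute(image_id, interval_id)
        return self._cache[key]
```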
Assuming the highest total matching degree among all combinations of audio intervals and poster images for the candidate background music with ID number "000012" is still smaller than the second preset value, the smart television 100 replaces the candidate background music and repeats steps 406 to 410 until music satisfying the conditions is obtained, which it then uses as background music to generate the dynamic poster image click video; for implementation details refer to the description above, which is not repeated here.
In some embodiments, when the user clicks one of the poster images in the dynamic poster image click video and watches the video corresponding to that poster image, the smart television 100 replaces the clicked poster image and determines background music that matches both the new poster image and the remaining original poster images.
For example, take the dynamic poster click video composed of poster images 1, 2, 3 shown in Table 8 and the background music with ID number "000012", and assume the user clicks poster image 1 and watches its corresponding video. The smart television 100 then acquires a new poster image Y in the manner described in step 402, and re-determines the matching degree of the new poster image Y with each of the audio intervals 1, 2, 3 of the background music with ID number "000012" in the manner described in steps 402 to 414. If the new poster image Y does not match those audio intervals well enough, then, so that the new poster image Y and the original poster images 2 and 3 can play under the same background music, the smart television 100 replaces the background music in the manner described in step 406 and calculates whether each audio interval of the new candidate background music matches poster image Y and the original poster images 2 and 3, until it determines background music that matches all the poster images.
In the above embodiment, the background music is replaced so that the new poster image Y and the original poster images 2 and 3 play under the same background music. It can be understood that the smart television 100 may instead select new background music for the new poster image Y alone and play poster image Y under that new background music.
For example, in other embodiments, continuing with the dynamic poster click video composed of poster images 1, 2, 3 shown in Table 8 and the background music with ID number "000012", assume the user clicks poster image 1 and watches its corresponding video. The smart television 100 acquires a new poster image Y in the manner described in step 402 and repeats steps 402 to 414 to select new background music for the new poster image Y; for implementation details refer to the description above, which is not repeated here.
The implementation of the video generation method of this application has been described above; the method yields a click video of dynamic poster images. As described above, the dynamic poster image click video associates the theme of each dynamic poster image with the theme of an audio interval, and the motion range of each dynamic poster image's foreground object with the motion range of an audio interval, so that when the click video plays, the user has the audiovisual experience shown in fig. 3. The following briefly introduces the methods involved in the steps above: model training, adding theme labels to poster images with the trained model, foreground segmentation of poster images, and audio node detection. It should be understood that the following implementations are exemplary and do not limit the specific implementation details of the video generation method of this application; in other embodiments, other alternative means may be applied to achieve the same technical effect, which are not elaborated here.
In some embodiments, corresponding to step 402, the method for training the image recognition model and the method for adding the theme tag to the poster image by the smart television 100 include:
(1) The smart television 100 trains an image recognition model.
The smart television 100 trains the image recognition model on images with pixel-level labels from an existing semantic segmentation data set, so that the trained model can recognize the content of a poster image, for example the people, rooms, automobiles, buildings, and so on in it. An image with pixel-level labels is one in which every pixel has a corresponding class label: pixels belonging to a person are labeled "person", pixels belonging to a room are labeled "room", pixels belonging to a car are labeled "car", and so on.
Specifically, the smart television 100 takes an image in the data set with a preset pixel level category label as target data; then, the smart television 100 inputs the target data into the image recognition model to be trained to obtain an image recognition result of the target data, and calculates a loss function of the image recognition model according to the image recognition result of the target data.
In some embodiments, the smart tv 100 may calculate the loss function by equation (8):
L_Seg = −y(t) · log F(x_t)    (8)

where F(x_t) is the result recognized by the image recognition model, y(t) is the preset pixel-level class label of the target data, and L_Seg is the loss function of the image recognition model.
Then, the smart television 100 adjusts the parameters of the image recognition model, such as the weights of each layer of the neural network it uses, according to the value of the loss function, so as to reduce it, until the model's output is the same as or close to the expected output (i.e., the preset pixel-level class labels); at that point the training of the image recognition model is considered complete.
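A minimal PyTorch-style sketch of one training step under formula (8); the model, batch shapes, and optimizer are assumptions, and nn.CrossEntropyLoss realizes the per-pixel −y(t)·log F(x_t) term:

```python
import torch.nn as nn

def train_step(model, images, labels, optimizer):
    # `model` is any per-pixel recognition network (FCN, U-Net, ...);
    # `images` is an N x 3 x H x W batch, `labels` the preset pixel-level
    # class ids with shape N x H x W.
    criterion = nn.CrossEntropyLoss()   # per-pixel -y(t) * log F(x_t)
    optimizer.zero_grad()
    logits = model(images)              # N x C x H x W class scores
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()                    # adjust weights to reduce the loss
    return loss.item()
```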
In some embodiments, the data set used to train the image recognition model may be a 2D data set such as PASCAL VOC (visual object classes) or MS COCO (Microsoft common objects in context); a 2.5D data set such as NYU-Depth V2, SUN-3D, or SUN RGB-D; or a 3D data set such as Stanford 2D-3D or ShapeNet Core. This application does not limit the data set used to train the image recognition model.
In some embodiments, the image recognition model may be based on a neural network architecture such as fully convolutional networks (FCNs), SegNet, U-Net, or DeepLab V1-V3. This application does not limit the type of neural network architecture used to train the image recognition model.
(2) The smart television 100 identifies the content of the poster image by using the trained image identification model, and adds a theme label to the poster image.
The smart television 100 identifies the content of the poster image with the trained image recognition model, and when it recognizes content conforming to a certain theme A, it adds the theme label "A" to the poster image.
For example, if the smart television 100 recognizes, using the trained image recognition model, that poster image 2 contains elements such as a person, a room, and a sofa, and these elements belong to the theme "comfortable", the smart television 100 adds the theme label "comfortable" to poster image 2. It should be understood that when a poster image contains element content under multiple themes, the smart television 100 classifies it by the proportion of element content under each theme. Continuing with poster image 2: the smart television 100 recognizes that it contains a person, a room, a sofa, and a window; suppose the room, sofa, and window belong to the theme "comfortable" and the person to the theme "action". Since the amount of "comfortable" element content (3) is greater than the amount of "action" element content (1), the smart television 100 adds the theme label "comfortable" to poster image 2.
The foregoing is merely exemplary and does not limit how this application trains the image recognition model or uses it to add theme labels to poster images. In other embodiments, the smart television 100 may train the model or add theme labels in other ways. For example, when training the image recognition model, the smart television 100 may use poster images that already carry preset theme labels as the target data; after training, the smart television 100 can then recognize the content of a poster image and add a theme label to it in one pass. That is, a model trained in this way automatically adds theme labels while recognizing poster image content, without a separate labeling step.
In some embodiments, corresponding to step 404, the method for foreground segmentation of the poster image by the smart tv 100 is as follows:
(1) The smart tv 100 trains the image segmentation model.
The principle of training the image segmentation model is the same as that of training the image recognition model in step 402; refer to the method of training the image recognition model in step 402 for details, which are not repeated here.
(2) The smart television 100 performs foreground segmentation on the poster image by using the trained image segmentation model.
The smart television 100 performs foreground segmentation on the poster image using the trained image segmentation model. For example, if the poster image contains four different target objects, a table, a dog, a cup, and a person, then processing it with the trained image segmentation model yields four region segmentation images: table, cup, dog, and person. The smart television 100 then takes a particular region segmentation image as the foreground object of the poster image according to preset conditions, so as to determine the motion range of the foreground object in the dynamic poster image. In some embodiments, the smart television 100 may choose a region segmentation image as the foreground object according to the poster image's theme: for example, if the theme is "urgent" and the region segmentation images include a person, a car, a building, and a window, the smart television 100 may take the region segmentation images corresponding to the car and the building as the foreground objects, to match the "urgent" theme. In other embodiments, the smart television 100 may choose the foreground object according to the information the poster image is meant to convey: for example, if the poster image announces a release date and its region segmentation images include a clock, a person, and an animal, then, since the information to be conveyed is the release date, the clock may be taken as the foreground object.
In some embodiments, corresponding to step 406, the method for detecting the background music audio node and dividing the audio interval for the background music according to the audio node by the smart tv 100 is as follows:
To understand the implementation process of audio node detection, its principle is first introduced.
Generally, audio used as background music has frequencies between roughly 70 Hz and 4000 Hz. For composite audio with multi-frequency signals, such as polyphony, audio nodes cannot be detected by simple peak extraction on the time-domain signal. The time-domain signal is therefore first converted into a frequency-domain signal; the converted signal is then divided into a number of windows, and adjacent windows are differenced to obtain difference data between them, which represents the difference between the amplitude of the previous window's signal and that of the current window's signal.
Specifically, assume the audio f(x) with ID number 0000X has duration t seconds, sampling rate f_s = 44100 Hz, and number of samples k = f_s · t. The smart television 100 detects audio nodes in the following steps:
(1) The time-domain signal of the audio is converted into a frequency-domain signal using the Fourier transform.
The signal f(k) is divided into n window segments f_i(k) of equal size, where 1 ≤ i ≤ n; the length of each window is L = 1024, so the audio duration of each window is

t_w = L / f_s = 1024 / 44100 ≈ 0.023 seconds,

and the time corresponding to the i-th window is

t_i = i · L / f_s,

i.e., the time of that window relative to the start of the audio. Then, for each window f_i(k), its Fourier transform F_i(w) is obtained, forming the Fourier transform window sequence F = {F_1(k), F_2(k), …, F_n(k)}, where F represents the frequency-domain characteristics, e.g. amplitudes, of the audio in each window.
(2) The audio signal is subjected to difference processing to obtain a difference window sequence (difference data sequence).
The smart television 100 performs difference processing on the Fourier window sequence F using formula (9):

d_i = Σ_k ( F_i(k) − F_{i−1}(k) ),  2 ≤ i ≤ n    (9)
where d_i represents the difference data of adjacent Fourier windows, k represents the number of samples, and F_i(k) represents the frequency-domain characteristic of the audio in the i-th window.

From the results of formula (9), the smart television 100 obtains the difference window sequence D = {d_2, d_3, …, d_n}. When d_i is positive and large in absolute value, the amplitude of F_i(k) at time t_i is stronger than that of F_{i−1}(k) at the previous time t_{i−1}, i.e., the audio signal is strengthening, indicating that t_i may be the starting node of a passage with a strong rhythm. Correspondingly, when d_i is negative and large in absolute value, the amplitude of F_i(k) at t_i is weaker than that of F_{i−1}(k) at t_{i−1}, i.e., the audio signal is weakening, indicating that t_i may be the starting node of a passage with a weak rhythm.
(3) The difference window sequence is compared with a preset difference value, and the time nodes of the Fourier windows whose difference data exceed the preset value are taken as audio nodes, yielding the audio node sequence.
In some embodiments, the smart television 100 compares each difference datum with a preset difference value; if a difference datum is greater than the preset value, the time node of the corresponding Fourier window is an audio node. For example, if d_i is greater than the preset difference value, the smart television 100 regards F_i(k) as a rhythm-point window, i.e., t_i is an audio node. In the same manner, the audio node sequence T = {t_1, t_2, …, t_m} of the audio is obtained, where m < n.
Alternatively, the preset difference value may be obtained by having the smart tv 100 calculate, using equation (10), the mean of the amplitude differences around each window of the difference window sequence D = {d_1, d_2, …, d_n}:

r_i = (1/W) Σ_{j=i−W/2}^{i+W/2} d_j    (10)

where W is the size of the window over which the mean is taken, and the mean r_i at position i is calculated as the mean of its neighbourhood of W windows.
The smart television 100 may use equation (10) to calculate the sequence R = {r_1, r_2, …, r_n} of mean amplitude differences corresponding to the difference window sequence. Then, for any element d_i in D, only when d_i > r_i does the smart television 100 regard F_i(k) as a window containing a tempo point, i.e., t_i as an audio node, finally obtaining the audio node sequence T = {t_1, t_2, …, t_m}.
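The adaptive threshold of equation (10) and the node extraction might look as follows; the neighbourhood size W and the mapping from difference index back to window time are illustrative assumptions:

```python
import numpy as np

def audio_nodes(d: np.ndarray, fs: int = 44100, L: int = 1024,
                W: int = 8) -> np.ndarray:
    """Return node times t_i (in seconds) where d_i exceeds the moving
    mean r_i of its W-window neighbourhood, per equation (10)."""
    r = np.convolve(d, np.ones(W) / W, mode="same")  # mean sequence R
    i = np.where(d > r)[0] + 1       # map difference index to window index
    return i * L / fs                # t_i = i * L / f_s
```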
Fig. 8 shows a hardware structure diagram of the smart television 100.
As shown in fig. 8, the smart tv 100 may include a processor 10, a memory 20, a display screen 30, a camera 40, and a speaker 50.

The processor 10 is configured to perform foreground segmentation, depth estimation, and the like on the poster images, and is further configured to perform audio node detection on the audio.

The memory 20 is configured to store computer program instructions. When the computer program instructions are executed by the smart tv 100, the processor 10 of the smart tv 100 implements the video generation method described in the foregoing steps 402 to 418.

The display screen 30 is configured to display the dynamic poster image click video obtained by the video generation method, and to receive touch operations such as clicks and slides on it so as to form display content adapted to the intention of the touch operation. For example, in some embodiments, a user clicks one of the poster images played in the dynamic poster image click video on the display screen 30, and a video such as a movie or a television series corresponding to that dynamic poster image is then played on the display screen 30.

The camera 40 is used to capture air gestures of the user. For example, when the user waves a hand in the air, the camera 40 obtains images of the waving motion and sends the images containing the air gesture to the processor 10 for recognition processing, so as to obtain the control instruction corresponding to the wave; the smart tv 100 responds to the instruction by forming corresponding display content on the display screen 30. For example, if the instruction corresponding to an air wave indicates "page turning", then when the user makes the waving gesture, the content displayed on the display screen 30 is turned from the current page to the next page.

The speaker 50 is used to play the audio described in the above embodiments, and is also used to play the audio content of a specific video, such as a movie or television series corresponding to a dynamic poster image.
It should be understood that the hardware structure of the smart tv 100 shown in fig. 8 is only an example, in other embodiments, the smart tv 100 may further include more structures, such as an antenna for communicating with other electronic devices, a touch sensor for receiving a user touch instruction, and the like, and the composition of the hardware structure of the smart tv 100 is not limited in this application.
Fig. 9 is a block diagram of a software configuration of the smart tv 100 according to the embodiment of the present invention.
As shown in fig. 9, the smart tv 100 may be divided into an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer.
Wherein the application layer may include a series of application packages.
As shown in fig. 9, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc. In embodiments of the present application, the application package may include a gallery application or the like.
The application framework layer may include a view system, a gesture recognition system, and the like.
In an embodiment of the present application, the gesture recognition system is configured to recognize a user operation performed on the gallery application by the user on the screen of the smart tv 100.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build a display interface for an application. The display interface may be composed of one or more display elements, where a display element refers to an element in the display interface of an application in the screen of the electronic device. For example, the display elements may include buttons, text, pictures, pop-ups, menus, title bars, lists, or search boxes, among others. The display interface of the application may include at least one display element. In an embodiment of the present application, the view system may be configured to implement a layout scheme of a display interface of an application of the present application, for example, when the application is started, the view system may dynamically adjust a position of a display element in the display interface based on a size of a display area of the display interface of the application in a screen of the smart television 100; meanwhile, the view system can also configure a display style model for the display interface of the application, and when the application is started, the view system uses the display style parameters of the application to calculate the display effect of the display elements in the display interface through the display style model.
The Android Runtime comprises a core library and a virtual machine, and is responsible for scheduling and managing the Android system.
The core library comprises two parts: one part contains the functions that the Java language needs to call, and the other part is the core library of Android.

The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files, and is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, three-dimensional graphics processing libraries (e.g., OpenGL ES), and 2D graphics engines (e.g., SGL).
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files, among others. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The above describes the flow of the video generation method on the smart television 100. It should be understood that the video generation method of the present application may also be implemented on a specific video application platform, that is, on a server; after the video application program is installed on the smart television 100, a user only needs to open the video application program to view the dynamic poster image click video. It should be understood that the principle of implementing the video generation method of the present application on the server 200 is consistent with the principle of implementing it on the smart television 100. The flow of implementing the video generation method on the server 200 is briefly described below; the specific implementation of each step is consistent with the above-mentioned steps 402 to 418 and is therefore not repeated.
Fig. 10 shows a flowchart of a method for implementing the video generation method on the server 200. As shown in fig. 10, the method 1000 includes:
Step 1002, obtaining M poster images and determining a theme label of each poster image.
Step 1004, determining the motion range of the foreground object of each poster image.
Step 1006, acquiring music as background music to be selected, performing audio interval division on the background music, and determining a theme label of each audio interval of the background music.
Step 1008, determining the motion range of each audio interval in the background music to be selected.
Step 1010, calculating the matching degree of each poster image and each audio interval of the background music to be selected.
Step 1012, arranging and combining the poster images and the audio intervals of the background music to be selected, and calculating the total matching degree of the poster images and the background music to be selected in each combination mode.
Step 1014, determining the highest total matching degree from the above combination modes, and judging whether the highest total matching degree is greater than or equal to a second preset value; when the highest total matching degree is greater than or equal to the second preset value, step 1016 is executed, and when it is smaller than the second preset value, step 1006 is executed.
Step 1016, generating the click video of the dynamic poster image according to the combination mode with the highest total matching degree.
Step 1018, sending the click video of the dynamic poster image to the smart tv 100, and playing the click video of the dynamic poster image by the smart tv 100.
The same steps as those of the method 400 in the method 1000 can refer to the related description in the method 400, and are not repeated herein.
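As an illustration of steps 1010 to 1016, a brute-force search over the permutations of poster images could be sketched as follows; the matrix match[i][j] of per-pair matching degrees and the second_preset argument are assumptions, and for a large M a combinatorial optimizer would replace the factorial enumeration:

```python
from itertools import permutations

def best_assignment(match: list[list[float]], second_preset: float):
    """Try every assignment of M poster images to M audio intervals and
    return the one with the highest total matching degree, or None if
    even the best total is below the second preset value (step 1014)."""
    m = len(match)
    best_total, best_order = float("-inf"), None
    for order in permutations(range(m)):          # one combination mode
        total = sum(match[img][slot] for slot, img in enumerate(order))
        if total > best_total:
            best_total, best_order = total, order
    return (best_order, best_total) if best_total >= second_preset else None
```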
In some embodiments, the server 200 establishes a communication connection with the smart tv 100, and when the user opens the video application program (or video application) 300 shown in fig. 8 at the smart tv 100, the interface diagram shown in fig. 1 is displayed on the display screen 30 of the smart tv 100, and the dynamic poster image click video is displayed at the movie poster image display area 110 shown in fig. 1.
Fig. 11 shows a schematic block diagram of a hardware configuration of a server 200.
As shown in fig. 11, the server 200 may include a processor 210, an external memory interface 220, an internal memory 221, a SIM card interface 295, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the server 200. In other embodiments of the present application, server 200 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In the above structure constituting the server 200, the processor 210 may include one or more processing units, such as: the processor 210 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. Wherein, the different processing units may be independent devices or may be integrated in one or more processors. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 210 for storing instructions and data. In some embodiments, the memory in processor 210 is a cache memory. The memory may hold instructions or data that have just been used or recycled by processor 210. If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 210, thereby increasing the efficiency of the system.
In some embodiments, processor 210 may include one or more interfaces. The external memory interface 220 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the server 200. The external memory card communicates with the processor 210 through the external memory interface 220 to implement a data storage function. For example, files such as music, video, etc. are saved in the external memory card. In the embodiment of the application, a user can record useful information in a video through the recording method of the application while playing a video file stored in an external memory card.
The internal memory 221 may be used to store computer-executable program code, which includes instructions. The internal memory 221 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, a phonebook, etc.) created during use of the server 200, and the like. In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like. The processor 210 executes various functional applications of the server 200 and data processing by executing instructions stored in the internal memory 221 and/or instructions stored in a memory provided in the processor 210. In the present embodiment, the processor 210 may perform the locking function of the input method application of the server 200 by executing instructions stored in the internal memory 221.
The USB interface 230 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like.
It should be understood that the interface connection relationship between the modules in the embodiment of the present invention is only an exemplary illustration, and does not form a structural limitation on the server 200. In other embodiments of the present application, the server 200 may also adopt different interface connection manners in the above embodiments, or a combination of multiple interface connection manners.
The charge management module 240 is configured to receive a charging input from a charger. The power management module 241 is used to connect the battery 242, the charging management module 240 and the processor 210. The power management module 241 receives the input of the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the display (not shown), the camera (not shown), and the wireless communication module 260.
The wireless communication function of the server 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals.
The mobile communication module 250 may provide a solution including 2G/3G/4G/5G wireless communication and the like applied on the server 200.
The wireless communication module 260 may provide a solution for wireless communication applied to the server 200, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), global Navigation Satellite System (GNSS), frequency Modulation (FM), near Field Communication (NFC), infrared (IR), and the like.
The keys (not shown) include a power-on key (not shown), a volume key (not shown), and the like. The server 200 may receive a key input, and generate a key signal input related to user setting and function control of the server 200.
An embodiment of the present application further provides an electronic device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer memory, read-only memory (ROM), random Access Memory (RAM), electrical carrier signal, telecommunication signal, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical function division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the description above, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (12)

1. A video generation method is applied to electronic equipment, and is characterized by comprising the following steps:
dividing background music used for generating the video into N continuous audio intervals, wherein N is a positive integer;
and matching the N images to corresponding audio intervals according to the audio rhythm change of each audio interval and the matching degree between the N images to generate the video, wherein in the playing process of the video, the images matched with the audio intervals are displayed when the audio intervals are played.
2. The method of claim 1, wherein the degree of match comprises at least one of:
a first degree of match between an audio tempo change of the audio interval and a content of an image;
and a second matching degree between the audio rhythm change of the audio interval and the duration of the audio interval and the dynamic motion range of the foreground object in the image.
3. The method of claim 2, wherein the dynamic range of foreground objects of the image is calculated by:
and carrying out foreground segmentation on the image by using a foreground segmentation neural network model to obtain a foreground object and a background object of the image, and obtaining a dynamic motion range of the foreground object in the image according to the position relation of the foreground object relative to the background object.
4. The method of claim 2, wherein the first degree of match is calculated by:
utilizing a content recognition neural network model to perform content recognition on the image, adding a content tag for reflecting the image content to the image according to the recognized content, and adding a rhythm tag for reflecting the audio rhythm change to the audio interval according to the audio rhythm change of the audio interval;
and calculating the matching degree of the content label and the rhythm label to obtain the first matching degree.
5. The method of claim 2, wherein the second degree of match is calculated by:
calculating the motion range of the audio interval according to the audio rhythm change and the duration of the audio interval;
and calculating the matching degree between the motion range of the audio interval and the dynamic motion range of the foreground object in the image to obtain the second matching degree.
6. The method according to claim 2, wherein the matching the N images to the corresponding audio intervals according to the audio rhythm variation of each audio interval and the matching degree between the N images comprises:
respectively allocating the N images to the N continuous audio intervals according to a first sequence, and calculating a plurality of first matching degrees and a plurality of second matching degrees between each audio interval and the allocated images;
calculating the sum of the plurality of first matching degrees and the plurality of second matching degrees under the condition that the plurality of first matching degrees are all larger than a first matching degree threshold value and the plurality of second matching degrees are all larger than a second matching degree threshold value; and,
and when the sum of the first matching degrees and the second matching degrees is greater than a total matching degree threshold value, matching the N images to the corresponding audio intervals according to the first sequence.
7. The method according to claim 6, wherein said calculating the sum of the plurality of first matching degrees and the plurality of second matching degrees in the case that the plurality of first matching degrees are all greater than a first matching degree threshold and the plurality of second matching degrees are all greater than a second matching degree threshold comprises:
and calculating the sum of a first matching degree and a second matching degree between each audio interval and the distributed images respectively, and calculating the sum of the plurality of first matching degrees and the plurality of second matching degrees under the condition that the sum of the first matching degree and the second matching degree between each audio interval and the distributed images is larger than a third matching degree threshold value.
8. The method according to claim 2, wherein the matching the N images to the corresponding audio intervals according to the audio rhythm variation of each audio interval and the matching degree between the N images comprises:
respectively allocating the N images to the N continuous audio intervals according to a first sequence, and calculating a plurality of first matching degrees and second matching degrees between each audio interval and the allocated images;
and calculating a sum of the plurality of first matching degrees and the plurality of second matching degrees, and when the sum of the plurality of first matching degrees and the plurality of second matching degrees is greater than a total matching degree threshold value, matching the N images to the corresponding audio sections in the first order.
9. The method according to claim 8, wherein a sum of a first matching degree and a second matching degree between each audio interval and the assigned image is calculated before calculating the sum of the plurality of first matching degrees and the plurality of second matching degrees, and the sum of the plurality of first matching degrees and the plurality of second matching degrees is calculated in a case where the sum of the first matching degree and the second matching degree between each audio interval and the assigned image is greater than a third matching degree threshold.
10. The method according to any one of claims 1 to 9, wherein the dividing background music used for generating the video into N consecutive audio intervals comprises:
and dividing the background music into the N continuous audio intervals according to the rhythm change of the background music.
11. An electronic device, comprising:
a memory storing computer program instructions;
a processor coupled to a memory, the memory storing computer program instructions that, when executed by the processor, cause the electronic device to perform the method of generating video of any of claims 1-10.
12. A computer-readable medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the method of generating a video according to any one of claims 1 to 10.