CN112685592B - Method and device for generating sports video soundtrack - Google Patents


Info

Publication number
CN112685592B
CN112685592B
Authority
CN
China
Prior art keywords
audio
node
rhythm
node sequence
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011552969.1A
Other languages
Chinese (zh)
Other versions
CN112685592A (en
Inventor
胡晨鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee
Shanghai Zhangmen Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhangmen Science and Technology Co Ltd filed Critical Shanghai Zhangmen Science and Technology Co Ltd
Priority to CN202011552969.1A priority Critical patent/CN112685592B/en
Publication of CN112685592A publication Critical patent/CN112685592A/en
Application granted granted Critical
Publication of CN112685592B publication Critical patent/CN112685592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for generating a soundtrack for a motion video, relating to the technical fields of video processing and cloud computing. A specific embodiment comprises the following steps: acquiring an action rhythm node sequence corresponding to a motion video; searching, among one or more audio rhythm node sequences corresponding to an audio set, for at least one audio rhythm node sequence matching the action rhythm node sequence, wherein the audio set comprises one or more audio units; and looking up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence, and using that audio unit as the soundtrack audio unit of the motion video. By matching action rhythm nodes against audio rhythm nodes, the method scores motion videos intelligently and automatically, which can effectively improve the accuracy of the soundtrack.

Description

Method and device for generating sports video soundtrack
Technical Field
The embodiments of the application relate to the field of computer technology, in particular to the technical fields of video processing and cloud computing, and more particularly to a method and a device for generating a soundtrack for a motion video.
Background
With the rise of video forms such as live streaming and short video, video has become an important area of internet services.
In the related art, a video may be scored with music to enhance its appeal and make it more vivid and engaging. Users often choose favorite music as background music to set the atmosphere of a video. For example, for a light-hearted video, the user making it may select light, happy music as the soundtrack.
Disclosure of Invention
Provided are a method, an apparatus, an electronic device, and a storage medium for generating a soundtrack for a motion video.
According to a first aspect, there is provided a method of generating a sports video soundtrack, comprising: acquiring an action rhythm node sequence corresponding to a motion video, wherein the action rhythm node sequence is a sequence of time nodes obtained by performing action rhythm recognition on body key point information of a moving subject in the motion video; searching, among one or more audio rhythm node sequences corresponding to an audio set, for at least one audio rhythm node sequence matching the action rhythm node sequence, wherein the audio set comprises one or more audio units; and looking up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence, and using that audio unit as the soundtrack audio unit of the motion video.
According to a second aspect, there is provided an apparatus for generating a sports video soundtrack, comprising: an acquisition unit configured to acquire an action rhythm node sequence corresponding to a motion video, wherein the action rhythm node sequence is a sequence of time nodes obtained by performing action rhythm recognition on body key point information of a moving subject in the motion video; a matching unit configured to search, among one or more audio rhythm node sequences corresponding to an audio set, for at least one audio rhythm node sequence matching the action rhythm node sequence, wherein the audio set comprises one or more audio units; and a lookup unit configured to look up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence, as the soundtrack audio unit of the motion video.
According to a third aspect, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any of the embodiments of the method of generating a sports video soundtrack.
According to a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to any of the embodiments of the method of generating a sports video soundtrack.
According to the scheme of the application, a motion video can be scored intelligently and automatically through action rhythm nodes and audio rhythm nodes, which can effectively improve the accuracy of the soundtrack.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of generating a sports video soundtrack according to the present application;
FIG. 3 is a schematic illustration of an application scenario of a method of generating a sports video soundtrack according to the present application;
FIG. 4 is a flow chart of yet another embodiment of a method of generating a sports video soundtrack according to the present application;
FIG. 5 is a schematic structural view of one embodiment of an apparatus for generating a sports video soundtrack according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method of generating sports video soundtracks in accordance with an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of a method of generating a sports video soundtrack or an apparatus for generating a sports video soundtrack of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as video-type applications, live applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, e-book readers, laptop computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or multiple software modules (e.g., for providing distributed services) or as a single piece of software or a single software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server may analyze and process the data of the sports video and the like, and feed back the processing result (for example, a music audio unit of the sports video) to the terminal device.
It should be noted that, the method for generating a sports video score provided in the embodiment of the present application may be executed by the server 105 or the terminal devices 101, 102, 103, and accordingly, the apparatus for generating a sports video score may be provided in the server 105 or the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method of generating a sports video soundtrack according to the present application is shown. The method is used for a server and may comprise the following steps:
step 201, obtaining an action rhythm node sequence corresponding to a motion video, wherein the action rhythm node sequence is a time node sequence obtained by performing action rhythm recognition on body key point information of a motion main body in the motion video.
In this embodiment, the execution body on which the method of generating a sports video soundtrack runs (for example, the server or a terminal device shown in fig. 1) may acquire the action rhythm node sequence corresponding to the motion video. In practice, the execution body may obtain the sequence as determined by another electronic device (such as a terminal), or may determine the action rhythm node sequence of the motion video on the device itself. A motion video in this application is a video showing the motion of a moving subject, which may be a person, an animal, or the like. The body key point information is information reflecting the body key points of the moving subject, for example the positions of the key points and/or the lines connecting those positions.
The execution body or the other electronic device can perform action rhythm recognition on the body key point information of the moving subject in the motion video (for example, recognizing significant changes of actions), thereby obtaining a sequence of time nodes, which serves as the action rhythm node sequence.
In practice, rhythm is a regular alternation that accompanies movement in natural, social, and human activity. Various changing factors, organized through repetition and correspondence, form a continuous, ordered whole, namely a rhythm. Rhythm is an important means of expression, and it is not limited to sound: the movement of objects and the movement of emotion also form rhythms.
Specifically, the time value corresponding to each action of the moving subject can be used as a time node, and the time nodes corresponding to the respective actions form the time node sequence.
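As a hedged illustration of this step (a sketch, not part of the claims), the action rhythm nodes could be derived from per-frame key point positions as follows; the per-frame keypoint representation, the mean-displacement metric, and the threshold value are all assumptions introduced for the example:

```python
import math

def action_rhythm_nodes(keypoints_per_frame, fps, change_threshold):
    """Mark a time node whenever the mean keypoint displacement between
    consecutive frames exceeds a threshold (an assumed notion of a
    'significant change of action'). Returns timestamps in seconds."""
    nodes = []
    for i in range(1, len(keypoints_per_frame)):
        prev, curr = keypoints_per_frame[i - 1], keypoints_per_frame[i]
        # mean Euclidean displacement of all keypoints between two frames
        displacement = sum(math.dist(p, c) for p, c in zip(prev, curr)) / len(prev)
        if displacement > change_threshold:
            nodes.append(i / fps)
    return nodes
```

Each returned timestamp is one action rhythm node; combined in order, they form the time node sequence described above.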
Step 202: search, among one or more audio rhythm node sequences corresponding to an audio set, for at least one audio rhythm node sequence matching the action rhythm node sequence, where the audio set comprises one or more audio units.
In this embodiment, the execution body may search, among the audio rhythm node sequences corresponding to the audio set, for audio rhythm node sequences matching the action rhythm node sequence. The audio set includes one or more audio units, and each audio unit may correspond to one audio rhythm node sequence; accordingly, the audio set may correspond to one or more audio rhythm node sequences. It should be noted that "a plurality" in this application means two or more.
In particular, the audio rhythm node sequence may have the same representation as the action rhythm node sequence: both consist of time nodes. For example, a time node may be recorded whenever the audio changes significantly.
In practice, an action rhythm node sequence matching an audio rhythm node sequence may mean that the time nodes of the two are the same or similar. "Similar" here may mean that the similarity exceeds a preset threshold, or that the similarity ranks within a preset number of places when the similarities between the action rhythm node sequence and the respective audio rhythm node sequences in the audio set are sorted in descending order.
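As a minimal sketch of the "same or similar time nodes" criterion (the metric itself is an assumption; the patent does not prescribe one), a similarity between two node sequences could be computed as the fraction of action nodes that land near some audio node:

```python
def node_similarity(action_seq, audio_seq, tolerance=0.1):
    """Fraction of action rhythm nodes that have an audio rhythm node
    within `tolerance` seconds. Both the fraction-based score and the
    tolerance value are illustrative assumptions."""
    if not action_seq:
        return 0.0
    hits = sum(1 for t in action_seq
               if any(abs(t - u) <= tolerance for u in audio_seq))
    return hits / len(action_seq)
```

A sequence would then count as matching when this score exceeds the preset threshold, or when it ranks within the preset number of top scores.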
Step 203: look up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence, and use it as the soundtrack audio unit of the motion video.
In this embodiment, the execution body may look up, in the index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence, and use the found audio unit as the soundtrack audio unit of the motion video. The index may be obtained in advance or in real time, from this device or from another electronic device.
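One way to picture the index is a mapping from each audio rhythm node sequence back to its audio unit; the dictionary layout and the unit identifiers such as `song_a` below are assumptions for illustration, not specified by the patent:

```python
def build_index(audio_units):
    """Build an index from audio rhythm node sequence -> audio unit id.
    `audio_units` maps each unit id to its rhythm node sequence."""
    return {tuple(rhythm_nodes): unit_id
            for unit_id, rhythm_nodes in audio_units.items()}

def lookup_units(index, matched_sequences):
    """Resolve matched audio rhythm node sequences back to audio units."""
    return [index[tuple(seq)] for seq in matched_sequences
            if tuple(seq) in index]
```

The matched sequences from step 202 are resolved through this index, and the resulting units serve as candidate soundtrack audio units.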
The method provided by this embodiment scores a motion video intelligently and automatically through action rhythm nodes and audio rhythm nodes, which can effectively improve the accuracy of the soundtrack.
In some optional implementations of this embodiment, the audio rhythm node sequence is a time node sequence indicating rhythm changes, where the rhythm changes include amplitude changes and/or frequency changes. Generating an audio rhythm node sequence may include: acquiring an audio unit in the audio set, and identifying, as rhythm change nodes, the time nodes in the audio unit at which the rhythm change reaches a change threshold; and combining the rhythm change nodes in order, using the combined result as the audio rhythm node sequence of the audio unit.
In these alternative implementations, the audio rhythm node sequence is a time node sequence indicating rhythm changes. A rhythm change here may include an amplitude change and/or a frequency change. In practice, an amplitude change is a change in the volume of the audio, and a frequency change is a change in its beat.
A rhythm change reaching the change threshold may mean that the rhythm has changed significantly, i.e. the magnitude of the change reaches a preset magnitude threshold, or the speed of the change exceeds a preset speed threshold.
These implementations can accurately generate the audio rhythm node sequence by identifying rhythm changes.
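A hedged sketch of the amplitude side of this step follows; it assumes the amplitude envelope (e.g. per-window RMS values) has already been extracted, and the function name, the envelope representation and the threshold semantics are all assumptions:

```python
def audio_rhythm_nodes(amplitudes, hop_seconds, amp_change_threshold):
    """Mark a time node wherever the amplitude jump between adjacent
    analysis windows reaches the change threshold. `amplitudes` is an
    assumed precomputed envelope; `hop_seconds` is the window spacing."""
    nodes = []
    for i in range(1, len(amplitudes)):
        if abs(amplitudes[i] - amplitudes[i - 1]) >= amp_change_threshold:
            nodes.append(i * hop_seconds)  # timestamp of the change
    return nodes
```

Frequency change nodes could be obtained analogously from a beat or pitch track; in production, a dedicated onset-detection routine from an audio library would typically replace this sketch.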
In some optional implementations of this embodiment, the method may further include: and establishing indexes for the audio units and the audio rhythm node sequences in the audio set to obtain indexes for representing the corresponding relation between the audio units and the audio rhythm node sequences.
Specifically, the execution body may set up an index for the audio unit and the audio rhythm node sequence in the audio set, where the index may represent a correspondence between the two.
These implementations build an index characterizing the correspondence between the audio units and the audio rhythm node sequences, so that an audio unit can be found from its audio rhythm node sequence.
Optionally, the change threshold includes an amplitude threshold and/or a frequency threshold, and the rhythm change nodes include amplitude change nodes and/or frequency change nodes. Identifying, as rhythm change nodes, the time nodes in the audio unit at which the rhythm change reaches the change threshold includes: determining a time node in the audio unit at which the amplitude change value reaches the amplitude change threshold, as an amplitude change node; and/or determining a time node in the audio unit at which the frequency change value reaches the frequency change threshold, as a frequency change node.
Specifically, the execution body may determine a time node at which the amplitude change value reaches the amplitude change threshold as an amplitude change node, and/or a time node at which the frequency change value reaches the frequency change threshold as a frequency change node. The amplitude change value may refer to the magnitude or the speed of the amplitude change, and likewise for the frequency change value.
These alternative implementations use amplitude change nodes and frequency change nodes to capture the rhythm changes of the audio from all aspects.
In some implementations, when the rhythm change nodes comprise both amplitude change nodes and frequency change nodes, the audio rhythm node sequence is the union of the amplitude change nodes and the frequency change nodes.
Specifically, the audio rhythm node sequence may include not only the amplitude change nodes but also the frequency change nodes, i.e. the union of both.
Through the union of the two, these implementations can capture the various rhythm changes of the audio in full.
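The union described above can be sketched in one line; sorting by time is an assumption added so that the result is again an ordered node sequence:

```python
def merge_rhythm_nodes(amplitude_nodes, frequency_nodes):
    """Audio rhythm node sequence as the union of amplitude change nodes
    and frequency change nodes, ordered by time. Nodes present in both
    sets are kept once."""
    return sorted(set(amplitude_nodes) | set(frequency_nodes))
```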
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method of generating a sports video soundtrack according to the present embodiment. The figure shows the motion body poses corresponding to three key time nodes.
In some optional implementations of this embodiment, generating the body key point information may include: performing image segmentation on the video frames of the motion video using a deep neural network model to obtain the region where the moving subject is located; and detecting key points in that region, connecting the detected key points with line segments, and using the result of the connection as the body key point information, where each resulting line segment indicates the body part between the key points it connects.
In these alternative implementations, the execution body may use the deep neural network model to segment the video frames of the motion video, separating the region where the moving subject is located from the other regions. The deep neural network model here may be any of various pre-trained models usable for image segmentation, such as a convolutional neural network or a residual neural network.
The execution body can detect key points in the region where the moving subject is located and connect the detected key points with line segments; the result of this connection is the body key point information. Each resulting line segment indicates the body part between the key points, e.g. a line between the knee key point and the thigh key point may indicate that the body part is the thigh.
These implementations can accurately determine body keypoint information through image segmentation and keypoint detection.
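A sketch of the connection step follows; the key point names and the skeleton edge table are hypothetical (real systems use the keypoint sets of their pose estimator), and only the detection output is assumed given:

```python
# Hypothetical skeleton: which keypoint pairs form which body part.
SKELETON_EDGES = {
    ("hip", "knee"): "thigh",
    ("knee", "ankle"): "lower_leg",
    ("shoulder", "elbow"): "upper_arm",
}

def connect_keypoints(detected):
    """Join detected keypoints with line segments; each segment is
    labelled with the body part it indicates. `detected` maps keypoint
    names to (x, y) positions from an assumed upstream detector."""
    segments = []
    for (a, b), part in SKELETON_EDGES.items():
        if a in detected and b in detected:
            segments.append({"part": part, "from": detected[a], "to": detected[b]})
    return segments
```

The list of labelled segments, together with the keypoint positions, then plays the role of the body key point information above.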
Optionally, the region where the moving subject is located comprises a joint sub-region and a torso sub-region. Performing action rhythm recognition on the body key point information of the moving subject in the motion video to obtain the time node sequence may include: for a video frame of the motion video, performing pose detection on the joint sub-region and the torso sub-region in the frame, and, in response to detecting a significant pose change in the joint sub-region and/or the torso sub-region, using the time node of the frame as a key time node; and combining the key time nodes of the video frames of the motion video in order, using the combined result as the time node sequence.
In these alternative implementations, the execution body may perform pose detection on the joint sub-region and the torso sub-region in a video frame (for example, in each video frame) of the motion video. When a significant pose change of the joint sub-region, of the torso sub-region, or of both is detected, the time node of that frame is used as a key time node. In practice, a significant pose change may mean that the magnitude of the pose change exceeds a preset magnitude threshold, or that the speed of the pose change exceeds a preset speed threshold.
By performing pose detection on the joint sub-region and the torso sub-region, these implementations closely monitor the joints and the torso of the moving subject and obtain an accurate time node sequence.
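As a hedged sketch of this selection rule, assume each frame has already been reduced to per-region pose-change magnitudes (the dictionary keys and the threshold semantics are assumptions):

```python
def key_time_nodes(frames, fps, pose_change_threshold):
    """A frame becomes a key time node when the pose change in its joint
    sub-region and/or torso sub-region exceeds the threshold. `frames`
    is an assumed list of {"joint": magnitude, "torso": magnitude}."""
    nodes = []
    for i, regions in enumerate(frames):
        if (regions.get("joint", 0.0) > pose_change_threshold
                or regions.get("torso", 0.0) > pose_change_threshold):
            nodes.append(i / fps)  # frame index converted to seconds
    return nodes
```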
In some optional implementations of this embodiment, the method may further include: for an audio rhythm node sequence in the at least one audio rhythm node sequence, in response to the degree of difference between the audio rhythm node sequence and the action rhythm node sequence being greater than a preset threshold, determining an audio rhythm node sequence segment within the audio rhythm node sequence that matches the action rhythm node sequence and has equal duration; and correcting and updating the soundtrack audio unit corresponding to the audio rhythm node sequence according to the segment, to obtain a corrected and updated soundtrack audio unit.
In these optional implementations, for an audio rhythm node sequence (for example, each one) in the at least one audio rhythm node sequence, when the degree of difference between it and the action rhythm node sequence is greater than a preset threshold, the execution body determines an audio rhythm node sequence segment within it that matches the action rhythm node sequence and has equal duration. The segment indicates a position within the audio rhythm node sequence. The degree of difference here is the opposite of similarity: the larger the difference, the smaller the similarity.
The execution body can then correct and update the soundtrack audio unit corresponding to the audio rhythm node sequence according to the segment, to obtain a corrected and updated soundtrack audio unit. In practice, the correction can be performed in various ways: for example, locating the segment's first and last time nodes within the audio rhythm node sequence and intercepting the audio between those two time nodes as the corrected and updated soundtrack audio unit.
These implementations update the soundtrack audio unit by finding the segment of the audio rhythm node sequence that best matches the action rhythm node sequence, so as to obtain an accurate soundtrack audio unit.
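One possible realization of "a matching segment of equal duration" is a sliding-window search over the audio rhythm node sequence; the window scoring, the tolerance, and the choice to anchor windows at existing nodes are all assumptions of this sketch, not prescribed by the patent:

```python
def correct_score_audio(audio_nodes, action_duration, action_nodes, tolerance=0.1):
    """Slide a window of the action sequence's duration over the audio
    rhythm node sequence, score each window by how many action nodes it
    matches, and return the (start, end) span in seconds to intercept
    from the soundtrack audio unit."""
    best_span, best_score = (0.0, action_duration), -1
    for start in audio_nodes:
        end = start + action_duration
        # audio nodes inside the window, shifted to window-relative time
        window = [t - start for t in audio_nodes if start <= t <= end]
        score = sum(1 for t in action_nodes
                    if any(abs(t - u) <= tolerance for u in window))
        if score > best_score:
            best_score, best_span = score, (start, end)
    return best_span
```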
Optionally, correcting and updating the soundtrack audio unit corresponding to the audio rhythm node sequence according to the audio rhythm node sequence segment to obtain a corrected and updated soundtrack audio unit may include: for the soundtrack audio unit corresponding to the audio rhythm node sequence, intercepting the audio fragment corresponding to the segment within the soundtrack audio unit; and using the intercepted audio fragment as the corrected and updated soundtrack audio unit.
These alternative implementations directly intercept the audio corresponding to the segment, so as to obtain the most accurate soundtrack audio unit.
In some optional implementations of this embodiment, the method may further include: fusing the motion video and the soundtrack audio unit to obtain a scored motion video; or sending the soundtrack audio unit to a terminal so that the terminal fuses the motion video and the soundtrack audio unit to obtain a scored motion video.
In these alternative implementations, the execution body may fuse the motion video and the soundtrack audio unit on the present device, or may send the soundtrack audio unit to another electronic device, such as a terminal, so that the other device performs the fusion.
In this way, the scored motion video can be obtained either on the present device or by another electronic device.
In some optional implementations of any of the foregoing embodiments, searching among the one or more audio rhythm node sequences corresponding to the audio set for at least one audio rhythm node sequence matching the action rhythm node sequence may include: determining the similarity between each of the one or more audio rhythm node sequences corresponding to the audio set and the action rhythm node sequence, as node similarity; and searching among those audio rhythm node sequences, in descending order of node similarity, for at least one audio rhythm node sequence matching the action rhythm node sequence.
In this embodiment, the execution body may determine the similarity between each of the one or more audio rhythm node sequences corresponding to the audio set and the action rhythm node sequence, and use it as the node similarity.
In practice, the execution body may determine the similarity in various ways, for example via the Hamming distance or the Euclidean distance.
The execution body can then search among the audio rhythm node sequences corresponding to the audio set, in descending order of node similarity, for the audio rhythm node sequences matching the action rhythm node sequence.
In particular, a preset number of audio rhythm node sequences may be retrieved. The execution body may sort the node similarities and then search through the sorted sequence, thereby retrieving at least one audio rhythm node sequence in descending order of node similarity.
In this way, rhythm can be quantified through node similarity, improving the accuracy of the audio search.
In some optional implementations of this embodiment, determining the similarity between each of the one or more audio rhythm node sequences corresponding to the audio set and the action rhythm node sequence, as the node similarity, may include: determining the similarity between the time node sequence indicating amplitude changes in the audio rhythm node sequence and the action rhythm node sequence, as amplitude node similarity; determining the similarity between the time node sequence indicating frequency changes in the audio rhythm node sequence and the action rhythm node sequence, as frequency node similarity; determining a weighted average of the amplitude node similarity and the frequency node similarity; and determining the node similarity between the audio rhythm node sequence and the action rhythm node sequence according to the weighted average.
In these alternative implementations, the execution body may determine the similarity between the time node sequence indicating amplitude changes in the audio rhythm node sequence and the action rhythm node sequence, as the amplitude node similarity, and the similarity between the time node sequence indicating frequency changes and the action rhythm node sequence, as the frequency node similarity. The execution body may then compute a weighted average of the two, using the weights set for the amplitude node similarity and the frequency node similarity respectively.
The execution body can determine the node similarity between the audio rhythm node sequence and the action rhythm node sequence from the weighted average in various ways. For example, it may use the weighted average directly as the node similarity, or apply a specified processing to the weighted average and use the result as the node similarity, where the specified processing may be multiplying by a specified coefficient or feeding the value into a specified model.
By setting respective weights for amplitude and frequency and weighting the corresponding similarities, these implementations can determine the node similarity accurately.
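The weighted-average step reduces to a short formula; the equal default weights below are illustrative, since the patent leaves the weights as configuration:

```python
def weighted_node_similarity(amp_sim, freq_sim, amp_weight=0.5, freq_weight=0.5):
    """Node similarity as the weighted average of the amplitude-node and
    frequency-node similarities. The 0.5/0.5 defaults are assumptions."""
    return (amp_weight * amp_sim + freq_weight * freq_sim) / (amp_weight + freq_weight)
```

The result can be used directly as the node similarity, or post-processed (e.g. scaled by a coefficient) as described above.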
With further reference to fig. 4, a flow 400 of yet another embodiment of a method of generating a sports video soundtrack is shown. The process 400 is used for a terminal and may include the following steps:
step 401, performing action rhythm recognition on body key point information of a moving body in the motion video to obtain a time node sequence, wherein the time node sequence is used as an action rhythm node sequence corresponding to the motion video; step 402, the action rhythm node sequence is sent to a server, where the server searches at least one audio rhythm node sequence matched with the action rhythm node sequence in one or more audio rhythm node sequences corresponding to an audio set, searches an audio unit corresponding to the at least one audio rhythm node sequence in an index representing a correspondence between the audio unit and the audio rhythm node sequence, and uses the audio unit as a music audio unit of the motion video, and the audio set includes one or more audio units.
In some optional implementations of this embodiment, the audio rhythm node sequence is a time node sequence indicating rhythm changes, where the rhythm changes include amplitude changes and/or frequency changes. The method further comprises: acquiring an audio unit in the audio set, and identifying, in the audio unit, the time nodes at which the rhythm change reaches a change threshold as rhythm change nodes; and combining the rhythm change nodes in chronological order, taking the combination result as the audio rhythm node sequence of the audio unit.
In some optional implementations of this embodiment, when the rhythm change nodes include both amplitude change nodes and frequency change nodes, the audio rhythm node sequence is the union of the amplitude change nodes and the frequency change nodes.
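As a concrete illustration of threshold-based node extraction and the union of the two node types, consider the following Python sketch. It is an assumption-laden toy: the per-frame amplitude/frequency tracks and both thresholds are hypothetical, and a real system would operate on analyzed audio features.

```python
def rhythm_change_nodes(amplitudes, frequencies, times,
                        amp_threshold=0.3, freq_threshold=50.0):
    """Identify time nodes whose frame-to-frame amplitude (or frequency)
    change reaches its threshold, then return the sorted union of the
    amplitude change nodes and the frequency change nodes."""
    amp_nodes = {times[i] for i in range(1, len(amplitudes))
                 if abs(amplitudes[i] - amplitudes[i - 1]) >= amp_threshold}
    freq_nodes = {times[i] for i in range(1, len(frequencies))
                  if abs(frequencies[i] - frequencies[i - 1]) >= freq_threshold}
    return sorted(amp_nodes | freq_nodes)  # union, kept in time order

# Toy tracks sampled every 0.5 s
nodes = rhythm_change_nodes(
    amplitudes=[0.1, 0.5, 0.55, 0.2],
    frequencies=[200.0, 210.0, 300.0, 305.0],
    times=[0.0, 0.5, 1.0, 1.5])
print(nodes)  # amplitude nodes at 0.5 and 1.5, frequency node at 1.0
```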
In some optional implementations of this embodiment, the method further includes: performing image segmentation on the video frames of the motion video by using a deep neural network model to obtain the area where the moving body is located; and detecting key points in the area where the moving body is located, connecting the detected key points with line segments, and taking the result of the line segment connection as the body key point information, where the line segments indicate the body parts between the key points.
In some optional implementations of this embodiment, performing action rhythm recognition on the body key point information of the moving body in the motion video to obtain the time node sequence includes: for a video frame of the motion video, performing posture detection on the joint sub-region and the torso sub-region in the video frame, and in response to detecting a significant posture change in the joint sub-region and/or the torso sub-region, taking the time node of that video frame as a key time node; and combining the key time nodes of the video frames of the motion video in chronological order, taking the combined result as the time node sequence.
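One simple way to read "significant posture change" is a threshold on the total key-point displacement between consecutive frames. The sketch below assumes per-frame 2-D key points; the displacement metric and threshold are illustrative, not the patent's method.

```python
import math

def action_rhythm_nodes(keypoint_frames, timestamps, change_threshold=20.0):
    """keypoint_frames: one list of (x, y) body key points per video frame.
    A frame whose total key-point displacement from the previous frame
    reaches the threshold is treated as a significant posture change,
    and its timestamp becomes a key time node."""
    nodes = []
    for i in range(1, len(keypoint_frames)):
        displacement = sum(
            math.dist(prev, cur)
            for prev, cur in zip(keypoint_frames[i - 1], keypoint_frames[i]))
        if displacement >= change_threshold:
            nodes.append(timestamps[i])
    return nodes  # already in chronological order

frames = [[(0, 0), (1, 1)],    # frame 0
          [(0, 0), (1, 1)],    # frame 1: no movement
          [(10, 0), (1, 21)]]  # frame 2: large movement -> key node
print(action_rhythm_nodes(frames, [0.0, 1.0, 2.0]))  # [2.0]
```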
In some optional implementations of this embodiment, the method further includes: acquiring an initial motion video to be detected, and performing action detection on each video frame in the initial motion video to detect whether the effective video duration that includes actions reaches a first preset duration; and if the detection result indicates that the first preset duration is reached, taking the initial motion video as the motion video.
In these alternative implementations, when the initial motion video is an already recorded video, the executing body may perform action detection on the video to determine whether the duration that includes actions is long enough, that is, whether it reaches the first preset duration. If the first preset duration is reached, the executing body may take the initial motion video as the motion video. Specifically, the video duration that includes actions may serve as the effective video duration.
These implementations can screen motion videos so that more effective and more accurate motion videos are obtained.
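The screening rule above amounts to comparing the action-containing duration against the first preset duration. A minimal sketch, assuming a per-frame boolean action-detection result and a known frame rate (both hypothetical inputs):

```python
def screen_motion_video(frame_has_motion, fps, first_preset_seconds):
    """frame_has_motion: per-frame booleans from an action detector.
    Returns True when the effective (action-containing) duration reaches
    the first preset duration, i.e. the video passes screening."""
    effective_seconds = sum(frame_has_motion) / fps
    return effective_seconds >= first_preset_seconds

# 60 motion frames at 30 fps = 2 s of effective video
print(screen_motion_video([True] * 60 + [False] * 30, fps=30,
                          first_preset_seconds=2))  # True
```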
In some optional implementations of this embodiment, the method further includes: collecting moving images of the moving body in real time; performing target recognition on the collected moving images to monitor the duration over which no action is recognized in consecutive moving images; and outputting a reminder message in response to the duration reaching a second preset duration.
In some optional implementations of this embodiment, the method further includes: collecting moving images of the moving body in real time; performing action detection on the collected moving images to monitor the duration over which no action is recognized in consecutive moving images; and outputting a reminder message in response to the duration reaching a second preset duration.
In these alternative implementations, when moving images are collected in real time to generate the motion video in real time, the executing body may perform action detection on the collected moving images to detect the duration of invalid video. If the duration reaches the second preset duration, a reminder message may be output. The reminder message can prompt the user of the terminal to make an adjustment, so that the terminal can collect valid video.
These implementations can monitor the duration of invalid video while the motion video is collected in real time, thereby ensuring the validity of the video collected in real time.
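Monitoring the no-action duration can be approximated by counting the trailing run of frames in which no action was recognized. The sketch below is illustrative only; the per-frame flags and frame rate are assumed inputs, not part of the patent's specification.

```python
def should_remind(frame_action_flags, fps, second_preset_seconds):
    """frame_action_flags: chronological per-frame booleans, True when an
    action was recognized in that frame. Count the trailing run of frames
    with no recognized action and fire a reminder once that run reaches
    the second preset duration."""
    no_action_run = 0
    for flag in reversed(frame_action_flags):
        if flag:
            break
        no_action_run += 1
    return no_action_run / fps >= second_preset_seconds

# 4 trailing no-action frames at 2 fps = 2 s without recognized action
print(should_remind([True, False, False, False, False], fps=2,
                    second_preset_seconds=2))  # True
```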
In some optional implementations of this embodiment, the method further includes: receiving the soundtrack audio unit from the server and fusing the sports video with the soundtrack audio unit to obtain a soundtracked sports video; or receiving a soundtracked sports video from the server, where the soundtracked sports video is obtained by the server fusing the sports video and the soundtrack audio unit.
In these alternative implementations, the executing body may fuse the sports video and the soundtrack audio unit locally, or directly receive the fusion result from the server, thereby adding the soundtrack to the sports video.
With further reference to fig. 5, as an implementation of the methods shown in the foregoing figures, the present application provides an embodiment of a device for generating a sports video soundtrack. This device embodiment corresponds to the method embodiment shown in fig. 2 and, in addition to the features described below, may include the same or corresponding features and effects as that method embodiment. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for generating a sports video soundtrack of the present embodiment includes: an acquisition unit 501, a search unit 502, and a lookup unit 503. The acquisition unit 501 is configured to acquire an action rhythm node sequence corresponding to a motion video, where the action rhythm node sequence is a time node sequence obtained by performing action rhythm recognition on body key point information of the moving body in the motion video. The search unit 502 is configured to search one or more audio rhythm node sequences corresponding to an audio set for at least one audio rhythm node sequence matching the action rhythm node sequence, where the audio set includes one or more audio units. The lookup unit 503 is configured to look up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence as the soundtrack audio unit of the sports video.
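The search-then-lookup pipeline of units 502 and 503 can be sketched end to end. This is a toy model under stated assumptions: the index is represented as a plain dict from a hypothetical audio unit id to its rhythm node sequence, and the similarity measure (reciprocal of one plus the mean absolute node offset) is a placeholder, not the patent's metric.

```python
def best_soundtrack(action_nodes, index, top_k=1):
    """index: dict mapping an audio unit id to its audio rhythm node
    sequence (the unit <-> sequence correspondence the index represents).
    Rank units by node similarity and return the best-matching unit ids."""
    def similarity(audio_nodes):
        n = min(len(audio_nodes), len(action_nodes))
        if n == 0:
            return 0.0
        mean_offset = sum(abs(a - b) for a, b in
                          zip(audio_nodes[:n], action_nodes[:n])) / n
        return 1.0 / (1.0 + mean_offset)  # larger offset -> lower similarity
    ranked = sorted(index, key=lambda uid: similarity(index[uid]), reverse=True)
    return ranked[:top_k]

index = {"track_a": [0.0, 1.0, 2.0],   # hypothetical audio units
         "track_b": [0.5, 1.6, 2.7]}
print(best_soundtrack([0.0, 1.0, 2.0], index))  # ['track_a']
```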
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for the method of generating a sports video soundtrack according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in fig. 6.
Memory 602 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of generating a sports video soundtrack provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of generating a sports video soundtrack provided herein.
The memory 602, as a non-transitory computer-readable storage medium, is used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method of generating a sports video soundtrack in the embodiments of the present application (e.g., the acquisition unit 501, the search unit 502, and the lookup unit 503 shown in fig. 5). The processor 601 executes various functional applications of the server and performs data processing by running the non-transitory software programs, instructions, and modules stored in the memory 602, thereby implementing the method of generating a sports video soundtrack in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created from the use of the electronic device generating the sports video soundtrack, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 602 may optionally include memory remotely located relative to processor 601, which may be connected to the electronic device generating the sports video soundtrack via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of generating a sports video soundtrack may further comprise: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or other input devices. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit, a search unit, and a lookup unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires an action rhythm node sequence corresponding to a motion video".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire an action rhythm node sequence corresponding to a motion video, where the action rhythm node sequence is a time node sequence obtained by performing action rhythm recognition on body key point information of the moving body in the motion video; search one or more audio rhythm node sequences corresponding to an audio set for at least one audio rhythm node sequence matching the action rhythm node sequence, where the audio set includes one or more audio units; and look up, in an index representing the correspondence between audio units and audio rhythm node sequences, the audio unit corresponding to the at least one audio rhythm node sequence as the soundtrack audio unit of the motion video.
The foregoing description covers only the preferred embodiments of the present application and explains the principles of the technology employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above; it is also intended to cover other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, embodiments formed by replacing the above features with technical features of similar functions disclosed in the present application (but not limited thereto).

Claims (19)

1. A method of generating a sports video soundtrack for a server side, the method comprising:
acquiring an action rhythm node sequence corresponding to a motion video, wherein the action rhythm node sequence is a time node sequence obtained by performing action rhythm recognition on body key point information of a moving body in the motion video;
searching at least one audio rhythm node sequence matched with the action rhythm node sequence in one or more audio rhythm node sequences corresponding to an audio set, wherein the audio set comprises one or more audio units;
searching, in an index representing the correspondence between audio units and audio rhythm node sequences, for an audio unit corresponding to the at least one audio rhythm node sequence as a soundtrack audio unit of the motion video;
the generating step of the audio rhythm node sequence comprises the following steps:
acquiring an audio unit in the audio set, and identifying, in the audio unit, time nodes at which the rhythm change reaches a change threshold as rhythm change nodes, wherein the change threshold comprises an amplitude threshold and/or a frequency threshold, and the rhythm change nodes comprise amplitude change nodes and/or frequency change nodes;
wherein the identifying, in the audio unit, of time nodes at which the rhythm change reaches a change threshold as rhythm change nodes comprises:
determining, in the audio unit, a time node whose amplitude change value reaches an amplitude change threshold as an amplitude change node; and/or determining, in the audio unit, a time node whose frequency change value reaches a frequency change threshold as a frequency change node;
the searching at least one audio rhythm node sequence matched with the action rhythm node sequence in one or more audio rhythm node sequences corresponding to the audio set comprises the following steps:
determining the similarity between each of the one or more audio rhythm node sequences corresponding to the audio set and the action rhythm node sequence as a node similarity, including: determining the similarity between the time node sequence indicating amplitude changes in the audio rhythm node sequence and the action rhythm node sequence as an amplitude node similarity; determining the similarity between the time node sequence indicating frequency changes in the audio rhythm node sequence and the action rhythm node sequence as a frequency node similarity; determining a weighted average of the amplitude node similarity and the frequency node similarity; and determining the node similarity between the audio rhythm node sequence and the action rhythm node sequence according to the weighted average.
2. The method according to claim 1, wherein the audio rhythm node sequence is a time node sequence indicating rhythm changes, the rhythm changes including amplitude changes and/or frequency changes; and the generating step of the audio rhythm node sequence further includes:
combining the rhythm change nodes in chronological order, and taking the combination result as the audio rhythm node sequence of the audio unit.
3. The method according to claim 1 or 2, wherein the method further comprises:
and establishing indexes for the audio units and the audio rhythm node sequences in the audio set to obtain indexes for representing the corresponding relation between the audio units and the audio rhythm node sequences.
4. The method of claim 1, wherein, in response to the rhythm change nodes comprising the amplitude change nodes and the frequency change nodes, the audio rhythm node sequence is a union of the amplitude change nodes and the frequency change nodes.
5. The method according to claim 1 or 2, wherein the searching for at least one audio tempo node sequence matching the action tempo node sequence in one or more audio tempo node sequences corresponding to an audio set further comprises:
searching, in the one or more audio rhythm node sequences corresponding to the audio set, for at least one audio rhythm node sequence matching the action rhythm node sequence in descending order of the node similarity corresponding to each audio rhythm node sequence.
6. The method of claim 1, wherein the generating of the body keypoint information comprises:
performing image segmentation on the video frames of the motion video by using a deep neural network model to obtain an area where the moving body is located;
and detecting key points in the area where the moving body is located, connecting the detected key points with line segments, and taking the result of the line segment connection as the body key point information, wherein the line segments indicate the body parts between the key points.
7. The method of claim 6, wherein the region in which the moving body is located comprises a joint sub-region and a torso sub-region;
the step of identifying the action rhythm of the body key point information of the moving body in the moving video to obtain a time node sequence comprises the following steps:
for a video frame of the motion video, performing posture detection on the joint sub-region and the torso sub-region in the video frame, and in response to detecting a significant posture change in the joint sub-region and/or the torso sub-region, taking the time node of the video frame as a key time node;
and combining the key time nodes of the video frames of the motion video in chronological order, and taking the combined result as the time node sequence.
8. The method according to claim 1 or 2, wherein the method further comprises:
for an audio rhythm node sequence in the at least one audio rhythm node sequence, in response to the degree of difference between the audio rhythm node sequence and the action rhythm node sequence being larger than a preset threshold, determining, in the audio rhythm node sequence, an audio rhythm node sequence segment that matches the action rhythm node sequence and has an equal duration;
and correcting and updating, according to the audio rhythm node sequence segment, the soundtrack audio unit corresponding to the audio rhythm node sequence, to obtain a corrected and updated soundtrack audio unit.
9. The method of claim 8, wherein the correcting and updating, according to the audio rhythm node sequence segment, of the soundtrack audio unit corresponding to the audio rhythm node sequence to obtain a corrected and updated soundtrack audio unit includes:
for the soundtrack audio unit corresponding to the audio rhythm node sequence, extracting, from the soundtrack audio unit, the audio segment corresponding to the audio rhythm node sequence segment;
and taking the extracted audio segment as the corrected and updated soundtrack audio unit.
10. The method of claim 1, wherein the method further comprises:
fusing the sports video and the soundtrack audio unit to obtain a soundtracked sports video; or
sending the soundtrack audio unit to a terminal, so that the terminal fuses the sports video and the soundtrack audio unit to obtain a soundtracked sports video.
11. A method of generating a sports video soundtrack for a terminal, the method comprising:
performing action rhythm recognition on body key point information of a moving body in the motion video to obtain a time node sequence, wherein the time node sequence is used as an action rhythm node sequence corresponding to the motion video;
sending the action rhythm node sequence to a server, wherein the server searches one or more audio rhythm node sequences corresponding to an audio set for at least one audio rhythm node sequence matching the action rhythm node sequence, and looks up, in an index representing the correspondence between audio units and audio rhythm node sequences, an audio unit corresponding to the at least one audio rhythm node sequence as a soundtrack audio unit of the motion video, the audio set including one or more audio units;
the generating step of the audio rhythm node sequence comprises: acquiring an audio unit in the audio set, and identifying, in the audio unit, time nodes at which the rhythm change reaches a change threshold as rhythm change nodes, wherein the change threshold comprises an amplitude threshold and/or a frequency threshold, and the rhythm change nodes comprise amplitude change nodes and/or frequency change nodes;
wherein the identifying, in the audio unit, of time nodes at which the rhythm change reaches a change threshold as rhythm change nodes comprises: determining, in the audio unit, a time node whose amplitude change value reaches an amplitude change threshold as an amplitude change node; and/or determining, in the audio unit, a time node whose frequency change value reaches a frequency change threshold as a frequency change node;
the searching at least one audio rhythm node sequence matched with the action rhythm node sequence in one or more audio rhythm node sequences corresponding to the audio set comprises the following steps:
determining the similarity between each of the one or more audio rhythm node sequences corresponding to the audio set and the action rhythm node sequence as a node similarity, including: determining the similarity between the time node sequence indicating amplitude changes in the audio rhythm node sequence and the action rhythm node sequence as an amplitude node similarity; determining the similarity between the time node sequence indicating frequency changes in the audio rhythm node sequence and the action rhythm node sequence as a frequency node similarity; determining a weighted average of the amplitude node similarity and the frequency node similarity; and determining the node similarity between the audio rhythm node sequence and the action rhythm node sequence according to the weighted average.
12. The method of claim 11, wherein,
in response to the rhythm change nodes comprising the amplitude change nodes and the frequency change nodes, the audio rhythm node sequence is a union of the amplitude change nodes and the frequency change nodes.
13. The method according to one of claims 11-12, wherein the method further comprises:
performing image segmentation on the video frames of the motion video by using a deep neural network model to obtain an area where the moving body is located;
and detecting key points in the area where the moving body is located, connecting the detected key points with line segments, and taking the result of the line segment connection as the body key point information, wherein the line segments indicate the body parts between the key points.
14. The method of claim 13, wherein the identifying the action rhythm of the body key point information of the moving body in the motion video to obtain the time node sequence includes:
for a video frame of the motion video, performing posture detection on the joint sub-region and the torso sub-region in the video frame, and in response to detecting a significant posture change in the joint sub-region and/or the torso sub-region, taking the time node of the video frame as a key time node;
and combining the key time nodes of the video frames of the motion video in chronological order, and taking the combined result as the time node sequence.
15. The method of claim 11, wherein the method further comprises:
acquiring an initial motion video to be detected, and performing action detection on each video frame in the initial motion video to detect whether the effective video duration that includes actions reaches a first preset duration;
and if the detection result indicates that the first preset duration is reached, taking the initial motion video as the motion video.
16. The method of claim 11, wherein the method further comprises:
collecting moving images of the moving body in real time;
performing action detection on the collected moving images to monitor the duration over which no action is recognized in consecutive moving images;
and outputting a reminding message in response to the duration reaching a second preset duration.
17. The method of claim 11, wherein the method further comprises:
receiving the soundtrack audio unit from the server, and fusing the sports video with the soundtrack audio unit to obtain a soundtracked sports video; or
receiving a soundtracked sports video from the server, wherein the soundtracked sports video is obtained by the server fusing the sports video and the soundtrack audio unit.
18. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
19. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-10.
CN202011552969.1A 2020-12-24 2020-12-24 Method and device for generating sports video soundtrack Active CN112685592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011552969.1A CN112685592B (en) 2020-12-24 2020-12-24 Method and device for generating sports video soundtrack

Publications (2)

Publication Number Publication Date
CN112685592A CN112685592A (en) 2021-04-20
CN112685592B true CN112685592B (en) 2023-05-26

Family

ID=75452914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011552969.1A Active CN112685592B (en) 2020-12-24 2020-12-24 Method and device for generating sports video soundtrack

Country Status (1)

Country Link
CN (1) CN112685592B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222160A (en) * 2021-12-15 2022-03-22 深圳市前海手绘科技文化有限公司 Video scene page switching method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901595A (en) * 2010-05-05 2010-12-01 北京中星微电子有限公司 Method and system for generating animation according to audio music
CN102907120A (en) * 2010-06-02 2013-01-30 皇家飞利浦电子股份有限公司 System and method for sound processing
WO2016123788A1 (en) * 2015-02-06 2016-08-11 Empire Technology Development Llc Rhythm based multimedia generator
CN108419035A (en) * 2018-02-28 2018-08-17 北京小米移动软件有限公司 The synthetic method and device of picture video
CN111901626A (en) * 2020-08-05 2020-11-06 腾讯科技(深圳)有限公司 Background audio determining method, video editing method, device and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11915722B2 (en) * 2017-03-30 2024-02-27 Gracenote, Inc. Generating a video presentation to accompany audio
CN108319657B (en) * 2018-01-04 2022-02-01 广州市百果园信息技术有限公司 Method for detecting strong rhythm point, storage medium and terminal
CN110392302A (en) * 2018-04-16 2019-10-29 北京陌陌信息技术有限公司 Video is dubbed in background music method, apparatus, equipment and storage medium
CN110278484B (en) * 2019-05-15 2022-01-25 北京达佳互联信息技术有限公司 Video dubbing method and device, electronic equipment and storage medium
CN110933406B (en) * 2019-12-10 2021-05-14 央视国际网络无锡有限公司 Objective evaluation method for short video music matching quality
CN111917999A (en) * 2020-08-07 2020-11-10 上海传英信息技术有限公司 Video processing method, mobile terminal and readable storage medium

Similar Documents

Publication Publication Date Title
US11714816B2 (en) Information search method and apparatus, device and storage medium
CN110598046B (en) Artificial intelligence-based identification method and related device for title party
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
CN111935537A (en) Music video generation method and device, electronic equipment and storage medium
CN110209810B (en) Similar text recognition method and device
CN111538862B (en) Method and device for explaining video
CN109918669A (en) Entity determines method, apparatus and storage medium
CN111831854A (en) Video tag generation method and device, electronic equipment and storage medium
CN112507090B (en) Method, apparatus, device and storage medium for outputting information
CN107832720B (en) Information processing method and device based on artificial intelligence
CN107316641B (en) Voice control method and electronic equipment
CN104933171B (en) Interest point data association method and device
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN111512299A (en) Method for content search and electronic device thereof
CN111954087B (en) Method and device for intercepting images in video, storage medium and electronic equipment
CN112382287A (en) Voice interaction method and device, electronic equipment and storage medium
CN112685592B (en) Method and device for generating sports video soundtrack
CN111753964A (en) Neural network training method and device
CN111916203A (en) Health detection method and device, electronic equipment and storage medium
CN112995757B (en) Video clipping method and device
CN112328896B (en) Method, apparatus, electronic device, and medium for outputting information
US20210216783A1 (en) Method and apparatus for detecting temporal action of video, electronic device and storage medium
CN111949820B (en) Video associated interest point processing method and device and electronic equipment
CN111767990A (en) Neural network processing method and device
CN111274497B (en) Community recommendation and model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant