CN112380396A - Video processing method and device, computer readable storage medium and electronic equipment - Google Patents


Info

Publication number: CN112380396A
Application number: CN202011253155.8A
Authority: CN (China)
Prior art keywords: video, audio, information, ratio, processed
Other languages: Chinese (zh)
Other versions: CN112380396B
Inventors: 何重龙, 孙静
Current and original assignee: Netease Hangzhou Network Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Netease Hangzhou Network Co Ltd; priority to CN202011253155.8A
Publication of CN112380396A; application granted; publication of CN112380396B
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results

Abstract

The present disclosure provides a video processing method and apparatus, a computer-readable storage medium, and an electronic device, and relates to the technical field of video processing. The video processing method includes the following steps: acquiring object state information of an object in a video to be processed; acquiring audio data and determining audio characteristic information of the audio data; and generating target audio and video data according to the object state information and the audio characteristic information. The method and apparatus automatically adjust the video playing speed according to the music rhythm, improving the accuracy with which the video content matches the climax points of the music rhythm.

Description

Video processing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of the mobile internet, short video has entered a stage of vigorous development. Well suited to the fragmented way content spreads on mobile social media, short video content is continuously innovating, and the stuck point video (a beat-synced video) has become more and more popular. Stuck point video generation is a video technique in which the picture is matched to the rhythm of the music, with picture switching occurring smoothly at the rhythm points of the music. Generally, music with a strong rhythm is selected for producing a stuck point video, and the rhythm of the music must be consistent with the rhythm of the picture switching. This production method is widely used in short video creation on platforms such as Douyin.
In the prior art, there are two main methods for generating a stuck point video. The first is to match the uploaded video or photos against an existing music template and generate a stuck point video with one tap. The second is to produce the stuck point video manually with video editing software.
The first method is fast and convenient, but the music cannot be chosen freely, the number of video segments or photos is fixed, and the stuck point video cannot be personalized. With the second method, because production is manual, generation of the stuck point video is extremely inefficient, and since the positions of the music rhythm points are determined entirely by hand, the accuracy of those positions is poor.
Disclosure of Invention
The present disclosure is directed to a video processing method and apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to a certain extent, the problems of the related art that production must be done manually and that the matching accuracy between music rhythm points and video content is poor.
According to a first aspect of the present disclosure, there is provided a video processing method comprising: acquiring object state information of an object in a video to be processed; acquiring audio data and determining audio characteristic information of the audio data; and generating target audio and video data according to the object state information and the audio characteristic information.
Optionally, the obtaining of the state information of the object in the video to be processed includes: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
Optionally, determining the object state information according to the video data and the depth data includes: determining the area of an object in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data to obtain object depth information; based on the object area information and the object depth information, object state information is determined.
Optionally, determining the region where an object is located in each frame of the video to be processed based on the video data includes: inputting each frame of the video to be processed into a trained image recognition model, the output of the image recognition model being the region where the object is located in each frame of the video to be processed; acquiring the result output by the image recognition model as the region where the object is located; and determining the area of the object in each frame of the video to be processed according to the region where the object is located.
Optionally, determining the object state information based on the object area information and the object depth information includes: calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining the object state information according to the first ratio change curve.
Optionally, when the number of the objects is multiple, calculating a ratio of object area information to corresponding object depth information in each frame of picture of the video to be processed to obtain ratio information of each frame of picture, where the ratio information includes: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture.
Optionally, when the number of the objects is multiple, calculating a ratio of object area information to corresponding object depth information in each frame of picture of the video to be processed to obtain ratio information of each frame of picture, and further including: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and weighting the intermediate ratios to obtain the ratio information of each frame of picture.
Optionally, acquiring audio data and determining audio feature information includes: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; obtaining a first audio characteristic curve according to the frequency spectrum characteristic information; the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is frequency spectrum characteristic information; and determining audio characteristic information according to the first audio characteristic curve.
Optionally, generating target audio/video data according to the object state information and the audio characteristic information includes: determining the number of wave crests contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of wave crests contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave crests contained in the first ratio change curve and the number of wave crests contained in the first audio characteristic curve.
Optionally, generating target audio/video data according to the number of peaks included in the first ratio variation curve and the number of peaks included in the first audio characteristic curve, including: if the number of wave crests contained in the first ratio change curve is different from the number of wave crests contained in the first audio characteristic curve, filtering part of wave crests in the first ratio change curve and wave crests in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve; the second ratio variation curve and the second audio characteristic curve contain the same number of peaks; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
Optionally, generating target audio and video data according to the second ratio change curve and the second audio characteristic curve includes: determining, according to the second ratio change curve, the position in the video data corresponding to each peak in the second ratio change curve, to obtain ratio peak positions; determining, according to the second audio characteristic curve, the position in the audio data corresponding to each peak in the second audio characteristic curve, to obtain audio peak positions; and generating target audio and video data according to the ratio peak positions and the audio peak positions.
Optionally, generating target audio and video data according to the ratio peak positions and the audio peak positions includes: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, adjusting the playing speed of the video to be processed to generate the target audio and video data.
Optionally, generating target audio and video data according to the ratio peak positions and the audio peak positions further includes: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, cutting the video to be processed to generate the target audio and video data.
Optionally, generating target audio and video data according to the ratio peak positions and the audio peak positions further includes: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed, to generate the target audio and video data.
According to a second aspect of the present disclosure, there is provided a video processing apparatus comprising: the device comprises a state information acquisition module, an audio information acquisition module and a target data generation module.
Specifically, the state information obtaining module may be configured to obtain object state information of an object in the video to be processed; the audio information acquisition module can be used for acquiring audio data and determining audio characteristic information of the audio data; the target data generation module can be used for generating target audio and video data according to the object state information and the audio characteristic information.
Optionally, the state information obtaining module may be configured to perform: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
Optionally, the state information obtaining module may be configured to perform: determining the area of an object in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data to obtain object depth information; based on the object area information and the object depth information, object state information is determined.
Optionally, the state information obtaining module may be configured to perform: inputting each frame of the video to be processed into a trained image recognition model, the output of the image recognition model being the region where the object is located in each frame of the video to be processed; acquiring the result output by the image recognition model as the region where the object is located; and determining the area of the object in each frame of the video to be processed according to the region where the object is located.
Optionally, the state information obtaining module may be configured to perform: calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining the object state information according to the first ratio change curve.
Optionally, the state information obtaining module may be configured to perform: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture.
Optionally, the state information obtaining module may be configured to perform: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and weighting the intermediate ratios to obtain the ratio information of each frame of picture.
Optionally, the audio information obtaining module may be configured to perform: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; obtaining a first audio characteristic curve according to the frequency spectrum characteristic information; the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is frequency spectrum characteristic information; and determining audio characteristic information according to the first audio characteristic curve.
Optionally, the target data generation module may be configured to perform: determining the number of wave crests contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of wave crests contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave crests contained in the first ratio change curve and the number of wave crests contained in the first audio characteristic curve.
Optionally, the target data generation module may be configured to perform: if the number of wave crests contained in the first ratio change curve is different from the number of wave crests contained in the first audio characteristic curve, filtering part of wave crests in the first ratio change curve and wave crests in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve; the second ratio variation curve and the second audio characteristic curve contain the same number of peaks; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
Optionally, the target data generation module may be configured to perform: determining, according to the second ratio change curve, the position in the video data corresponding to each peak in the second ratio change curve, to obtain ratio peak positions; determining, according to the second audio characteristic curve, the position in the audio data corresponding to each peak in the second audio characteristic curve, to obtain audio peak positions; and generating target audio and video data according to the ratio peak positions and the audio peak positions.
Optionally, the target data generation module may be configured to perform: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, adjusting the playing speed of the video to be processed to generate the target audio and video data.
Optionally, the target data generation module may be configured to perform: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, cutting the video to be processed to generate the target audio and video data.
Optionally, the target data generation module may be configured to perform: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed, to generate the target audio and video data.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the video processing methods described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the video processing methods described above via execution of the executable instructions.
In the technical solutions provided by some embodiments of the present disclosure, object state information of an object in a video to be processed is first obtained; audio data is acquired and its audio characteristic information is determined; and target audio and video data is generated according to the object state information and the audio characteristic information. The generated target audio and video data is a processed video matched to the rhythm climax points of the selected audio. The video processing method provided by the present disclosure generates stuck point videos automatically: it can adjust the playing speed of the video according to the motion of the object in the video and the music content, without manual cutting and splicing, which improves both the convenience of generating stuck point videos and the accuracy with which the video content matches the music climax.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates the current flow for generating a stuck point video;
fig. 2 schematically shows a flow chart of a video processing method according to an exemplary embodiment of the present disclosure;
fig. 3 schematically illustrates an effect diagram of determining the kind and number of objects from a video to be processed, according to an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates an identification flow diagram of the YOLO algorithm according to an exemplary embodiment of the present disclosure;
fig. 5 schematically illustrates an object detection result of a certain frame picture using the YOLO algorithm according to an exemplary embodiment of the present disclosure;
fig. 6 schematically shows a schematic diagram of an object detection result in another frame picture using the YOLO algorithm according to an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a first ratio change curve plotted from the K values according to an exemplary embodiment of the present disclosure;
fig. 8 schematically illustrates a first audio characteristic curve diagram plotted according to characteristic information of volume, frequency spectrum, etc. according to an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a second ratio variation graph after filtering out a portion of the peaks according to a threshold, according to an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a second audio signature graph after filtering out a portion of peaks according to a threshold, according to an exemplary embodiment of the present disclosure;
FIG. 11 schematically illustrates a comparison of a first ratio change curve and a second ratio change curve according to an exemplary embodiment of the present disclosure;
FIG. 12 schematically illustrates a comparison graph of a first audio characteristic curve and a second audio characteristic curve according to an exemplary embodiment of the present disclosure;
FIG. 13 schematically illustrates a second ratio variation curve versus a second audio characteristic curve according to an exemplary embodiment of the present disclosure;
fig. 14 schematically shows a block diagram of a video processing apparatus of an exemplary embodiment of the present disclosure;
fig. 15 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
With the rapid development of the mobile internet, short video has entered a stage of vigorous development. More and more people are watching and making short videos.
A video in which picture switching is synchronized to the music tempo is colloquially referred to as a "stuck point video". Generally, music with a strong rhythm is selected for producing a stuck point video, and the rhythm of the music must be consistent with the rhythm of the picture switching. This production method is widely used in short video creation on platforms such as Douyin. Fig. 1 schematically shows the current flow for generating a stuck point video: videos to be spliced are selected on a mobile phone, a piece of background music is selected, and finally the videos are spliced manually according to the rhythm climax of the music, so that the splice points match the rhythm climax of the music.
At present, most stuck point videos are produced manually with non-linear editing software on a PC or a mobile phone: the producer identifies the beats of the music by ear and, by switching to visually matching pictures, finally makes the music rhythm and the picture-switching rhythm consistent. Apps such as Douyin and PC authoring software provide various templates in which the music beat and the picture-switching beat have been aligned manually in advance and saved as project files; users can achieve the stuck point effect and generate a new video by manually replacing the pictures.
However, a stuck point video generated by fusing video and music through manual operation requires a person to judge whether the music rhythm matches the picture content, and the time of each music rhythm point is kept consistent with the time of picture switching by subjective feeling. This is time-consuming and labor-intensive, the resulting stuck point video is often poor, and the matching accuracy between the video content and the music rhythm points tends to be low. In view of this, a new video processing method is needed.
The various steps of the video processing method of the exemplary embodiments of the present disclosure may generally be performed by a mobile phone. However, aspects of the present disclosure may also be implemented on a server or another terminal device, where other terminal devices may include, but are not limited to, a tablet, a personal computer, and the like.
Fig. 2 schematically shows a flow chart of a video processing method of an exemplary embodiment of the present disclosure. Referring to fig. 2, the video processing method may include the steps of:
and S22, acquiring object state information of an object in the video to be processed.
In an exemplary embodiment of the present disclosure, the video to be processed may be a video shot by a user in real time, or may be a video shot previously; the objects in the video to be processed can be one or more of a human, a hand of the human, a head of the human, a body of the human, and the like, and can also be various objects or animals, such as a fan, a puppy, and the like; the object state information includes type information, area information, distance information of the object from the camera, and area-to-distance ratio information of the object. On the basis of the acquired original video data and depth data, the information can be calculated.
In the embodiment of the present disclosure, the method for acquiring the object state information of the object in the video to be processed may be to acquire video data and depth data of the video to be processed; object state information is determined from the video data and the depth data. The video data is original video data, the depth data is data included in a depth map, and the depth map can be acquired by a monocular camera, a binocular camera, a Time of Flight (TOF) camera, a structured light camera and other devices.
In embodiments of the present disclosure, the depth data of the original video may be acquired by a TOF camera built into the mobile phone. When the video is shot with the TOF function turned on, a depth map of each frame can be obtained while the video is shot with the ordinary camera. TOF works by illuminating the target object and measuring the travel time of the light between the lens and the object; the distance between the object and the camera is determined from this measurement, yielding a depth map in which the gray value of each pixel represents the distance between the corresponding point in the picture and the camera.
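Purely as an illustration (the disclosure does not prescribe an implementation), the following Python sketch reads an object's distance out of such a depth map, assuming an 8-bit grayscale depth map whose pixel values map linearly to distance; the scale factor `meters_per_level` is a hypothetical placeholder:

```python
import numpy as np

def object_distance(depth_map: np.ndarray, box: tuple, meters_per_level: float = 0.02) -> float:
    """Estimate the object-to-camera distance from a depth map.

    depth_map: HxW uint8 array whose gray values are assumed proportional to distance.
    box: (x1, y1, x2, y2) bounding box of the detected object, in pixels.
    meters_per_level: hypothetical linear scale from gray level to meters.
    """
    x1, y1, x2, y2 = box
    region = depth_map[y1:y2, x1:x2].astype(np.float64)
    # Average the gray values over the object's region, then convert to meters.
    return float(region.mean()) * meters_per_level
```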
In the embodiment of the disclosure, the object state information is determined according to the video data and the depth data, and the method may be that, based on the video data, the region where the object is located in each frame of the video to be processed is determined to obtain the object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data to obtain object depth information; based on the object area information and the object depth information, object state information is determined. The object area information is the area occupied by the object in the picture, the area can be obtained by calculation according to the area of the object in the picture, and the area can be obtained by an image identification method; the depth data is data contained in a depth map, the depth map can be acquired by equipment such as a monocular camera, a binocular camera, a TOF camera and a structured light camera, the data can reflect the distance between a certain point in a picture and the camera, and the depth information of the object is depth data corresponding to the area where the object is located.
In an exemplary embodiment of the present disclosure, based on the video data, the region where the object is located in each frame of the video to be processed may be determined by inputting each frame of the video to be processed into a trained image recognition model; the image recognition model outputs the region where the object is located in each frame; the result output by the image recognition model is acquired as the region where the object is located; and the area of the object in each frame of the video to be processed is determined according to the region where the object is located.
In an exemplary embodiment of the present disclosure, fig. 3 schematically shows an effect diagram for determining the kind and number of objects from video data. As shown in fig. 3, by performing image recognition on the acquired original video data 31, the type and number of objects included in the video can be detected. The number of the objects to be detected can be set according to the requirements of the user, and can also be automatically selected according to the video content. When only one object needs to be detected, a detection frame 33 of a target object can be obtained on the basis of the original image 32 by an image recognition method, and the detection frame represents a detected human body area; when a plurality of objects need to be detected, a plurality of target object detection frames 35, 36, 37 can be obtained on the basis of the original image 34 by an image recognition method, and respectively represent a head region, a body region, and a hand region.
In an exemplary embodiment of the present disclosure, the image recognition method may use Fast R-CNN (Fast Region-based Convolutional Neural Network), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), or the like. Taking the YOLO target detection method as an example, a data set is first established for YOLO model training; after the model has been trained iteratively many times, a model whose recognition accuracy meets practical application requirements is obtained, and that model is then called to detect the pictures or videos requiring target detection. Specifically, fig. 4 schematically shows the recognition flow of the YOLO algorithm. In an embodiment of the present disclosure, the original video data is first input into a YOLO model; after each frame of the original video passes through a number of convolution layers and residual layers, the algorithm detects 3 candidate boxes for each target object, and the detection box closest to the real target object region is then obtained using non-maximum suppression. With this target detection method, the type and number of objects in each frame of the original video can be obtained accurately, and the area of the region where each object is located can be determined from the size of the obtained target detection box.
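As a hedged sketch of this detection step (the disclosure names YOLO but not a specific library), the following uses OpenCV for frame reading and the `ultralytics` package as a stand-in YOLO implementation; the weights file name is a hypothetical placeholder:

```python
import cv2
from ultralytics import YOLO  # stand-in YOLO implementation (an assumption)

model = YOLO("yolov8n.pt")  # hypothetical pretrained weights file

def object_boxes_per_frame(video_path: str):
    """Yield, per frame, a list of (class_id, box, area) tuples from YOLO detection."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]  # non-maximum suppression is applied internally
        detections = []
        for cls, xyxy in zip(result.boxes.cls.tolist(), result.boxes.xyxy.tolist()):
            x1, y1, x2, y2 = xyxy
            # Box area serves as the area of the region where the object is located.
            detections.append((int(cls), (x1, y1, x2, y2), (x2 - x1) * (y2 - y1)))
        yield detections
    cap.release()
```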
In an exemplary embodiment of the present disclosure, fig. 5 schematically illustrates an object detection result of a certain frame of picture obtained by using the YOLO algorithm, where 51 is an original picture, and 52, 53, and 54 are target frames of a detected head region, a detected body region, and a detected hand region, respectively. Fig. 6 also schematically shows a schematic diagram of an object detection result in another frame of picture obtained by using the YOLO algorithm, where 61 is an original picture, and 62, 63, and 64 are target frames of the detected head region, body region, and hand region, respectively. The area of the region where each object is located can be calculated through the target frame obtained through detection. It is easy to see that the hand area 54 in fig. 5 is farther from the camera than the hand area 64 in fig. 6 and occupies a smaller area on the screen, while the head area 52 in fig. 5 is closer to the camera than the head area 62 in fig. 6 and occupies a larger area on the screen, so that the video can be selected to be processed according to the area of the object on the screen and the distance from the camera.
In an exemplary embodiment of the present disclosure, based on the object area information and the object depth information, the method for determining the object state information may be that a ratio of the object area information to the corresponding object depth information in each frame of the video to be processed is calculated to obtain ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is the ratio information; and determining the object state information according to the first ratio change curve.
In an exemplary embodiment of the present disclosure, the distance information between the object and the camera may be determined from the depth data, and the depth data of the original video may be acquired by the mobile phone's own TOF camera. After the types and number of the objects and the areas of their regions are obtained, the distance from the region where each object is located to the camera is calculated. The distance may be the average distance over the object's region, the distance from the object's midpoint to the camera, or the average distance from several randomly selected points on the object to the camera, which is not limited by the present disclosure.
In an exemplary embodiment of the present disclosure, in a case that the number of the objects is multiple, a ratio between object area information and corresponding object depth information in each frame of a to-be-processed video is calculated, and a method of obtaining the ratio information of each frame of the to-be-processed video may be that a ratio between object area information and corresponding object depth information in each frame of the to-be-processed video of each object is calculated respectively to obtain multiple intermediate ratios; and calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture. Wherein, the intermediate ratio refers to the ratio that needs to be calculated before the final calculation result is obtained. It should be noted that the calculation method for obtaining the final ratio information may be to calculate an average value of a plurality of intermediate ratios, or may also be to take a median value or directly use a sum of the intermediate ratios as the ratio information, and all of the methods belong to the protection scope of the present disclosure.
In an exemplary embodiment of the present disclosure, after the calculation methods for the area of the region where the object is located and the distance between the object and the camera have been determined, the area (denoted S) and the distance from the camera (denoted L) can be computed for the object in each frame of the original video, and the ratio of the area to the corresponding distance gives the ratio data (denoted K) for each frame:

K = S / L

When there are multiple objects in the picture, the areas of the regions where the objects are located are denoted S1, S2, S3, ..., Sn, and the corresponding distances from the camera are denoted L1, L2, L3, ..., Ln. The value of K is then:

K = (S1/L1 + S2/L2 + S3/L3 + ... + Sn/Ln) / n
Fig. 7 schematically shows the first ratio change curve plotted from the K values; it is the original ratio change curve drawn from the original video data. The horizontal coordinate is the time point T in the original video, and the vertical coordinate is the value of K. The smaller the K value, the farther the object in the picture is from the camera; the larger the K value, the closer the object is to the camera. A peak therefore indicates that the object in the picture approached the camera and then moved away from it. The first ratio change curve can represent the object state information of the object.
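A minimal sketch of the K computation and the first ratio change curve, assuming per-frame lists of (area, distance) pairs have already been obtained as above:

```python
def frame_ratio(objects):
    """K for one frame: the mean of the per-object ratios S_i / L_i.

    objects: list of (area, distance) pairs for the objects detected in the frame.
    """
    ratios = [s / l for s, l in objects]
    return sum(ratios) / len(ratios)

def ratio_curve(per_frame_objects, fps: float):
    """Build the first ratio change curve: one K value per frame, indexed by time."""
    times, ks = [], []
    for i, objects in enumerate(per_frame_objects):
        times.append(i / fps)  # time point of this frame on the horizontal axis
        ks.append(frame_ratio(objects))
    return times, ks
```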
In an exemplary embodiment of the present disclosure, in a case that the number of the objects is multiple, the ratio between the object area information in each frame of the to-be-processed video and the corresponding object depth information is calculated, and the method for obtaining the ratio information of each frame of the to-be-processed video may further be that the ratio between the object area information in each frame of the to-be-processed video and the corresponding object depth information of each object is calculated respectively, so as to obtain multiple intermediate ratios; and weighting the intermediate ratios to obtain the ratio information of each frame of picture. In the weighting process, different weights are assigned to different objects, and the K value is calculated by combining the weights.
For example, when the objects detected in the video include a head, a hand, and a leg with areas of 20, 15, and 10 respectively, the weight of the head may be preset to 0.5, the weight of the hand to 0.3, and the weight of the leg to 0.2. The weighted area of the head is then 20 × 0.5 = 10, and the weighted areas of the hand and the leg are 4.5 and 2 respectively; the final ratio information can then be calculated based on the weighted areas.
It should be noted that the calculation of the K value in the exemplary embodiment is only one of the calculation methods covered by the present disclosure, and the disclosure is not limited to this one method; any method that processes video using the area of the object and its distance from the camera falls within the scope of the present disclosure.
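For the weighted variant, a short sketch under the assumption that per-object weights are preset and sum to 1:

```python
def weighted_frame_ratio(objects, weights):
    """K for one frame with preset per-object weights (e.g. head 0.5, hand 0.3, leg 0.2).

    objects: list of (area, distance) pairs; weights: matching list assumed to sum to 1.
    """
    return sum(w * s / l for w, (s, l) in zip(weights, objects))
```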
S24, audio data are obtained, and audio characteristic information of the audio data is determined.
In an exemplary embodiment of the present disclosure, the audio data may be the background music to be fused with the original video, and it may be selected by the user or chosen automatically. The audio characteristic information may be information such as the volume and spectral features of the music; it may be prepared in advance, with each piece of music having corresponding audio characteristic information, or it may be extracted in real time.
In an exemplary embodiment of the present disclosure, audio data is acquired, and spectral feature information corresponding to the audio data is determined; obtaining a first audio characteristic curve according to the frequency spectrum characteristic information; the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is frequency spectrum characteristic information; and determining audio characteristic information according to the first audio characteristic curve.
Specifically, the first audio characteristic curve is obtained from characteristic information such as the volume and frequency spectrum of the music; the curve is drawn from the original audio data, and the parts of the song where the musical rhythm reaches a climax correspond to the peak positions of the curve. Fig. 8 schematically shows a first audio characteristic curve plotted from characteristic information such as volume and frequency spectrum, where the abscissa represents the time T in the audio data, the ordinate represents the audio characteristic value M, and the peaks represent the climax parts of the audio data.
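As one possible realization (an assumption; the disclosure does not prescribe a specific audio feature or library), an RMS loudness envelope extracted with librosa can serve as the audio characteristic value M over time:

```python
import numpy as np
import librosa

def audio_characteristic_curve(audio_path: str, hop_length: int = 512):
    """Return (times, M): an RMS loudness envelope as the first audio characteristic curve."""
    y, sr = librosa.load(audio_path, sr=None)
    m = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    times = librosa.frames_to_time(np.arange(len(m)), sr=sr, hop_length=hop_length)
    return times, m
```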
And S26, generating target audio and video data according to the object state information and the audio characteristic information.
In an exemplary embodiment of the present disclosure, the object state information may be a first ratio variation curve, which is drawn according to a ratio of an area where the object is located in the screen to a distance from the area to the camera; the audio characteristic information can be a first audio characteristic curve drawn according to audio characteristic information such as music volume, spectral characteristics and the like, and the target audio and video data is processed video data, namely, a stuck point video with video content matched with music rhythm (climax). And generating a stuck point video with the video content matched with the music rhythm (climax) according to the first ratio change curve and the first audio characteristic curve.
In an exemplary embodiment of the present disclosure, a number of peaks included in a first ratio change curve is determined according to the first ratio change curve in the object state information; determining the number of wave crests contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of wave crests contained in the first ratio change curve and the number of wave crests contained in the first audio characteristic curve.
In an exemplary embodiment of the present disclosure, in the first ratio change curve, the closer the object in the video is to the device used for shooting, the larger the K value, so the number of peaks in the first ratio change curve reflects the number of times the object approaches the shooting device. The number of peaks in the first audio characteristic curve reflects the number of rhythm points or climax points in the music. From the number of times the object approaches the shooting device and the number of rhythm or climax points in the audio, a stuck point video is generated in which the moments when the object approaches the shooting device match the time points of the music rhythm or climax points.
In an exemplary embodiment of the present disclosure, if the number of peaks included in the first ratio change curve is different from the number of peaks included in the first audio characteristic curve, filtering out part of peaks in the first ratio change curve and peaks in the first audio characteristic curve according to a preset threshold to obtain a second ratio change curve and a second audio characteristic curve; the second ratio variation curve and the second audio characteristic curve contain the same number of peaks; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
In an exemplary embodiment of the present disclosure, the target audio/video data is obtained by matching peaks in the first ratio variation curve with peaks in the first audio characteristic curve, and if the number of the peaks in the first ratio variation curve is different from the number of the peaks in the first audio characteristic curve, the matching cannot be completed, or the target audio/video data generated after the matching has a poor effect. Therefore, when the number of the peaks in the first ratio change curve is different from the number of the peaks in the first audio characteristic curve, part of the peaks in the first ratio change curve and the peaks in the first audio characteristic curve can be filtered according to a preset threshold value, so that a second ratio change curve and a second audio characteristic curve with the same number of the peaks are obtained.
In an exemplary embodiment of the present disclosure, from the first ratio change curve in fig. 7 it can be determined that the curve contains 6 peaks, i.e., the object in the video approaches the camera 6 times. From the first audio characteristic curve in fig. 8 it can be determined that the curve contains 7 peaks, i.e., there are 7 rhythm climax points in the music. The number of times the object approaches the camera is thus not equal to the number of music climaxes, meaning that the video content and the music content cannot be matched directly. By presetting a peak threshold for the first ratio change curve and a peak threshold for the first audio characteristic curve, the less prominent peaks can be filtered out so that the two curves contain the same number of peaks, which makes matching the video content to the music content straightforward and also improves the degree of matching. It should be noted that the thresholds here may be preset manually or set automatically by the mobile phone.
In an exemplary embodiment of the present disclosure, after some insignificant peaks are filtered out according to the thresholds, two new curves with the same number of peaks are obtained. Fig. 9 schematically shows the second ratio change curve after part of the peaks are filtered out according to the threshold, and fig. 10 schematically shows the second audio characteristic curve after part of the peaks are filtered out according to the threshold. Fig. 11 compares the first and second ratio change curves, and fig. 12 compares the first and second audio characteristic curves. Compared with the curves before processing, some of the lower peaks have been filtered out, the remaining peaks are more prominent, and the number of peaks in the second ratio change curve equals the number of peaks in the second audio characteristic curve, both being 3.
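A sketch of this peak filtering, using scipy's `find_peaks` and a simple scheme that raises the height threshold on whichever curve currently has more peaks until the counts match (the disclosure only requires that the counts end up equal; the concrete scheme here is an assumption):

```python
import numpy as np
from scipy.signal import find_peaks

def equalize_peak_counts(k_curve, m_curve, step: float = 0.01):
    """Filter out minor peaks so both curves keep the same number of peaks.

    Raises the height threshold on the curve with more peaks, in steps of
    `step` times that curve's value range, until the counts match. Returns
    the indices of the surviving peaks in each curve.
    """
    k_curve, m_curve = np.asarray(k_curve), np.asarray(m_curve)
    tk, tm = k_curve.min(), m_curve.min()
    pk, _ = find_peaks(k_curve, height=tk)
    pm, _ = find_peaks(m_curve, height=tm)
    while len(pk) != len(pm):
        if len(pk) > len(pm):
            tk += step * (k_curve.max() - k_curve.min())
            pk, _ = find_peaks(k_curve, height=tk)
        else:
            tm += step * (m_curve.max() - m_curve.min())
            pm, _ = find_peaks(m_curve, height=tm)
    return pk, pm
```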
In an exemplary embodiment of the present disclosure, the position in the video data corresponding to each peak in the second ratio change curve is determined according to the second ratio change curve, giving the ratio peak positions; the position in the audio data corresponding to each peak in the second audio characteristic curve is determined according to the second audio characteristic curve, giving the audio peak positions; and the target audio and video data is generated according to the ratio peak positions and the audio peak positions. The position in the video data corresponding to a peak in the second ratio change curve is the time point of that peak in the video data, and the position in the audio data corresponding to a peak in the second audio characteristic curve is the time point of that peak in the audio data; the target audio and video data is generated from the time points of the ratio peaks in the video data and the time points of the audio peaks in the audio data.
In an exemplary embodiment of the present disclosure, if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, the playing speed of the video to be processed is adjusted to generate the target audio and video data. Fig. 13 schematically shows a comparison of the second ratio change curve and the second audio characteristic curve; after the two curves are obtained, comparison shows that the positions of their peaks on the horizontal axis differ, i.e., the corresponding time points differ. If the time points of the peaks in the two curves differ, the moments at which the object in the video approaches the camera do not match the moments of the music climaxes, so the playing speed of the video must be adjusted to make the moments the object approaches the camera coincide with the moments of the music climaxes.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of sections according to the position of the peak of the second ratio variation curve. For example, the second ratio variation curve in the exemplary embodiment of the present disclosure has 3 peaks, the video may be divided into 4 intervals. And respectively adjusting the playing speed of the video in each interval to ensure that the time point corresponding to the wave peak in the second ratio change curve is the same as the time point corresponding to the wave peak of the second audio characteristic curve, so as to obtain target audio and video data, namely the processed video. And fusing the processed video and the music to obtain the video with the background music, wherein the video content of the video is matched with the music climax.
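A sketch of this interval-wise speed adjustment, assuming the ratio peaks and audio peaks have already been paired one-to-one and that the overall durations of the video and the audio agree; timestamps are in seconds:

```python
def interval_speed_factors(ratio_peak_times, audio_peak_times, duration: float):
    """Per-interval playback speed factors that move each ratio peak onto its audio peak.

    The peaks split the video into len(peaks) + 1 intervals; a factor > 1 means
    the interval is sped up, a factor < 1 means it is slowed down.
    """
    assert len(ratio_peak_times) == len(audio_peak_times)
    src = [0.0] + list(ratio_peak_times) + [duration]  # original interval boundaries
    dst = [0.0] + list(audio_peak_times) + [duration]  # target interval boundaries
    return [(src[i + 1] - src[i]) / (dst[i + 1] - dst[i]) for i in range(len(src) - 1)]
```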
In another exemplary embodiment of the present disclosure, the method for generating the target audio and video data may further include: if the time point corresponding to a ratio peak position is different from the time point corresponding to the corresponding audio peak position, cutting the video to be processed to generate the target audio and video data.
Referring to fig. 13, fig. 13 schematically shows a comparison graph of the second ratio variation curve and the second audio characteristic curve, after the second ratio variation curve and the second audio characteristic curve are obtained, it can be found through comparison that the positions of the horizontal axes corresponding to the peaks in the two curves are different, that is, the corresponding time points are different. If the time points corresponding to the wave crests in the two curves are different, that is, the time point of the object in the video close to the camera is not matched with the time point of the music climax, at the moment, the content of the video can be cut, so that the time point of the object in the video close to the camera is the same as the time point of the music climax.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of sections according to the position of the peak of the second ratio variation curve. For example, the second ratio variation curve in the exemplary embodiment of the present disclosure has 3 peaks, the video may be divided into 4 intervals. If the time point corresponding to the position of the peak of the second ratio change curve is behind the time point corresponding to the position of the peak of the second audio characteristic curve, the video content in each interval is cut, so that the time point corresponding to the peak in the second ratio change curve is the same as the time point corresponding to the peak of the second audio characteristic curve, and the target audio and video data, namely the processed video, is obtained. And fusing the processed video and the music to obtain the video with the background music, wherein the video content of the video is matched with the music climax.
In another exemplary embodiment of the present disclosure, the method for generating the target audio/video data may further include: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed to generate the target audio/video data.
Referring to fig. 13, which schematically shows a comparison of the second ratio variation curve and the second audio characteristic curve, comparison of the two curves shows that their peaks fall at different positions on the horizontal axis, that is, at different time points. In that case the time point at which the object in the video approaches the camera does not match the time point of the music climax, and both adjusting the playing speed of the video and cutting the video content can be used to make the two time points coincide.
In an exemplary embodiment of the present disclosure, the video content may be divided into a plurality of intervals according to the positions of the peaks of the second ratio variation curve. For example, if the second ratio variation curve has 3 peaks, the video may be divided into 4 intervals. An interval whose video duration is too long may be cut or played at a higher speed, while an interval whose video duration is too short may be played at a lower speed, so that the time point corresponding to each peak in the second ratio variation curve coincides with the time point corresponding to the matching peak of the second audio characteristic curve, yielding the target audio/video data, that is, the processed video. Fusing the processed video with the music produces a video with background music whose content matches the music climax.
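One way to combine the two operations for a single interval is sketched below; the `max_rate` cap is an assumed design parameter, not part of the disclosure:

```python
def align_interval(video_len, audio_len, max_rate=1.5):
    """Playback rate and trim amount that fit one video interval to the
    matching audio interval.

    A long interval is first sped up (rate > 1, capped at max_rate so the
    motion still looks natural) and any remaining surplus is cut; a short
    interval is slowed down (rate < 1) instead, with nothing cut.
    """
    if video_len > audio_len:
        rate = min(video_len / audio_len, max_rate)
        cut = max(video_len / rate - audio_len, 0.0)
        return rate, cut
    return video_len / audio_len, 0.0
```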
In another exemplary embodiment of the present disclosure, the distance between the object and the camera may also be estimated from the size of the area occupied by the object in the video picture alone. When the object is far from the camera, it occupies a smaller area of the shot picture; when the object is close to the camera, it occupies a larger area. The area alone can therefore serve as a proxy for the distance between the object and the camera.
In another exemplary embodiment of the present disclosure, each frame of the video to be processed is input into a trained image recognition model, which outputs the region where the object is located in that frame. The output of the model is taken as the region where the object is located, and the area of the object in each frame is determined from that region. After the area of the object in each frame has been obtained, the area values can be plotted in a coordinate system and connected into a curve to obtain a first object area curve, and the target audio/video data is then generated from the first object area curve and the first audio characteristic curve.
The method for generating the target audio/video data from the first object area curve and the first audio characteristic curve is the same as the method for generating it from the first ratio variation curve and the first audio characteristic curve: part of the peaks in the first object area curve and in the first audio characteristic curve are filtered out according to a preset threshold to obtain a second object area curve and a second audio characteristic curve containing the same number of peaks, and the video to be processed is then cut and/or its playing speed adjusted according to the time points corresponding to the peak positions in the two curves, generating the target audio/video data.
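The threshold-based peak filtering could be realized, for instance, by raising a shared prominence threshold until the two curves expose the same number of peaks; this is one plausible reading of the "preset threshold", not the only one:

```python
import numpy as np
from scipy.signal import find_peaks

def match_peak_counts(curve_a, curve_b, steps=100):
    """Filter minor peaks until both curves contain the same number.

    The threshold is expressed as a fraction of each curve's value range
    and swept upward; the first value at which the peak counts agree (and
    are nonzero) is accepted.
    """
    for t in np.linspace(0.0, 1.0, steps + 1):
        pa, _ = find_peaks(curve_a, prominence=t * np.ptp(curve_a))
        pb, _ = find_peaks(curve_b, prominence=t * np.ptp(curve_b))
        if len(pa) == len(pb) and len(pa) > 0:
            return pa, pb  # indices of the retained peaks
    return None  # no common threshold equalizes the counts
```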
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, the present exemplary embodiment also provides a video processing apparatus.
Fig. 14 schematically shows a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure. Referring to fig. 14, the video processing apparatus 14 according to an exemplary embodiment of the present disclosure may include a status information acquisition module 141, an audio information acquisition module 143, and a target data generation module 145.
Specifically, the state information obtaining module 141 may be configured to obtain object state information of an object in the video to be processed; the audio information obtaining module 143 may be configured to obtain audio data and determine audio feature information of the audio data; the target data generating module 145 may be configured to generate target audio/video data according to the object state information and the audio feature information.
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: acquiring video data and depth data of a video to be processed; object state information is determined from the video data and the depth data.
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: determining the area of an object in each frame of picture of the video to be processed based on the video data to obtain object area information; determining depth data corresponding to the region where the object is located in each frame of picture of the video to be processed based on the region where the object is located and the depth data to obtain object depth information; based on the object area information and the object depth information, object state information is determined.
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: inputting each frame of picture of a video to be processed into a trained image recognition model, the output of the image recognition model being the region where the object is located in each frame of picture of the video to be processed; acquiring a result output by the image recognition model as the region where the object is located; and determining the area of the object in each frame of picture of the video to be processed according to the region where the object is located.
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame of the video; obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is ratio information; and determining the object state information according to the first ratio change curve.
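A compact sketch of this step, assuming per-frame object areas and average region depths have already been extracted (both array names are illustrative):

```python
import numpy as np

def ratio_curve(areas, depths, fps):
    """First ratio variation curve.

    areas[i]  -- pixel area of the object region in frame i
    depths[i] -- average depth of that region in frame i
    Returns the abscissa (time point of each frame) and the ordinate
    (ratio information); the ratio grows as the object nears the camera,
    since the area rises while the depth falls.
    """
    ratios = np.asarray(areas, dtype=float) / np.asarray(depths, dtype=float)
    times = np.arange(len(ratios)) / fps
    return times, ratios
```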
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture.
In an exemplary embodiment of the present disclosure, the state information acquisition module 141 may be configured to perform: respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios; and weighting the intermediate ratios to obtain the ratio information of each frame of picture.
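Both multi-object variants (plain averaging and weighting) could be folded into one helper, sketched below; the weighting scheme itself is left open by the disclosure:

```python
def frame_ratio(areas, depths, weights=None):
    """Ratio information of one frame when several objects are present.

    Each object contributes an intermediate ratio area / depth. With
    weights=None the intermediate ratios are averaged; otherwise a
    weighted sum is taken (for example, to let a foreground subject
    dominate).
    """
    ratios = [a / d for a, d in zip(areas, depths)]
    if weights is None:
        return sum(ratios) / len(ratios)
    return sum(w * r for w, r in zip(weights, ratios))
```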
In an exemplary embodiment of the present disclosure, the audio information acquisition module 143 may be configured to perform: acquiring audio data and determining frequency spectrum characteristic information corresponding to the audio data; obtaining a first audio characteristic curve according to the frequency spectrum characteristic information; the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is frequency spectrum characteristic information; and determining audio characteristic information according to the first audio characteristic curve.
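The disclosure does not pin down the spectrum feature; as one assumed instantiation, short-time spectral energy yields a curve of the described shape:

```python
import numpy as np

def audio_feature_curve(samples, sample_rate, frame_len=2048, hop=512):
    """First audio characteristic curve from short-time spectral energy.

    Returns the abscissa (time point of each analysis frame in the audio
    data) and the ordinate (spectrum feature information); energy peaks
    tend to coincide with climactic passages of the music.
    """
    window = np.hanning(frame_len)
    times, feats = [], []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(float(np.sum(spectrum ** 2)))
        times.append(start / sample_rate)
    return np.array(times), np.array(feats)
```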
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: determining the number of peaks contained in a first ratio change curve according to the first ratio change curve in the object state information; determining the number of peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information; and generating target audio and video data according to the number of peaks contained in the first ratio change curve and the number of peaks contained in the first audio characteristic curve.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the number of wave crests contained in the first ratio change curve is different from the number of wave crests contained in the first audio characteristic curve, filtering part of wave crests in the first ratio change curve and wave crests in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve; the second ratio variation curve and the second audio characteristic curve contain the same number of peaks; and generating target audio and video data according to the second ratio change curve and the second audio characteristic curve.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: determining the corresponding position of each peak in the second ratio change curve in the video data according to the second ratio change curve to obtain the ratio peak position; determining the corresponding position of each peak in the second audio characteristic curve in the audio data according to the second audio characteristic curve to obtain the audio peak position; and generating target audio and video data according to the ratio peak position and the audio peak position.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed to generate target audio and video data.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed to generate target audio and video data.
In an exemplary embodiment of the present disclosure, the target data generation module 145 may be configured to perform: if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed to generate target audio and video data.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
The program product for implementing the above method according to an embodiment of the present invention may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical disk, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1500 according to this embodiment of the invention is described below with reference to fig. 15. The electronic device 1500 shown in fig. 15 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 15, electronic device 1500 is in the form of a general purpose computing device. Components of electronic device 1500 may include, but are not limited to: the at least one processing unit 1510, the at least one storage unit 1520, a bus 1530 connecting different system components (including the storage unit 1520 and the processing unit 1510), and a display unit 1540.
The storage unit stores program code that can be executed by the processing unit 1510, so that the processing unit 1510 performs the steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification.
The storage unit 1520 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 15201 and/or a cache memory unit 15202, and may further include a read only memory unit (ROM) 15203.
Storage unit 1520 may also include a program/utility 15204 having a set (at least one) of program modules 15205, such program modules 15205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1530 represents one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The electronic device 1500 can also communicate with one or more external devices 1600 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 1550. Also, the electronic device 1500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 1560. As shown, the network adapter 1560 communicates with the other modules of the electronic device 1500 over the bus 1530. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit, and conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (17)

1. A video processing method, comprising:
acquiring object state information of an object in a video to be processed;
acquiring audio data and determining audio characteristic information of the audio data;
and generating target audio and video data according to the object state information and the audio characteristic information.
2. The video processing method according to claim 1, wherein acquiring the object state information of the object in the video to be processed comprises:
acquiring video data and depth data of the video to be processed;
and determining the object state information according to the video data and the depth data.
3. The video processing method of claim 2, wherein determining the object state information from the video data and the depth data comprises:
determining the area of the object in each frame of the video to be processed based on the video data to obtain object area information;
determining depth data corresponding to the region where the object is located in each frame of the video to be processed based on the region where the object is located and the depth data to obtain object depth information;
determining the object state information based on the object area information and the object depth information.
4. The video processing method according to claim 3, wherein determining the area of the object in each frame of the video to be processed based on the video data comprises:
inputting each frame of picture of the video to be processed into a trained image recognition model, the image recognition model being capable of outputting the region where the object is located in each frame of picture of the video to be processed;
obtaining a result output by the image recognition model as the region where the object is located;
and determining the area of the object in each frame of picture of the video to be processed according to the region where the object is located.
5. The video processing method of claim 3, wherein determining the object state information based on the object area information and the object depth information comprises:
calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame of the video;
obtaining a first ratio change curve according to the ratio information; the abscissa of the first ratio change curve is a time point corresponding to each frame of picture in the video to be processed, and the ordinate is the ratio information;
and determining the object state information according to the first ratio change curve.
6. The video processing method according to claim 5, wherein, in a case where a plurality of the objects are present, calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame comprises:
respectively calculating the ratio of the object area information of each object in each frame of picture of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios;
and calculating the average value of the plurality of intermediate ratios as the ratio information of each frame of picture.
7. The video processing method according to claim 5, wherein, in a case where a plurality of the objects are present, calculating the ratio of the object area information to the corresponding object depth information in each frame of the video to be processed to obtain the ratio information of each frame further comprises:
respectively calculating the ratio of the object area information of each object in each frame of the video to be processed to the corresponding object depth information to obtain a plurality of intermediate ratios;
and weighting the intermediate ratios to obtain the ratio information of each frame of picture.
8. The video processing method of claim 5, wherein acquiring audio data and determining the audio characteristic information of the audio data comprises:
acquiring the audio data and determining the frequency spectrum characteristic information corresponding to the audio data;
obtaining a first audio characteristic curve according to the frequency spectrum characteristic information; the abscissa of the first audio characteristic curve is a time point corresponding to the audio data, and the ordinate is the frequency spectrum characteristic information;
and determining the audio characteristic information according to the first audio characteristic curve.
9. The video processing method according to claim 8, wherein generating target audio-video data according to the object state information and the audio feature information comprises:
determining the number of peaks contained in a first ratio change curve according to the first ratio change curve in the object state information;
determining the number of peaks contained in a first audio characteristic curve according to the first audio characteristic curve in the audio characteristic information;
and generating the target audio and video data according to the number of peaks contained in the first ratio change curve and the number of peaks contained in the first audio characteristic curve.
10. The video processing method according to claim 9, wherein generating the target audio-video data according to the number of peaks included in the first ratio variation curve and the number of peaks included in the first audio characteristic curve includes:
if the number of peaks contained in the first ratio change curve is different from the number of peaks contained in the first audio characteristic curve, filtering out part of peaks in the first ratio change curve and peaks in the first audio characteristic curve according to a preset threshold value to obtain a second ratio change curve and a second audio characteristic curve; the second ratio variation curve and the second audio characteristic curve have the same number of peaks;
and generating the target audio and video data according to the second ratio change curve and the second audio characteristic curve.
11. The video processing method according to claim 10, wherein generating the target audio-video data according to the second ratio variation curve and the second audio characteristic curve comprises:
determining the corresponding position of each peak in the second ratio change curve in the video data according to the second ratio change curve to obtain the ratio peak position;
determining the corresponding position of each peak in the second audio characteristic curve in the audio data according to the second audio characteristic curve to obtain the audio peak position;
and generating the target audio and video data according to the ratio peak position and the audio peak position.
12. The video processing method according to claim 11, wherein generating the target audio-video data according to the ratio peak position and the audio peak position comprises:
and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed to generate the target audio and video data.
13. The video processing method according to claim 11, wherein generating the target audio-video data according to the ratio peak position and the audio peak position further comprises:
and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, cutting the video to be processed to generate the target audio and video data.
14. The video processing method according to claim 11, wherein generating the target audio-video data according to the ratio peak position and the audio peak position further comprises:
and if the time point corresponding to the ratio peak position is different from the time point corresponding to the audio peak position, adjusting the playing speed of the video to be processed and cutting the video to be processed to generate the target audio and video data.
15. A video processing apparatus, comprising:
the state information acquisition module is used for acquiring the object state information of an object in the video to be processed;
the audio information acquisition module is used for acquiring audio data and determining audio characteristic information of the audio data;
and the target data generation module is used for generating target audio and video data according to the object state information and the audio characteristic information.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the video processing method of any one of claims 1 to 14.
17. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the video processing method of any of claims 1-14 via execution of the executable instructions.
CN202011253155.8A 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment Active CN112380396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011253155.8A CN112380396B (en) 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011253155.8A CN112380396B (en) 2020-11-11 Video processing method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112380396A true CN112380396A (en) 2021-02-19
CN112380396B CN112380396B (en) 2024-04-26

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170078825A1 (en) * 2015-09-16 2017-03-16 Magic Leap, Inc. Head pose mixing of audio files
CN107124624A (en) * 2017-04-21 2017-09-01 腾讯科技(深圳)有限公司 The method and apparatus of video data generation
CN107682740A (en) * 2017-09-11 2018-02-09 广东欧珀移动通信有限公司 Composite tone method and electronic installation in video
US20200125880A1 (en) * 2018-10-23 2020-04-23 Polarr, Inc. Machine guided photo and video composition
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video
US20200342646A1 (en) * 2019-04-23 2020-10-29 Adobe Inc. Music driven human dancing video synthesis
CN110677711A (en) * 2019-10-17 2020-01-10 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable medium
CN110933487A (en) * 2019-12-18 2020-03-27 北京百度网讯科技有限公司 Method, device and equipment for generating click video and storage medium
CN111508456A (en) * 2020-07-01 2020-08-07 北京美摄网络科技有限公司 Audio data processing method and device, electronic equipment and storage medium
CN111741233A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Video dubbing method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Hui; TIAN Penghui: "Design of an Android-based audio and video synchronization algorithm", 工业仪表与自动化装置 (Industrial Instrumentation & Automation), no. 04, 5 August 2012 (2012-08-05), pages 25-28 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804665A (en) * 2018-06-08 2018-11-13 上海掌门科技有限公司 For pushed information, the method and apparatus for receiving information
CN113365132A (en) * 2021-05-27 2021-09-07 网易有道信息技术(江苏)有限公司 Image processing method and device, electronic equipment and storage medium
CN114630180A (en) * 2022-03-18 2022-06-14 上海哔哩哔哩科技有限公司 Video generation method and device

Similar Documents

Publication Publication Date Title
CN109635621B (en) System and method for recognizing gestures based on deep learning in first-person perspective
US9667860B2 (en) Photo composition and position guidance in a camera or augmented reality system
KR20210018850A (en) Video restoration method and device, electronic device and storage medium
US10970909B2 (en) Method and apparatus for eye movement synthesis
TWI735112B (en) Method, apparatus and electronic device for image generating and storage medium thereof
US9842278B2 (en) Image analysis and orientation correction for target object detection and validation
CN109981991A (en) Model training method, image processing method, device, medium and electronic equipment
CN111199540A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
CN112380396B (en) Video processing method and device, computer readable storage medium and electronic equipment
CN112597944A (en) Key point detection method and device, electronic equipment and storage medium
CN112380396A (en) Video processing method and device, computer readable storage medium and electronic equipment
US10748554B2 (en) Audio source identification
WO2022193911A1 (en) Instruction information acquisition method and apparatus, readable storage medium, and electronic device
CN114727119B (en) Live broadcast continuous wheat control method, device and storage medium
CN113537193A (en) Illumination estimation method, illumination estimation device, storage medium, and electronic apparatus
CN114037781A (en) Animation generation method and device, electronic equipment and storage medium
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
CN113255421A (en) Image detection method, system, device and medium
CN113609387A (en) Playing content recommendation method and device, electronic equipment and storage medium
CN116152233B (en) Image processing method, intelligent terminal and storage medium
CN115866332B (en) Processing method, device and processing equipment for video frame insertion model
CN112559794A (en) Song quality identification method, device, equipment and storage medium
CN103549960A (en) Footprint acquisition and extraction device and footprint acquisition and extraction method
US11463652B2 (en) Write-a-movie: visualize your story from script
CN116597268B (en) Efficient multi-focus image fusion method and model building method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant