WO2023179692A1 - Motion video generation method and apparatus, terminal device, and storage medium - Google Patents

Motion video generation method and apparatus, terminal device, and storage medium Download PDF

Info

Publication number
WO2023179692A1
WO2023179692A1 · PCT/CN2023/083187
Authority
WO
WIPO (PCT)
Prior art keywords
visual target
video
target
tracking
visual
Prior art date
Application number
PCT/CN2023/083187
Other languages
English (en)
French (fr)
Inventor
龙良曲
郭士嘉
姜文杰
Original Assignee
影石创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 影石创新科技股份有限公司 filed Critical 影石创新科技股份有限公司
Publication of WO2023179692A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Definitions

  • Embodiments of the present invention relate to the field of video processing technology, and in particular to a motion video generation method, apparatus, terminal device and storage medium.
  • A panoramic video records all the visual information on the 360-degree sphere around the camera, so the user does not need to move the camera to frame the scene while shooting; after shooting is complete, the user can manually select the video content of any specific viewing angle for export and obtain a video of any visual target.
  • However, the user often has to manually review the entire panoramic video to pick out the exciting pictures for export, and the export process requires the user to select the viewing-angle target at every timestamp, which is cumbersome and inefficient.
  • Embodiments of the present invention provide a motion video generation method, apparatus, terminal device and storage medium that can automatically evaluate how exciting the viewing-angle objects in a panoramic video are, select the exciting objects for tracking, and export 2D videos of them.
  • In a first aspect, embodiments of the present invention provide a motion video generation method applied to an electronic terminal device. The method includes: using target boxes to mark at least one visual target in a key frame of a panoramic video, the key frame being any image frame of the panoramic video; using a neural network model that scores the excitement of objects in video to extract the RGB features of the pixels corresponding to each visual target based on target boxes scaled to a uniform size, and performing an excitement evaluation of each visual target based on its RGB features; selecting at least one visual target as a tracking visual target according to the excitement evaluation result; tracking the tracking visual target in each frame of the panoramic video and generating a motion trajectory sequence of the tracking visual target in the panoramic video; and, according to the motion trajectory sequence, projecting the image area occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image, thereby obtaining the motion video of the tracking visual target.
  • The above motion video generation method detects and marks the visual targets shown in the panoramic video, selecting them with target boxes in the panoramic image frames, and uses a preset neural network model that scores the excitement of objects in video: the target boxes of the different visual targets are scaled to a uniform size, the RGB features of the pixels inside the uniform-size boxes are extracted, an excitement score is computed for each visual target from its RGB features, the visual targets are sorted by excitement score, and the several highest-scoring visual targets are selected as tracking visual targets.
  • The tracking visual targets are then tracked and projected onto a plane perpendicular to the line of sight to obtain their motion videos. In this way the excitement of the visual targets shown in the panoramic video is evaluated automatically, and videos of the tracking visual targets that perform most impressively in the panoramic video are output. For example, videos of rare objects, moving objects and other visual targets that attract the user's attention can be output automatically, and the motion video of a tracked visual target is obtained without manually reviewing the panoramic video, so the operation is simple.
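The overall flow just described can be summarized in a short sketch. The Python skeleton below is illustrative only: `detector`, `ranker`, `tracker` and `project` are hypothetical callables standing in for the Detection, Ranking, MOTracker and projection stages described in this document, not the patent's actual implementation.

```python
# Illustrative skeleton of the pipeline: detect targets in a key frame,
# score their excitement, track the Top-k, and project each tracked
# region of the panorama into a planar (2D) video.
def generate_motion_video(panorama_frames, detector, ranker, tracker, project, k=3):
    key_frame = panorama_frames[0]                       # e.g. first frame as key frame
    boxes = detector(key_frame)                          # target boxes of visual targets
    scores = [ranker(key_frame, box) for box in boxes]   # excitement scores
    top = sorted(zip(scores, boxes), key=lambda p: p[0], reverse=True)[:k]
    videos = []
    for _, box in top:
        trajectory = tracker(panorama_frames, box)       # motion trajectory sequence
        videos.append([project(frame, b)                 # panorama region -> planar image
                       for frame, b in zip(panorama_frames, trajectory)])
    return videos
```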
  • In some embodiments, using target boxes to mark at least one visual target in a key frame of the panoramic video includes: marking the position coordinates of the object for the at least one visual target;
  • correspondingly, the tracking visual target is tracked in each frame of the panoramic video according to the position coordinates, and a motion trajectory sequence of the tracking visual target in the panoramic video is generated.
  • In some embodiments, the neural network model that scores the excitement of objects in video is built as follows:
  • each object in a set of panoramic images is annotated with a comprehensive score according to excitement evaluation criteria in multiple dimensions;
  • the multiple dimensions include: target category, motion state, character attributes, and saliency;
  • a multi-layer neural network is trained repeatedly on the annotated panoramic images until the difference between the excitement score the network outputs for an object and the object's annotated comprehensive score is smaller than a preset threshold, and the trained multi-layer neural network is used as the neural network model that scores the excitement of objects in video.
  • In some embodiments, the method further includes: in response to an editing instruction specified by the user, obtaining the object to be displayed and the video duration, and obtaining a plurality of tracking visual targets that match the object to be displayed;
  • the corresponding target motion videos are selected in turn, according to the excitement scores of the motion videos of the plurality of tracking visual targets, as the videos to be edited;
  • segments that match the video duration are intercepted from the videos to be edited, and the motion video of the user-specified display object is obtained.
  • In some embodiments, selecting at least one visual target as the tracking visual target according to the excitement evaluation result includes:
  • determining the visual target with the highest excitement score among all the visual targets of the key frame as the tracking visual target.
  • In some embodiments, selecting at least one visual target as the tracking visual target according to the excitement evaluation result includes:
  • selecting the corresponding visual targets as tracking visual targets in descending order of excitement score until the number of tracking visual targets reaches a preset number.
  • In some embodiments, using target boxes to mark at least one visual target in a key frame of the panoramic video includes: marking the object type of the at least one visual target;
  • extracting the RGB features of the pixels corresponding to each visual target based on the uniform-size target boxes and performing the excitement evaluation of each visual target based on its RGB features includes:
  • scoring the excitement of the at least one visual target with the preset neural network model, and outputting, according to the excitement score and the type of the at least one visual target, a tracking visual target that meets a preset condition.
  • In some embodiments, the method further includes:
  • when the excitement score of any visual target is greater than the excitement score of a tracking visual target, tracking that visual target in each frame of the panoramic video.
  • In a second aspect, embodiments of the present invention provide a motion video generation apparatus, which is provided in an electronic terminal device.
  • The apparatus includes:
  • a marking module configured to use target boxes to mark at least one visual target in a key frame of the panoramic video, the key frame being any image frame of the panoramic video;
  • an evaluation module configured to use the neural network model that scores the excitement of objects in video to extract the RGB features of the pixels corresponding to each visual target based on the uniform-size target boxes, and to evaluate the excitement of each visual target based on its RGB features;
  • a selection module configured to select at least one visual target as the tracking visual target according to the excitement evaluation result;
  • a trajectory generation module configured to track the tracking visual target in each frame of the panoramic video and generate a motion trajectory sequence of the tracking visual target in the panoramic video;
  • a projection module configured to project, according to the motion trajectory sequence, the image area occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image, obtaining the motion video of the tracking visual target.
  • In some embodiments, the marking module is specifically configured to mark the position coordinates of the object for the at least one visual target;
  • the trajectory generation module is specifically configured to track the tracking visual target in each frame of the panoramic video according to the position coordinates, and to generate a motion trajectory sequence of the tracking visual target in the panoramic video.
  • In some embodiments, the apparatus further includes a neural network training module, which is specifically configured to:
  • annotate each object in a set of panoramic images with a comprehensive score according to excitement evaluation criteria in multiple dimensions;
  • the multiple dimensions include: target category, motion state, character attributes, and saliency;
  • train a multi-layer neural network repeatedly on the annotated panoramic images until the difference between the excitement score the network outputs for an object and the object's annotated comprehensive score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the excitement of objects in video.
  • In some embodiments, the apparatus further includes:
  • a response module configured to respond to an editing instruction specified by the user and obtain the object to be displayed and the video duration;
  • an acquisition module configured to obtain a plurality of tracking visual targets that match the object to be displayed;
  • a selection module configured to select the corresponding target motion videos in turn, according to the excitement scores of the motion videos of the plurality of tracking visual targets, as the videos to be edited;
  • an interception module configured to intercept, from the videos to be edited, segments that match the video duration and obtain the motion video of the user-specified display object.
  • In some embodiments, the evaluation module is specifically configured to determine the visual target with the highest excitement score among all the visual targets of the key frame as the tracking visual target.
  • In some embodiments, the evaluation module is specifically configured to select the corresponding visual targets as tracking visual targets in descending order of excitement score until the number of tracking visual targets reaches a preset number.
  • In some embodiments, the marking module is specifically configured to mark the object type for the at least one visual target;
  • the evaluation module includes:
  • a scoring submodule configured to use the preset neural network model that scores the excitement of objects in video to score the excitement of the at least one visual target;
  • an output submodule configured to output, according to the excitement score and the type of the at least one visual target, a tracking visual target that meets a preset condition.
  • In some embodiments, the apparatus further includes:
  • an extraction module configured to use the neural network model that scores the excitement of objects in video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to other visual targets;
  • a tracking module configured to track any visual target in each frame of the panoramic video when the excitement score of that visual target is greater than the excitement score of a tracking visual target.
  • In another aspect, embodiments of the present invention provide a terminal device, including at least one processor and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method provided in the first aspect.
  • In another aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method provided in the first aspect.
  • Figure 1 is a flow chart of the steps for training a multi-layer neural network according to an embodiment of the present invention;
  • Figure 2 is a flow chart of the steps of the motion video generation method proposed by an embodiment of the present invention;
  • Figure 3 is a structure diagram of the models on which an embodiment of the present invention performs the motion video generation method;
  • Figure 4 is a schematic diagram of a key frame output by the Detection model in an example of the present invention;
  • Figure 5 is a schematic diagram of a motion trajectory sequence output by the MOTracker model in an example of the present invention;
  • Figure 6 is a structure diagram of the models on which an embodiment of the present invention performs another motion video generation method;
  • Figure 7 is a flow chart of the steps of another motion video generation method proposed by an embodiment of the present invention;
  • Figure 8 is a data flow diagram of executing the motion video generation method according to an embodiment of the present invention;
  • Figure 9 is a schematic diagram of panoramic video key frame A in an example of the present invention;
  • Figure 10 is a functional module diagram of the motion video generation apparatus proposed by an embodiment of the present invention;
  • Figure 11 is a schematic structural diagram of an electronic terminal device provided by an embodiment of the present invention;
  • Figure 12 is a schematic structural diagram of a terminal device provided by an embodiment of this specification.
  • To address the above problem, the inventors propose training a multi-layer neural network on panoramic images annotated with excitement scores, to obtain a neural network model that evaluates the excitement of visual targets from multiple aspects such as target category, motion state, character attributes, and saliency.
  • That is, the applicant pre-trains a model that can rate the excitement of objects in panoramic images.
  • Figure 1 is a flow chart of the steps for training a multi-layer neural network according to an embodiment of the present invention.
  • The embodiment of the present invention trains a multi-layer neural network to obtain a neural network model that evaluates visual targets on multiple features such as target category, motion state, character attributes, and saliency, and regresses their excitement scores. The training steps include:
  • S101 Obtain panoramic images. A panoramic video of a certain area can be collected and its image frames extracted as panoramic images; panoramic images can also be shot directly or obtained from a database.
  • S102 Annotate each object in the panoramic images with a comprehensive score based on excitement evaluation criteria in multiple dimensions; the multiple dimensions include: target category, motion state, character attributes, and saliency.
  • Specifically, the visual targets in a panoramic image can first be detected by a detector, and the detected visual targets pre-labeled with bounding boxes (bbox).
  • The excitement level is discretized into 4 levels, and levels 1-4 correspond to specific quantitative scores; for example, levels 1-4 are scored as -5, 0, 3 and 5 points respectively. Each panoramic image is annotated by multiple people, and the scores are weighted and averaged. For example, if the same target is rated by 5 people and the corresponding scores are [0, 3, 3, 0, 3], the weighted score of the target is 1.8 points.
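The level-to-score mapping and multi-annotator averaging in this example can be reproduced directly. A minimal sketch follows; the equal default weights are an assumption, since the document only says the scores are weighted and averaged.

```python
# Mapping from the four discrete excitement levels to quantitative scores,
# as in the example above (levels 1-4 -> -5, 0, 3, 5).
LEVEL_SCORES = {1: -5, 2: 0, 3: 3, 4: 5}

def annotation_score(level_votes, weights=None):
    """Combine per-annotator level votes into one supervision score.
    With uniform weights this reduces to a plain average."""
    scores = [LEVEL_SCORES[v] for v in level_votes]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

# Five annotators rate the same target with scores [0, 3, 3, 0, 3]
# (levels [2, 3, 3, 2, 3]); the averaged score is 1.8 points.
assert abs(annotation_score([2, 3, 3, 2, 3]) - 1.8) < 1e-9
```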
  • S103 Train the multi-layer neural network repeatedly on the annotated panoramic images until the difference between the excitement score the network outputs for an object and the object's annotated comprehensive score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the excitement of objects in video.
  • Specifically, the comprehensive scores annotated on the objects shown in the panoramic images are used as the supervision signal for supervised training of the multi-layer neural network.
  • The multi-layer neural network extracts the RGB features of the pixels corresponding to each object in a panoramic image and scores the object along the multiple dimensions; the loss between the multi-dimensional score and the annotated comprehensive score is computed, and the parameters of the multi-layer neural network are adjusted according to the loss value until the model converges, yielding the neural network (Ranking) model that scores the excitement of objects in video.
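A minimal PyTorch sketch of this supervised training loop follows. The small convolutional backbone, the L1 loss, and the stopping threshold are illustrative assumptions; the patent does not specify the network architecture.

```python
import torch
import torch.nn as nn

class RankingNet(nn.Module):
    """Regresses an excitement score from a uniform-size RGB crop."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                # small conv feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 1)                  # scalar excitement score

    def forward(self, x):
        return self.head(self.features(x)).squeeze(-1)

def train(model, loader, threshold=0.5, epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()                             # |predicted - annotated score|
    for _ in range(epochs):
        worst = 0.0
        for crops, scores in loader:                  # uniform-size RGB crops, labels
            loss = loss_fn(model(crops), scores)
            opt.zero_grad()
            loss.backward()
            opt.step()
            worst = max(worst, loss.item())
        if worst < threshold:                         # stop once output and annotation
            break                                     # differ by less than the threshold
    return model
```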
  • On this basis, the applicant further proposes detecting and marking the visual targets shown in key frames of a panoramic video, inputting the equirectangular (latitude-longitude) image carrying the marked key frame into the neural network (Ranking) model used to score the excitement of objects in video, scoring the excitement of the at least one visual target, and outputting a tracking visual target that meets a preset condition.
  • Figure 2 is a flow chart of the steps of the motion video generation method proposed by an embodiment of the present invention.
  • Figure 3 is a structure diagram of the models on which the embodiment of the present invention performs the motion video generation method. As shown in Figure 3, the models on which the method is performed include: a Detection model, a Ranking model and a MOTracker model.
  • The process of executing the motion video generation method includes:
  • S201 Use target boxes to mark at least one visual target in a key frame of the panoramic video; the key frame is any image frame of the panoramic video.
  • The key frame can be the first frame of the panoramic video or the highest-quality image frame in the panoramic video.
  • The embodiment of the present invention can use the Detection model to perform step S201 to detect the visual targets in the key frame of the panoramic video.
  • The Detection model analyzes the key frame of the panoramic video, detects the rectangular bounding boxes (bbox) of all objects belonging to predefined categories, and provides one or more visual targets to be tracked as candidate targets, which can then be evaluated by the neural network model that performs excitement scoring on objects in the video.
  • The Detection model can use industry-standard object detectors such as Faster R-CNN, RetinaNet or CenterNet, and can be trained on annotated panoramic images.
  • The predefined categories can be determined by user input; for example, in response to a user instruction specifying animals, the animal category is determined as a predefined category.
  • Marking at least one visual target in the key frame of the panoramic video includes: detecting visual targets belonging to the predefined categories in the key frame, such as buildings, people, pets and landscape objects, and generating a target box (bbox) for each detected visual target to select the multiple pixels corresponding to that visual target.
  • Figure 4 is a schematic diagram of a key frame output by the Detection model in an example of the present invention.
  • The Detection model detects the visual targets in the key frame and generates target boxes (bbox) to select them; the pixels inside a marked box can be taken as the multiple pixels corresponding to that visual target.
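As a concrete illustration of the Detection step, the sketch below uses an off-the-shelf Faster R-CNN from torchvision, one of the industry-standard detectors named above. A production Detection model would be retrained on annotated panoramic images; this is illustrative only.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

# Pretrained detector standing in for the patent's Detection model.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_targets(key_frame_rgb, score_thresh=0.5):
    """Return bounding boxes (bbox) and labels of candidate visual targets
    detected in a key frame image (H x W x 3, RGB)."""
    with torch.no_grad():
        out = detector([to_tensor(key_frame_rgb)])[0]
    keep = out["scores"] > score_thresh               # confident detections only
    return out["boxes"][keep], out["labels"][keep]    # candidate targets to rank
```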
  • S202 Use the neural network model that scores the excitement of objects in video to extract the RGB features of the pixels corresponding to each visual target based on the uniform-size target boxes, and evaluate the excitement of each visual target based on its RGB features.
  • The neural network model that scores the excitement of objects in video (the Ranking model) is connected after the Detection model.
  • S203 Select at least one visual target as the tracking visual target according to the excitement evaluation result.
  • One implementation of selecting at least one visual target as a tracking visual target according to the excitement evaluation result includes:
  • determining the visual target with the highest excitement score among all the visual targets of the key frame as the tracking visual target.
  • Another implementation of selecting at least one visual target as the tracking visual target according to the excitement evaluation result includes:
  • selecting the corresponding visual targets as tracking visual targets in descending order of excitement score until the number of tracking visual targets reaches a preset number.
  • Specifically, the marked key frame is input to the Ranking model.
  • The Ranking model obtains the position of each visual target's bbox, obtains the RGB features of the pixels corresponding to the visual target from the bbox coordinates, and scales the RGB features of each visual target's pixels to a uniform size; the Ranking model can then predict the excitement score of each visual target from the RGB features.
  • The Ranking model outputs the tracking visual targets that meet a preset condition.
  • For example, the preset condition is the Top-k visual targets with the highest excitement scores.
  • The Ranking model sorts the visual targets by their excitement scores and selects the Top-k highest-scoring visual targets in turn.
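A minimal sketch of this crop-resize-score-select flow, assuming `ranking_model` maps a uniform-size RGB crop to a scalar excitement score and `bboxes` is an (N, 4) NumPy array of (x1, y1, x2, y2) coordinates:

```python
import numpy as np
import cv2

def top_k_targets(panorama, bboxes, ranking_model, k=3, size=(128, 128)):
    """Crop each bbox, unify crop size, score, and keep the Top-k targets."""
    scores = []
    for (x1, y1, x2, y2) in bboxes.astype(int):
        crop = panorama[y1:y2, x1:x2]                 # RGB pixels inside the bbox
        crop = cv2.resize(crop, size)                 # scale crops to a uniform size
        scores.append(ranking_model(crop))            # predicted excitement score
    order = np.argsort(scores)[::-1][:k]              # indices of Top-k highest scores
    return [bboxes[i] for i in order], [scores[i] for i in order]
```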
  • These targets are handed to the MOTracker model, which tracks the Top-k visual targets, derives the image area occupied by each tracked visual target in each frame of the panoramic video, and obtains the motion video of the tracked visual targets.
  • Specifically, the process by which the MOTracker model tracks the Top-k visual targets and derives the image areas they occupy in each frame of the panoramic video to obtain their motion videos includes:
  • the MOTracker model accepts the Top-k target bboxes output by the Ranking model and tracks them with open-source deep tracking models or traditional tracking algorithms. For example, multiple open-source single-target tracking algorithms such as STAPLE and LightTrack can be combined to achieve multi-target tracking, or a single multi-target tracking algorithm such as FairMOT can be used. The motion trajectory sequence generated for each tracked bbox is saved to an offline file; for example, the structured panoramic video data can be saved in a JSON file.
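A minimal sketch of this stage follows, with OpenCV's CSRT tracker (available via opencv-contrib-python) standing in as an illustrative substitute for trackers such as STAPLE or LightTrack; the JSON layout is an assumption, since the document only says the trajectories are saved as structured data in a JSON file.

```python
import json
import cv2

def track_to_json(frames, init_bboxes, path="trajectories.json"):
    """Track each initial bbox (x, y, w, h) through the frame list and save
    the per-target motion trajectory sequences to an offline JSON file."""
    trajectories = {}
    for tid, bbox in enumerate(init_bboxes):          # one tracker per visual target
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frames[0], tuple(int(v) for v in bbox))
        track = [list(bbox)]
        for frame in frames[1:]:
            ok, box = tracker.update(frame)           # bbox in the next frame
            track.append([float(v) for v in box] if ok else None)
        trajectories[str(tid)] = track
    with open(path, "w") as f:
        json.dump(trajectories, f)                    # structured offline trajectory data
    return trajectories
```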
  • S204 Track the tracking visual target in each frame of the panoramic video, and generate a motion trajectory sequence of the tracking visual target in the panoramic video.
  • S205 According to the motion trajectory sequence, project the image area occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image, and obtain the motion video of the tracking visual target.
  • The planar image may be an image displayed on a plane perpendicular to the user's line of sight.
  • A panoramic projection algorithm can be used to project any trajectory and generate a 2D motion trajectory video for each viewing angle, thereby achieving automatic editing of the panoramic video.
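A standard way to realize this projection is a gnomonic re-projection of the equirectangular panorama around the tracked target's viewing direction. The sketch below is one such implementation under common equirectangular conventions, not necessarily the patent's exact algorithm; `lon0` and `lat0` are the target's direction on the sphere in radians.

```python
import numpy as np
import cv2

def panorama_to_plane(pano, lon0, lat0, fov_deg=90.0, out_w=640, out_h=480):
    """Render a perspective (planar) view of an equirectangular panorama,
    i.e. an image on a plane perpendicular to the line of sight."""
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)   # pinhole focal length
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    x = (u - out_w / 2) / f
    y = (v - out_h / 2) / f
    z = np.ones_like(x)
    norm = np.sqrt(x**2 + y**2 + z**2)
    x, y, z = x / norm, y / norm, z / norm              # unit viewing rays
    # pitch the rays so the optical axis points at latitude lat0
    sin_t, cos_t = np.sin(lat0), np.cos(lat0)
    y, z = y * cos_t - z * sin_t, y * sin_t + z * cos_t
    lon = np.arctan2(x, z) + lon0                       # ray direction -> sphere coords
    lat = np.arcsin(np.clip(y, -1, 1))
    h, w = pano.shape[:2]
    map_x = ((lon / (2 * np.pi) + 0.5) % 1.0) * (w - 1) # sphere -> equirect pixel
    map_y = (lat / np.pi + 0.5) * (h - 1)
    return cv2.remap(pano, map_x.astype(np.float32),
                     map_y.astype(np.float32), cv2.INTER_LINEAR)
```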
  • In some embodiments, the Detection model can also output the position coordinates of the visual targets.
  • In this case, S201 includes sub-step S2011: marking the at least one visual target with the position coordinates of the object.
  • Tracking the tracking visual target in each frame of the panoramic video and generating its motion trajectory sequence then includes: tracking the tracking visual target in each frame of the panoramic video according to the position coordinates, and generating the motion trajectory sequence of the tracking visual target in the panoramic video.
  • Figure 5 is a schematic diagram of a motion trajectory sequence output by the MOTracker model in an example of the present invention.
  • Tracking the tracking visual target in each frame of the panoramic video and generating its motion trajectory sequence includes sub-steps S2031 to S2033:
  • S2031 Track the tracking visual target in the panoramic video according to its position coordinates, and obtain multiple target boxes that completely display the tracking visual target;
  • S2032 Connect the positions of the tracking visual target in the multiple target boxes to obtain the motion trajectory sequence of the tracking visual target;
  • S2033 According to the motion trajectory sequence, project the image area occupied by the tracking visual target in each frame of the panoramic video onto a plane perpendicular to the user's line of sight to obtain the motion video of the tracking visual target.
  • Figure 6 is a structure diagram of the models on which an embodiment of the present invention performs another motion video generation method.
  • The models on which this embodiment performs the motion video generation method include: a Detection model, a Ranking model, a MOTracker model and an AutoEditor model.
  • The editing model (AutoEditor) analyzes the tracking sequences of multiple panoramic videos, sorts the tracking sequences by excitement score, and selects the Top-P highest-scoring sequences as the target sequences to be edited. The target sequences have different durations and can be edited according to the user's template duration or a set duration.
  • For example, a heuristic search algorithm is used to find the 3 s segment with the highest score and clip it, as in the sketch below.
  • The first Top-P sequences are edited to obtain P video clips whose durations meet the conditions, and the final video compilation is obtained by splicing them together.
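A simple realization of the segment search is a sliding window over per-frame excitement scores. The sketch below assumes the tracked sequence has one score per frame and uses a plain windowed sum; the patent's actual heuristic is not specified.

```python
import numpy as np

def best_segment(frame_scores, fps=30, duration_s=3.0):
    """Return (start, end) frame indices of the highest-scoring window."""
    win = int(fps * duration_s)                       # window length in frames
    scores = np.asarray(frame_scores, dtype=float)
    if len(scores) <= win:
        return 0, len(scores)                         # sequence shorter than the window
    sums = np.convolve(scores, np.ones(win), mode="valid")  # windowed score totals
    start = int(np.argmax(sums))                      # highest-scoring 3 s window
    return start, start + win

# Example: pick the best 3 s clip from a 10 s tracked sequence at 30 fps.
start, end = best_segment(np.random.rand(300))
```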
  • Figure 7 is a flow chart of the steps of another motion video generation method proposed by an embodiment of the present invention. As shown in Figures 6 and 7, the steps of this method include:
  • S701 Use target boxes to mark multiple visual targets in a key frame of the panoramic video; the key frame is any image frame of the panoramic video.
  • S702 Use the neural network model that scores the excitement of objects in video to extract the RGB features of the pixels corresponding to each visual target based on the uniform-size target boxes, and compute an excitement score for each visual target based on its RGB features.
  • S703 Use the neural network model to select the corresponding visual targets as tracking visual targets in descending order of excitement score until the number of tracking visual targets reaches a preset number.
  • S704 Track the tracking visual targets in each frame of the panoramic video, and generate a motion trajectory sequence of each tracking visual target in the panoramic video.
  • S705 According to the motion trajectory sequences, project the image area occupied by each tracking visual target in each frame of the panoramic video onto a plane perpendicular to the user's line of sight to obtain the motion video of each tracking visual target.
  • S706 Respond to the editing instruction specified by the user and obtain the object to be displayed and the video duration.
  • S709 Intercept, from the video to be edited, segments that match the video duration, and obtain the motion video of the user-specified display object.
  • In some embodiments, the selection threshold for a specified category is adjusted according to the category instruction input by the user.
  • For example, when the category instruction input by the user is "pet",
  • the threshold for pet visual targets can be relaxed. Suppose the lowest excitement score among the Top-k visual targets is M1 and the excitement score of pet visual target A is M2, with M2 < M1; pet visual target A can then still be selected as a tracking visual target, as sketched below.
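A minimal sketch of this category-aware selection; the relaxation margin is an illustrative assumption, since the document only states that the threshold for the user-specified category is adjusted.

```python
def select_tracking_targets(targets, k=3, preferred=None, relax=2.0):
    """targets: list of (category, excitement_score) pairs.
    Admit a target of the user-requested category even when its score M2
    falls below the Top-k cut-off M1, by relaxing the threshold."""
    ranked = sorted(targets, key=lambda t: t[1], reverse=True)
    selected = ranked[:k]
    m1 = selected[-1][1]                              # lowest score among the Top-k
    for cat, m2 in ranked[k:]:
        if cat == preferred and m2 >= m1 - relax:     # relaxed threshold for the
            selected.append((cat, m2))                # user-specified category
    return selected
```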
  • In some embodiments, step S201 also includes sub-step S201-1:
  • S201-1 Mark the at least one visual target with its object type.
  • Step S202 then includes sub-steps S202-1 and S202-2:
  • S202-1 Use the preset neural network model that scores the excitement of objects in video to score the excitement of the at least one visual target;
  • S202-2 According to the excitement score and the type of the at least one visual target, output a tracking visual target that meets the preset conditions.
  • When visual targets are equally exciting, the visual target whose type matches the user's requirement is output first as the tracking visual target.
  • For example, the visual targets in a key frame of a panoramic video include a building visual target and a pet visual target,
  • and the requirement type input by the user is building visual targets.
  • If the building visual target and the pet visual target are equally exciting, the building visual target is output as the tracking visual target.
  • Figure 8 is a data flow diagram of executing the motion video generation method according to an embodiment of the present invention.
  • Figure 9 is a schematic diagram of panoramic video key frame A in an example of the present invention. As shown in Figures 8 and 9, the process of executing the motion video generation method in this example is as follows:
  • K11 Input key frame A (a panoramic picture) of the panoramic video into the panoramic detector (Detection model).
  • The panoramic detector marks the visual targets of the panoramic picture and outputs bbox-1, bbox-2 and bbox-3,
  • where the object type of bbox-1 is a building,
  • the object type of bbox-2 is a person,
  • and the object type of bbox-3 is a telephone pole.
  • K12 Resize bbox-1, bbox-2 and bbox-3 to the same size, and, according to the position coordinates of bbox-1, bbox-2 and bbox-3, extract the RGB features of the pixels in the uniformly sized bbox-1, bbox-2 and bbox-3.
  • K13 Input the RGB features into the Ranking model.
  • The Ranking model scores bbox-1, bbox-2 and bbox-3 for excitement, and outputs bbox-1 as the highest-scoring target.
  • K14 The MOTracker model tracks bbox-1 and obtains the motion trajectory sequence of the visual target corresponding to bbox-1.
  • K15 Finally, select appropriate FOV parameters to render the viewing-angle motion trajectory sequence and generate a 2D video of the motion trajectory sequence.
  • The FOV parameters can be adapted to the position and size of the visual target: for visual targets with a larger height/width, a larger FOV can be used for rendering; for visual targets with a smaller height/width, smaller FOV parameters can be chosen, as in the sketch below.
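A minimal sketch of such an adaptive FOV rule; the linear interpolation and its breakpoints are illustrative assumptions, not values from the patent.

```python
def choose_fov(bbox, pano_w, min_fov=60.0, max_fov=110.0):
    """Pick a rendering FOV from the target's relative size: larger targets
    get a wider FOV so they fit the view, smaller targets a narrower one."""
    x1, y1, x2, y2 = bbox
    extent = max(x2 - x1, y2 - y1) / pano_w           # target size relative to panorama
    return min_fov + (max_fov - min_fov) * min(1.0, extent * 4)

fov = choose_fov((100, 200, 400, 560), pano_w=3840)   # example target box
```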
  • In some embodiments, the method further includes: using the neural network model that scores the excitement of objects in video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to other visual targets;
  • and, when the excitement score of any visual target is greater than the excitement score of a tracking visual target, tracking that visual target in each frame of the panoramic video.
  • An example of the present invention proposes an implementation of a panoramic video export method.
  • The first panoramic image frame of the panoramic video is obtained as the key frame, the visual targets of the first frame are marked, and the first frame carrying the marks is input into the neural network model that scores the excitement of objects in video, which detects the two tracking visual targets with the highest excitement scores: animal A and human B.
  • Animal A and human B are tracked in the second panoramic image frame of the panoramic video; the visual targets of the second frame are marked, and the second frame carrying the marks is input into the neural network model that scores the excitement of objects in video.
  • The model detects the second panoramic image frame and outputs the two tracking visual targets with the highest excitement scores: animal A and animal C. Animal A, human B and animal C are then all tracked in the panoramic video.
  • Figure 10 is a functional module diagram of the motion video generation apparatus proposed by an embodiment of the present invention.
  • The above motion video generation apparatus is provided in a terminal device.
  • The apparatus includes:
  • a marking module 10 configured to use target boxes to mark at least one visual target in a key frame of the panoramic video, the key frame being any image frame of the panoramic video;
  • an evaluation module 11 configured to use the neural network model that scores the excitement of objects in video to extract the RGB features of the pixels corresponding to each visual target based on the uniform-size target boxes, and to evaluate the excitement of each visual target based on its RGB features;
  • a selection module 12 configured to select at least one visual target as the tracking visual target according to the excitement evaluation result;
  • a trajectory generation module 13 configured to track the tracking visual target in each frame of the panoramic video, and to generate a motion trajectory sequence of the tracking visual target in the panoramic video;
  • a projection module 14 configured to project, according to the motion trajectory sequence, the image area occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image, obtaining the motion video of the tracking visual target.
  • The motion video generation apparatus of the embodiment shown in Figure 10 can be used to execute the technical solutions of the method embodiments shown in Figures 1 to 9 of this specification. For its implementation principles and technical effects, reference can be made to the relevant descriptions in the method embodiments.
  • In some embodiments, the marking module is specifically configured to mark the position coordinates of the object on the at least one visual target;
  • the trajectory generation module is specifically configured to track the tracking visual target in each frame of the panoramic video according to the position coordinates, and to generate a motion trajectory sequence of the tracking visual target in the panoramic video.
  • In some embodiments, the apparatus further includes a neural network training module, which is specifically configured to:
  • train the multi-layer neural network repeatedly on annotated panoramic images until the difference between the excitement score the network outputs for an object and the object's annotated comprehensive score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the excitement of objects in video.
  • In some embodiments, the apparatus further includes:
  • a response module configured to respond to an editing instruction specified by the user and obtain the object to be displayed and the video duration;
  • an acquisition module configured to obtain a plurality of tracking visual targets that match the object to be displayed;
  • a selection module configured to select the corresponding target motion videos in turn, according to the excitement scores of the motion videos of the plurality of tracking visual targets, as the videos to be edited;
  • an interception module configured to intercept, from the videos to be edited, segments that match the video duration and obtain the motion video of the user-specified display object.
  • In some embodiments, the evaluation module is specifically configured to determine the visual target with the highest excitement score among all the visual targets of the key frame as the tracking visual target.
  • In some embodiments, the evaluation module is specifically configured to select the corresponding visual targets as tracking visual targets in descending order of excitement score until the number of tracking visual targets reaches a preset number.
  • In some embodiments, the marking module is specifically configured to mark the object type for the at least one visual target;
  • the evaluation module includes:
  • a scoring submodule configured to use the preset neural network model that scores the excitement of objects in video to score the excitement of the at least one visual target;
  • an output submodule configured to output, according to the excitement score and the type of the at least one visual target, a tracking visual target that meets a preset condition.
  • In some embodiments, the apparatus further includes:
  • an extraction module configured to use the neural network model that scores the excitement of objects in video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to other visual targets;
  • a tracking module configured to track any visual target in each frame of the panoramic video when the excitement score of that visual target is greater than the excitement score of a tracking visual target.
  • The apparatus provided in the above embodiments may be, for example, a chip or a chip module.
  • The apparatuses provided by the above embodiments are used to execute the technical solutions of the above method embodiments; for their implementation principles and technical effects, reference can be made to the relevant descriptions in the method embodiments, which will not be repeated here.
  • Each module/unit included in each apparatus described in the above embodiments may be a software module/unit or a hardware module/unit, or partly a software module/unit and partly a hardware module/unit.
  • For an apparatus applied to or integrated in a chip, each module/unit it includes may be implemented in hardware such as circuits, or at least some of the modules/units may be implemented as software programs that run on a processor integrated inside the chip, with the remaining modules/units implemented in hardware such as circuits.
  • For an apparatus applied to or integrated in a chip module, each module/unit it includes may be implemented in hardware such as circuits, and different modules/units can be located in the same component of the chip module (such as a chip or a circuit module) or in different components; or at least some of the modules/units may be implemented as software programs that run on a processor integrated inside the chip module, with the remaining modules/units implemented in hardware such as circuits.
  • For an apparatus applied to or integrated in an electronic terminal device, each module/unit it includes may be implemented in hardware such as circuits, and different modules/units can be located in the same component (e.g., a chip or circuit module) of the electronic terminal device or in different components; or at least some of the modules/units may be implemented as software programs that run on a processor integrated inside the electronic terminal device, with the remaining (if any) modules/units implemented in hardware such as circuits.
  • FIG 11 is a schematic structural diagram of an electronic terminal device provided by an embodiment of the present invention.
  • The electronic terminal device 1100 includes a processor 1110, a memory 1111, and a computer program stored on the memory 1111 and executable on the processor 1110.
  • When the processor 1110 executes the program, it implements the steps in the foregoing method embodiments.
  • The electronic terminal device provided by the embodiments can be used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference can be made to the relevant descriptions in the method embodiments, which will not be repeated here.
  • Figure 12 is a schematic structural diagram of a terminal device provided by an embodiment of this specification. As shown in Figure 12, the terminal device may include at least one processor and at least one memory communicatively connected with the processor, wherein the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, can execute the motion video generation method provided by the embodiments shown in Figures 1 to 9 of this specification.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal device 100.
  • The terminal device 100 may include more or fewer components than shown in the figures, or combine some components, or split some components, or use a different arrangement of components.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a mobile communication module 150, a wireless communication module 160, an indicator 192, a camera 193, a display screen 194, etc.
  • the processor 110 may include one or more processing units.
  • The processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • The memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves system efficiency.
  • The processor 110 executes various functional applications and data processing by running programs stored in the internal memory 121, for example implementing the motion video generation method provided by the embodiments shown in Figures 1 to 9 of the present invention.
  • the wireless communication function of the terminal device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in terminal device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example: Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the terminal device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLed, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (QLED), etc.
  • the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the terminal device 100 can implement the shooting function through the ISP, camera 193, video codec, GPU, display screen 194, application processor, etc.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the optical signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the terminal device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital video.
  • The terminal device 100 may support one or more video codecs, so that it can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, etc.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the stored program area can store an operating system, at least one application program required for a function (such as a sound playback function, an image playback function, etc.).
  • the storage data area may store data created during use of the terminal device 100 (such as audio data, phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • Embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions.
  • the computer instructions cause the computer to execute the embodiments shown in Figures 1 to 9 of this specification.
  • Non-transitory computer-readable storage media may refer to non-volatile computer storage media.
  • the above-mentioned non-transitory computer-readable storage medium may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • A non-exhaustive list of computer-readable storage media includes: an electrical connection having one or more conductors, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave that carries computer-readable program code. Such propagated data signals may take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations described herein may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features defined with "first" or "second" may explicitly or implicitly include at least one such feature.
  • plurality means at least two, such as two, three, etc., unless otherwise clearly and specifically limited.
  • the word “if” as used herein may be interpreted as “when” or “when” or “in response to determination” or “in response to detection.”
  • the phrase “if determined” or “if (stated condition or event) is detected” may be interpreted as “when determined” or “in response to determining” or “when (stated condition or event) is detected )” or “in response to detecting (a stated condition or event)”.
  • terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, etc.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • each functional unit in each embodiment of this specification may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some of the steps of the methods described in the various embodiments of this specification.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention provide a motion video generation method and apparatus, a terminal device, and a storage medium, relating to the technical field of video processing; they can automatically evaluate the highlight level of viewing-angle objects in a panoramic video and select highlight objects to track and export as a 2D video. The method includes: marking at least one visual target in a key frame of the panoramic video; using a preset neural network model that scores the highlight level of objects in a video, scoring the highlight level of the at least one visual target and outputting a tracking visual target that satisfies a preset condition; and exporting the image region occupied by the tracking visual target in each frame of the panoramic video to obtain a motion video of the tracking visual target.

Description

Motion video generation method and apparatus, terminal device, and storage medium [Technical Field]
Embodiments of the present invention relate to the technical field of video processing, and in particular to a motion video generation method and apparatus, a terminal device, and a storage medium.
[Background]
Conventional cameras record video with an extremely narrow field of view and therefore miss many important details. Panoramic video solves this limitation by recording all the visual information on the 360-degree sphere around the camera: the user does not need to move the camera to frame shots during capture, and after shooting can manually select and export video content from a particular viewing angle to obtain a video of any visual target. At present, however, a user typically has to review all the footage of the panoramic video manually, select the highlight frames for export, and specify the viewing-angle target at every timestamp during export, which is cumbersome and inefficient.
[Summary]
Embodiments of the present invention provide a motion video generation method and apparatus, a terminal device, and a storage medium, which can automatically evaluate the highlight level of viewing-angle objects in a panoramic video and select highlight objects to track and export as a 2D video.
In a first aspect, an embodiment of the present invention provides a motion video generation method applied to an electronic terminal device: marking at least one visual target in a key frame of a panoramic video with a target box, the key frame being any image frame in the panoramic video; using a neural network model that scores the highlight level of objects in a video, extracting the RGB features of the pixels corresponding to each visual target based on the target boxes resized to a uniform size, and evaluating the highlight level of each visual target according to its RGB features; selecting at least one visual target as a tracking visual target according to the highlight evaluation result; tracking the tracking visual target in every frame of the panoramic video to generate a motion trajectory sequence of the tracking visual target in the panoramic video; and, according to the motion trajectory sequence, projecting the image region occupied by the target box of the tracking visual target in each frame of the panoramic video into a planar image, thereby obtaining a motion video of the tracking visual target.
In the above motion video generation method, the visual targets shown in the panoramic video are detected and marked, and target boxes are used to frame the visual targets in the image frames of the panoramic video. Using a preset neural network model that scores the highlight level of objects in a video, the target boxes of the different visual targets are resized to a uniform size, the RGB features of the pixels inside the resized boxes are extracted, a highlight score is computed for each visual target from its RGB features, the visual targets are ranked by highlight score, and several of the highest-scoring visual targets are selected as tracking visual targets. The tracking visual targets are tracked in every frame of the panoramic video to generate their motion trajectory sequences; according to the trajectory sequences, the image region occupied by each tracking visual target in each frame of the panoramic video is projected onto a plane perpendicular to the user's line of sight to obtain a motion video of the tracking visual target. This automatically evaluates the highlight level of the visual targets shown in the panoramic video and outputs videos of the tracking visual targets that perform most impressively in it, for example automatically outputting videos of eye-catching visual targets such as rare objects or moving objects, without requiring the user to review the panoramic video manually, which makes the operation simple.
In one possible implementation, marking at least one visual target in a key frame of the panoramic video with a target box includes:
annotating the position coordinates of the at least one visual target;
and tracking the tracking visual target in every frame of the panoramic video to generate the motion trajectory sequence of the tracking visual target in the panoramic video includes:
tracking the tracking visual target in every frame of the panoramic video according to the position coordinates to generate the motion trajectory sequence of the tracking visual target in the panoramic video.
In one possible implementation, the neural network model that scores the highlight level of objects in a video is set up as follows:
obtaining panoramic images;
annotating each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions, the multiple dimensions including target category, motion state, person attributes, and saliency;
and training a multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and using the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
In one possible implementation, the method further includes:
in response to a clipping instruction specified by a user, obtaining an object to be displayed and a video duration;
obtaining multiple tracking visual targets that match the object to be displayed;
selecting the corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as videos to be clipped;
and cutting segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
In one possible implementation, selecting at least one visual target as a tracking visual target according to the highlight evaluation result includes:
determining, among all the visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
In one possible implementation, selecting at least one visual target as a tracking visual target according to the highlight evaluation result includes:
selecting the corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of tracking visual targets reaches a preset number.
In one possible implementation, marking at least one visual target in a key frame of the panoramic video with a target box includes:
annotating the object type of the at least one visual target;
and using the neural network model that scores the highlight level of objects in a video, extracting the RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluating the highlight level of each visual target according to its RGB features includes:
scoring the highlight level of the at least one visual target using the preset neural network model that scores the highlight level of objects in a video;
and outputting a tracking visual target that satisfies a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
In one possible implementation, after the tracking visual target is tracked in every frame of the panoramic video, the method further includes:
using the neural network model that scores the highlight level of objects in a video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to visual targets other than the tracking visual target;
and, when the highlight score of any such visual target is greater than the highlight score of the tracking visual target, tracking that visual target in every frame of the panoramic video.
In a second aspect, an embodiment of the present invention provides a motion video generation apparatus arranged in an electronic terminal device, the apparatus including:
a marking module, configured to mark at least one visual target in a key frame of a panoramic video with a target box, the key frame being any image frame in the panoramic video;
an evaluation module, configured to use a neural network model that scores the highlight level of objects in a video, extract the RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluate the highlight level of each visual target according to its RGB features;
a selection module, configured to select at least one visual target as a tracking visual target according to the highlight evaluation result;
a trajectory generation module, configured to track the tracking visual target in every frame of the panoramic video and generate a motion trajectory sequence of the tracking visual target in the panoramic video;
and a projection module, configured to project, according to the motion trajectory sequence, the image region occupied by the target box of the tracking visual target in each frame of the panoramic video into a planar image to obtain a motion video of the tracking visual target.
In one possible implementation, the marking module is specifically configured to annotate the position coordinates of the at least one visual target;
and the trajectory generation module is specifically configured to track the tracking visual target in every frame of the panoramic video according to the position coordinates and generate the motion trajectory sequence of the tracking visual target in the panoramic video.
In one possible implementation, the apparatus further includes a neural network training module, specifically configured to:
obtain panoramic images;
annotate each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions, the multiple dimensions including target category, motion state, person attributes, and saliency;
and train a multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
In one possible implementation, the apparatus further includes:
a response module, configured to respond to a clipping instruction specified by a user and obtain an object to be displayed and a video duration;
an obtaining module, configured to obtain multiple tracking visual targets that match the object to be displayed;
a selecting module, configured to select the corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as videos to be clipped;
and a cutting module, configured to cut segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
In one possible implementation, the evaluation module is specifically configured to determine, among all the visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
In one possible implementation, the evaluation module is specifically configured to select the corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of tracking visual targets reaches a preset number.
In one possible implementation, the marking module is specifically configured to annotate the object type of the at least one visual target;
and the evaluation module includes:
a scoring sub-module, configured to score the highlight level of the at least one visual target using the preset neural network model that scores the highlight level of objects in a video;
and an output sub-module, configured to output a tracking visual target that satisfies a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
In one possible implementation, the apparatus further includes:
an extraction module, configured to use the neural network model that scores the highlight level of objects in a video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to different visual targets;
and a tracking module, configured to track any visual target in every frame of the panoramic video when the highlight score of that visual target is greater than the highlight score of the tracking visual target.
In a third aspect, an embodiment of the present invention provides a terminal device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor can invoke the program instructions to perform the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method provided in the first aspect.
It should be understood that the second to fourth aspects of the embodiments of the present invention are consistent with the technical solution of the first aspect; the beneficial effects achieved by the aspects and their corresponding feasible implementations are similar and are not repeated here.
[Brief Description of the Drawings]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this specification, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flowchart of the steps of training a multi-layer neural network according to an embodiment of the present invention;
Figure 2 is a flowchart of the steps of the motion video generation method proposed by an embodiment of the present invention;
Figure 3 is a diagram of the model structure on which an embodiment of the present invention performs the motion video generation method;
Figure 4 is a schematic diagram of a key frame output by an example Detection model of the present invention;
Figure 5 is a schematic diagram of a motion trajectory sequence output by an example MOTracker model of the present invention;
Figure 6 is a diagram of another model structure on which an embodiment of the present invention performs another motion video generation method;
Figure 7 is a flowchart of the steps of another motion video generation method proposed by an embodiment of the present invention;
Figure 8 is a data flow diagram of an embodiment of the present invention performing the motion video generation method;
Figure 9 is a schematic diagram of a panoramic video key frame A in an example of the present invention;
Figure 10 is a functional module diagram of the motion video generation apparatus proposed by an embodiment of the present invention;
Figure 11 is a schematic structural diagram of an electronic terminal device provided by an embodiment of the present invention;
Figure 12 is a schematic structural diagram of a terminal device provided by an embodiment of this specification.
[Detailed Description]
For a better understanding of the technical solutions of this specification, the embodiments of the present invention are described in detail below with reference to the drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the scope of protection of this specification.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said", and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
To automatically evaluate the highlight level of viewing-angle objects in a panoramic video and select highlight objects to track and export as a 2D video, the inventors propose training a multi-layer neural network with panoramic images carrying highlight scores, obtaining a neural network model capable of evaluating the highlight level of visual targets in multiple respects such as target category, motion state, person attributes, and saliency. The applicant pre-trains a model capable of scoring the highlight level of objects in panoramic images.
Figure 1 is a flowchart of the steps of training a multi-layer neural network according to an embodiment of the present invention. As shown in Figure 1, the steps of training a multi-layer neural network to obtain a neural network model that evaluates the importance of a visual target in multiple respects, such as target category, motion state, person attributes, and saliency, and regresses its importance score are as follows:
S101: Obtain panoramic images.
Panoramic video may be captured for a certain area, and image frames of the panoramic video may be extracted as panoramic images. Panoramic images may also be photographed directly or obtained from a database.
S102: Annotate each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions; the multiple dimensions include target category, motion state, person attributes, and saliency.
When annotating composite scores, a detector may first detect the visual objects in a panoramic image and pre-annotate the detected objects with bbox boxes.
On the pre-annotated panoramic image, each visual target (proposal) is judged comprehensively in multiple respects such as target category, motion state, person attributes, and saliency, yielding the annotated composite score of each object in the panoramic image. A given visual target in the panoramic image is scored in multiple dimensions from several angles: whether it is complete and unoccluded, whether it is distinct and stands out, whether it is exciting, whether it is aesthetically pleasing, whether it is rare, and so on.
In one example of the present invention, to make evaluation easier for annotators, importance is discretized into four levels:
1. Ordinary everyday target.
2. Fairly exciting target.
3. Very exciting target.
4. Boring target.
These four levels are mapped to concrete quantized scores; for example, boring, ordinary, fairly exciting, and very exciting targets score -5, 0, 3, and 5 points respectively. Each panoramic image is annotated by multiple people, and the resulting scores are averaged with weights. For example, a given target is rated by five people; if the rating results are
ordinary, fairly exciting, fairly exciting, ordinary, fairly exciting,
then the corresponding scores are [0, 3, 3, 0, 3], and the weighted score of the target is 1.8 points.
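For illustration only, the following minimal Python sketch reproduces this aggregation step; the level-to-score mapping and the equal annotator weights are assumptions made for the example, not details fixed by the disclosure:

    # Hypothetical sketch: aggregate discrete annotator ratings into the
    # composite highlight score used as the annotation label.
    LEVEL_SCORES = {
        "boring": -5,          # assumed mapping of the four levels
        "ordinary": 0,
        "fairly_exciting": 3,
        "very_exciting": 5,
    }

    def composite_score(ratings, weights=None):
        """Weighted average of the quantized annotator scores."""
        scores = [LEVEL_SCORES[r] for r in ratings]
        if weights is None:                       # equal weights by default
            weights = [1.0] * len(scores)
        total = sum(w * s for w, s in zip(weights, scores))
        return total / sum(weights)

    ratings = ["ordinary", "fairly_exciting", "fairly_exciting",
               "ordinary", "fairly_exciting"]
    print(composite_score(ratings))               # prints 1.8, as above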
S103: Train the multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
During training, the composite scores annotated for the objects shown in the panoramic images are used as the supervision signal to train the multi-layer neural network in a supervised manner. The multi-layer neural network extracts the RGB features of the pixels corresponding to an object in the panoramic image, scores the object in multiple dimensions, and computes a loss between the multi-dimensional score and the annotated composite score; the parameters of the multi-layer neural network are adjusted according to the loss until the model converges, yielding the neural network (Ranking) model used to score the highlight level of objects in a video.
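As an illustration of this supervised regression, a compact PyTorch-style sketch follows; the backbone architecture, the scalar regression head, and the mean-squared-error loss are assumptions for the example, since the disclosure does not fix them:

    import torch
    import torch.nn as nn

    # Hypothetical Ranking model: a small CNN that regresses a highlight
    # score from a fixed-size RGB crop of one visual target.
    class RankingModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(64, 1)          # scalar highlight score

        def forward(self, crops):                 # crops: (N, 3, H, W)
            f = self.features(crops).flatten(1)
            return self.head(f).squeeze(1)

    model = RankingModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()                        # regress toward labels

    def train_step(crops, annotated_scores):
        """One supervised step: the annotated composite scores supervise
        the predicted highlight scores until the gap is small enough."""
        optimizer.zero_grad()
        loss = loss_fn(model(crops), annotated_scores)
        loss.backward()
        optimizer.step()
        return loss.item()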
Based on the pre-trained neural network (Ranking) model for scoring the highlight level of objects in a video, the applicant further proposes a technical solution that detects and marks the visual targets shown in a key frame of the panoramic video, feeds the equirectangular image of the marked key frame into the Ranking model, scores the highlight level of the at least one visual target, outputs a tracking visual target that satisfies a preset condition, obtains the tracking trajectory sequence of the tracking visual target, and generates a motion video of the tracking visual target.
Figure 2 is a flowchart of the steps of the motion video generation method proposed by an embodiment of the present invention, and Figure 3 is a diagram of the model structure on which the embodiment performs the method. As shown in Figure 3, the models on which the embodiment performs the motion video generation method include a Detection model, a Ranking model, and a MOTracker model.
As shown in Figures 2 and 3, the process of performing the motion video generation method includes:
S201: Mark at least one visual target in a key frame of the panoramic video with a target box; the key frame is any image frame in the panoramic video.
The key frame may be the first frame of the panoramic video or the image frame of the highest quality in the panoramic video.
In an embodiment of the present invention, the Detection model may perform step S201 to detect the visual targets in the key frame of the panoramic video. By analyzing the key frame, the Detection model detects the rectangular bounding boxes (bbox) of all objects in the panoramic video that belong to predefined categories and provides one or more visual targets to be tracked as candidate targets for the subsequent importance evaluation by the neural network model that scores the highlight level of objects in a video. The Detection model may use an industry-standard object detector, such as Faster RCNN, RetinaNet, or CentreNet, trained on panoramic annotated images.
The predefined categories are determined from user instructions; for example, if the user inputs an animal instruction, the animal category is determined to be a predefined category.
In one embodiment of the present invention, marking at least one visual target in the key frame of the panoramic video includes: detecting visual targets in the key frame that belong to predefined categories, such as buildings, people, pets, and scenery objects, and generating a target box (bbox) for each detected visual target to frame the multiple pixels corresponding to it.
Figure 4 is a schematic diagram of a key frame output by an example Detection model of the present invention. As shown in Figure 4, the Detection model detects the visual targets in the key frame and generates target boxes (bbox) to frame them; the pixels inside a box serve as the multiple pixels framed for the corresponding visual target.
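By way of illustration, a sketch of this detection step using an off-the-shelf torchvision Faster R-CNN as a stand-in for the patent's Detection model; a production model would instead be trained on panoramic annotated images as described above, and the score threshold here is an assumption:

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    # Stand-in detector; the disclosure's Detection model would be
    # trained on panoramic (equirectangular) annotated images.
    detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

    def detect_targets(keyframe, score_thresh=0.5):
        """keyframe: (3, H, W) float tensor in [0, 1].
        Returns candidate bboxes [x1, y1, x2, y2], labels, and scores."""
        with torch.no_grad():
            out = detector([keyframe])[0]
        keep = out["scores"] > score_thresh
        return out["boxes"][keep], out["labels"][keep], out["scores"][keep]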
S202: Using the neural network model that scores the highlight level of objects in a video, extract the RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluate the highlight level of each visual target according to its RGB features.
As shown in Figure 3, the neural network model that scores the highlight level of objects in a video (the Ranking model) is connected to the Detection model.
S203: Select at least one visual target as a tracking visual target according to the highlight evaluation result.
One implementation of selecting at least one visual target as a tracking visual target according to the highlight evaluation result includes:
determining, among all the visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
Another implementation of selecting at least one visual target as a tracking visual target according to the highlight evaluation result includes:
selecting the corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of tracking visual targets reaches a preset number.
A preset number m is set according to user needs; the visual targets are ranked by their highlight scores, and the m visual targets with the highest highlight scores are output as the tracking visual targets.
The marked key frame is input into the Ranking model. The Ranking model obtains the position of each visual target's bbox, fetches the RGB features of the pixels corresponding to the visual target according to the bbox coordinates, and scales the RGB features of each visual target's pixels to a uniform size. Based on the RGB features, the Ranking model predicts a highlight score for each visual target and outputs the tracking visual targets that satisfy the preset condition.
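A minimal sketch of this crop-resize-score-select pipeline follows, reusing the hypothetical RankingModel above; the crop size and the use of bilinear resizing are assumptions:

    import torch
    import torch.nn.functional as F

    def select_tracking_targets(frame, boxes, ranking_model, k=2, size=128):
        """frame: (3, H, W) tensor; boxes: (N, 4) [x1, y1, x2, y2].
        Crops each bbox, resizes the crops to a uniform size, scores
        them, and returns the Top-k boxes with their highlight scores."""
        crops = []
        for x1, y1, x2, y2 in boxes.int().tolist():
            crop = frame[:, y1:y2, x1:x2]          # RGB pixels in the bbox
            crops.append(F.interpolate(crop[None], size=(size, size),
                                       mode="bilinear",
                                       align_corners=False)[0])
        crops = torch.stack(crops)                 # uniform size
        with torch.no_grad():
            scores = ranking_model(crops)          # highlight scores
        top = torch.topk(scores, k=min(k, len(scores)))
        return boxes[top.indices], top.values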
In one embodiment of the present invention, the preset condition is the Top-k visual targets with the highest highlight scores. The Ranking model ranks the visual targets by their highlight scores and passes the Top-k highest-scoring visual targets to the MOTracker model. The MOTracker model tracks the Top-k visual targets and exports the image regions that the tracking visual targets occupy in each frame of the panoramic video to obtain the motion videos of the tracking visual targets.
The process by which the MOTracker model tracks the Top-k visual targets, exports the image regions occupied by the tracking visual targets in each frame of the panoramic video, and obtains the motion videos of the tracking visual targets is as follows:
The MOTracker model receives the Top-k target object bboxes output by the Ranking model and tracks them with an open-source deep tracking model or a traditional tracking algorithm. For example, multi-target tracking can be achieved with multiple instances of open-source single-target trackers such as STAPLE or LightTrack, or with a single multi-target tracking algorithm such as FairMOT. Tracking each bbox generates a viewing-angle motion trajectory sequence, which is saved to an offline file, for example a json file, thereby structuring the panoramic video data.
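For illustration, a sketch of the offline serialization; the json schema below is an assumption, since the disclosure only states that the trajectory sequences are saved to a json file:

    import json

    def save_trajectories(trajectories, path="trajectories.json"):
        """trajectories: {target_id: [(frame_idx, x1, y1, x2, y2), ...]}."""
        data = {
            str(tid): [{"frame": f, "bbox": [x1, y1, x2, y2]}
                       for f, x1, y1, x2, y2 in seq]
            for tid, seq in trajectories.items()
        }
        with open(path, "w") as fp:
            json.dump(data, fp, indent=2)

    # e.g. one target tracked over two frames:
    save_trajectories({0: [(0, 10, 20, 110, 220), (1, 12, 21, 112, 221)]})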
S204: Track the tracking visual target in every frame of the panoramic video and generate a motion trajectory sequence of the tracking visual target in the panoramic video.
S205: According to the motion trajectory sequence, project the image region occupied by the target box of the tracking visual target in each frame of the panoramic video into a planar image to obtain a motion video of the tracking visual target.
The planar image may be an image displayed on a plane perpendicular to the user's line of sight.
Once the motion trajectory sequence of a visual target's bbox is obtained, a panoramic projection algorithm can project any trajectory, generating a 2D motion trajectory video for each viewing angle and thereby achieving automatic clipping of the panoramic video.
In one embodiment of the present invention, the Detection model may also output the position coordinates of the visual targets.
S201 includes sub-step S2011: annotating the position coordinates of the at least one visual target.
Tracking the tracking visual target in every frame of the panoramic video to generate its motion trajectory sequence in the panoramic video includes: tracking the tracking visual target in every frame of the panoramic video according to the position coordinates, and generating the motion trajectory sequence of the tracking visual target in the panoramic video.
Figure 5 is a schematic diagram of a motion trajectory sequence output by an example MOTracker model of the present invention.
Tracking the tracking visual target in every frame of the panoramic video and generating its motion trajectory sequence includes sub-steps S2031 to S2033.
S2031: Track the tracking visual target in the panoramic video according to its position coordinates to obtain multiple target frames that fully show the tracking visual target;
S2032: Connect the positions of the tracking visual target in the multiple target frames to obtain the motion trajectory sequence of the tracking visual target;
S2033: According to the motion trajectory sequence, project the image region occupied by the tracking visual target in each frame of the panoramic video onto a plane perpendicular to the user's line of sight to obtain the motion video of the tracking visual target.
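As an illustrative sketch of such a projection (the disclosure names no specific panoramic projection algorithm), the following renders a pinhole view of an equirectangular frame centered on the tracked target; the spherical conventions and nearest-neighbor sampling are assumptions:

    import numpy as np

    def project_view(pano, lon0, lat0, fov_deg=90.0, out_w=640, out_h=360):
        """pano: (H, W, 3) equirectangular frame; lon0, lat0 in radians
        give the view center, i.e. the tracked target's direction."""
        H, W = pano.shape[:2]
        f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))   # focal length
        xs = np.arange(out_w) - 0.5 * (out_w - 1)
        ys = np.arange(out_h) - 0.5 * (out_h - 1)
        xv, yv = np.meshgrid(xs, ys)
        d = np.stack([xv, yv, np.full_like(xv, f)], axis=-1)  # camera rays
        d /= np.linalg.norm(d, axis=-1, keepdims=True)
        cl, sl = np.cos(lat0), np.sin(lat0)
        co, so = np.cos(lon0), np.sin(lon0)
        Rx = np.array([[1, 0, 0], [0, cl, sl], [0, -sl, cl]])  # pitch
        Ry = np.array([[co, 0, so], [0, 1, 0], [-so, 0, co]])  # yaw
        d = d @ (Ry @ Rx).T                    # rotate rays to the target
        lon = np.arctan2(d[..., 0], d[..., 2])
        lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
        u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
        v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
        return pano[v, u]                      # sampled planar image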
In an embodiment of the present invention, the Detection model may perform step S201 to detect the visual targets in the key frame of the panoramic video. By analyzing the key frame, the Detection model detects the rectangular bounding boxes (bbox) of all objects in the panoramic video that belong to predefined categories and provides one or more visual targets to be tracked as candidates, both for the subsequent importance evaluation by the neural network model that scores the highlight level of objects in a video and for tracking the visual targets. The Detection model may use an industry-standard object detector such as Faster RCNN, RetinaNet, or CentreNet, trained on panoramic annotated images.
Figure 6 is a diagram of another model structure on which an embodiment of the present invention performs another motion video generation method. As shown in Figure 6, the models include a Detection model, a Ranking model, a MOTracker model, and an AutoEditor model.
The clipping model (AutoEditor model) analyzes the tracking sequences of multiple panoramic videos, ranks each tracking sequence by highlight score, and selects the Top-P highest-scoring sequences as the target sequences to be clipped. The target sequences vary in length and can be clipped according to the duration of the user's template or a set duration.
In one example of the present invention, for a tracking sequence of 10 s, if the template requires 3 s, a heuristic search algorithm finds the highest-scoring 3 s segment for clipping. After clipping, the Top-P sequences yield P video segments whose durations satisfy the requirement, and the final video compilation is obtained by concatenating them.
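Purely as an illustration of such a search (the disclosure says only that a heuristic search picks the best-scoring segment), the following sliding-window sketch assumes per-frame highlight scores are available for the tracking sequence:

    def best_segment(frame_scores, clip_len):
        """Return (start, end) of the contiguous window of `clip_len`
        frames with the highest total highlight score."""
        window = sum(frame_scores[:clip_len])
        best, best_start = window, 0
        for start in range(1, len(frame_scores) - clip_len + 1):
            window += (frame_scores[start + clip_len - 1]
                       - frame_scores[start - 1])
            if window > best:
                best, best_start = window, start
        return best_start, best_start + clip_len

    # e.g. a 10 s sequence at 30 fps clipped to its best 3 s (90 frames):
    # start, end = best_segment(scores, 90)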
Figure 7 is a flowchart of the steps of another motion video generation method proposed by an embodiment of the present invention. As shown in Figures 6 and 7, the steps of this method include:
S701: Mark multiple visual targets in a key frame of the panoramic video with target boxes; the key frame is any image frame in the panoramic video.
S702: Using the neural network model that scores the highlight level of objects in a video, extract the RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and compute the highlight score of each visual target according to its RGB features.
S703: Using the neural network model, select the corresponding visual targets in descending order of highlight score as tracking visual targets until the number of tracking visual targets reaches a preset number.
S704: Track the tracking visual targets in every frame of the panoramic video and generate the motion trajectory sequences of the tracking visual targets in the panoramic video.
S705: According to the motion trajectory sequences, project the image regions occupied by the tracking visual targets in each frame of the panoramic video onto a plane perpendicular to the user's line of sight to obtain the motion videos of the tracking visual targets.
S706: In response to a clipping instruction specified by the user, obtain the object to be displayed and the video duration.
S707: Obtain multiple tracking visual targets that match the object to be displayed.
S708: Select the corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as the videos to be clipped.
S709: Cut segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
In yet another embodiment of the present invention, the threshold for a specified category is adjusted according to a category instruction input by the user. For example, if the user inputs the category instruction "pet", the neural network model that scores the highlight level of objects in a video can adjust the threshold for pet visual targets when outputting tracking visual targets. Suppose the lowest score among the Top-K visual targets by highlight score is M1, and the highlight score of pet-category visual target A is M2 with M2 < M1; pet-category visual target A may still be selected as a tracking visual target.
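A hypothetical sketch of this category-preference rule follows; the disclosure states only that a preferred-category target with M2 < M1 may still be selected, so the tolerance margin used here is entirely an assumption:

    def pick_targets(candidates, k, preferred=None, margin=2.0):
        """candidates: list of (target_id, category, score) tuples.
        Takes the Top-k by score, then admits a preferred-category
        target even if its score M2 is below the Top-k cutoff M1."""
        ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
        chosen = ranked[:k]
        cutoff = chosen[-1][2]                  # M1: lowest Top-k score
        if preferred is not None:
            for tid, cat, score in ranked[k:]:
                if cat == preferred and score >= cutoff - margin:
                    chosen.append((tid, cat, score))
        return chosen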
Based on the above technical solution, step S201 further includes sub-step S201-1.
S201-1: Annotate the object type of the at least one visual target.
Step S202 includes sub-steps S202-1 and S202-2.
S202-1: Score the highlight level of the at least one visual target using the preset neural network model that scores the highlight level of objects in a video;
S202-2: Output a tracking visual target that satisfies a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
In response to a demand type input by the user, when the object type annotated on a visual target matches the input demand type, that visual target is preferentially output as the tracking visual target.
For example, the visual targets of the key frame in the panoramic video include a building visual target and a pet visual target; the demand type input by the user is the building visual target; the building visual target and the pet visual target have the same highlight level; the building visual target is output as the tracking visual target.
Figure 8 is a data flow diagram of an embodiment of the present invention performing the motion video generation method, and Figure 9 is a schematic diagram of a panoramic video key frame A in an example of the present invention. As shown in Figures 8 and 9, an example of the present invention performs the motion video generation method as follows:
K11: Key frame A of the panoramic video (a panoramic picture) is input into the panoramic detector (Detection model). The detector marks the visual targets in the picture and outputs bbox-1, bbox-2, and bbox-3 together with their position coordinates, where the object type of bbox-1 is a building, that of bbox-2 is a person, and that of bbox-3 is a utility pole.
K12: bbox-1, bbox-2, and bbox-3 are resized to the same size. According to the position coordinates of bbox-1, bbox-2, and bbox-3, the RGB features of the pixels in the uniformly sized bbox-1, bbox-2, and bbox-3 are extracted.
K13: The RGB features are input into the Ranking model, which scores the highlight level of bbox-1, bbox-2, and bbox-3 and outputs the highest-scoring bbox-1.
K14: The MOTracker model tracks bbox-1 and obtains the motion trajectory sequence of the visual target corresponding to bbox-1.
K15: Finally, suitable FOV parameters are chosen to render the viewing-angle motion trajectory sequence and produce a 2D video of it. The FOV parameters can be adapted to the position and size of the visual target; for example, a larger FOV may be used to render a visual target with a large height or width, and a smaller FOV may be chosen for a smaller one.
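To illustrate K15, a small sketch of one possible FOV adaptation; the FOV bounds and the linear mapping from relative target size are assumptions, as the disclosure states only that larger targets get a larger FOV:

    def adaptive_fov(bbox_w, bbox_h, frame_w, frame_h,
                     min_fov=60.0, max_fov=110.0):
        """Map the target's relative size to a field of view in degrees;
        the larger side of the bbox dominates."""
        rel = max(bbox_w / frame_w, bbox_h / frame_h)
        rel = min(max(rel, 0.0), 1.0)
        return min_fov + rel * (max_fov - min_fov)

    # e.g. render each trajectory frame with
    # project_view(pano, lon, lat, fov_deg=adaptive_fov(w, h, W, H))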
After the tracking visual target has been tracked in every frame of the panoramic video, the method further includes: using the neural network model that scores the highlight level of objects in a video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to other visual targets; when the highlight score of any such visual target is greater than the highlight score of the tracking visual target, tracking that visual target in every frame of the panoramic video.
One example of the present invention provides an implementation of a panoramic video export method: the first panoramic frame of the panoramic video is taken as the key frame, the visual targets of the first frame are marked, and the marked first frame is input into the neural network model that scores the highlight level of objects in a video; the model detects the two tracking visual targets with the highest highlight scores, animal A and human B. Animal A and human B are tracked in the second panoramic frame of the video; the visual targets of the second frame are marked, and the marked second frame is input into the model, which detects the second frame and outputs the two tracking visual targets with the highest highlight scores: animal A and animal C. Animal A, human B, and animal C are then tracked in the panoramic video.
Figure 10 is a functional module diagram of the motion video generation apparatus proposed by an embodiment of the present invention. The apparatus is arranged in a terminal device and, as shown in Figure 10, includes:
a marking module 10, configured to mark at least one visual target in a key frame of a panoramic video with a target box, the key frame being any image frame in the panoramic video;
an evaluation module 11, configured to use a neural network model that scores the highlight level of objects in a video, extract the RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluate the highlight level of each visual target according to its RGB features;
a selection module 12, configured to select at least one visual target as a tracking visual target according to the highlight evaluation result;
a trajectory generation module 13, configured to track the tracking visual target in every frame of the panoramic video and generate a motion trajectory sequence of the tracking visual target in the panoramic video;
and a projection module 14, configured to project, according to the motion trajectory sequence, the image region occupied by the target box of the tracking visual target in each frame of the panoramic video into a planar image to obtain a motion video of the tracking visual target.
The motion video generation apparatus provided by the embodiment shown in Figure 10 can be used to implement the technical solutions of the method embodiments shown in Figures 1 to 9 of this specification; for its implementation principles and technical effects, reference may further be made to the related descriptions in the method embodiments.
Optionally, the marking module is specifically configured to annotate the position coordinates of the at least one visual target;
and the trajectory generation module is specifically configured to track the tracking visual target in every frame of the panoramic video according to the position coordinates and generate the motion trajectory sequence of the tracking visual target in the panoramic video.
Optionally, the apparatus further includes a neural network training module, specifically configured to:
obtain panoramic images;
annotate each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions, the multiple dimensions including target category, motion state, person attributes, and saliency;
and train a multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
Optionally, the apparatus further includes:
a response module, configured to respond to a clipping instruction specified by a user and obtain an object to be displayed and a video duration;
an obtaining module, configured to obtain multiple tracking visual targets that match the object to be displayed;
a selecting module, configured to select the corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as videos to be clipped;
and a cutting module, configured to cut segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
Optionally, the evaluation module is specifically configured to determine, among all the visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
Optionally, the evaluation module is specifically configured to select the corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of tracking visual targets reaches a preset number.
Optionally, the marking module is specifically configured to annotate the object type of the at least one visual target;
and the evaluation module includes:
a scoring sub-module, configured to score the highlight level of the at least one visual target using the preset neural network model that scores the highlight level of objects in a video;
and an output sub-module, configured to output a tracking visual target that satisfies a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
Optionally, the apparatus further includes:
an extraction module, configured to use the neural network model that scores the highlight level of objects in a video to extract, from the image frames in which the tracking visual target is tracked, the RGB features of the pixels corresponding to different visual targets;
and a tracking module, configured to track any visual target in every frame of the panoramic video when the highlight score of that visual target is greater than the highlight score of the tracking visual target.
The apparatus provided by the embodiments shown above is used to implement the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may further be made to the related descriptions in the method embodiments, which are not repeated here.
The apparatus provided by the embodiments shown above may be, for example, a chip or a chip module. It is used to implement the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may further be made to the related descriptions in the method embodiments, which are not repeated here.
Each module/unit included in each apparatus described in the above embodiments may be a software module/unit or a hardware module/unit, or may be partly a software module/unit and partly a hardware module/unit. For example, for each apparatus applied to or integrated in a chip, its modules/units may all be implemented in hardware such as circuits, or at least some of them may be implemented as software programs running on a processor integrated inside the chip, with the remaining modules/units implemented in hardware such as circuits. For each apparatus applied to or integrated in a chip module, its modules/units may all be implemented in hardware such as circuits, and different modules/units may be located in the same component of the chip module (such as a chip or circuit module) or in different components; or at least some modules/units may be implemented as software programs running on a processor integrated inside the chip module, with the remaining modules/units implemented in hardware such as circuits. For each apparatus applied to or integrated in an electronic terminal device, its modules/units may all be implemented in hardware such as circuits, and different modules/units may be located in the same component (such as a chip or circuit module) of the electronic terminal device or in different components; or at least some modules/units may be implemented as software programs running on a processor integrated inside the electronic terminal device, with the remaining modules/units, if any, implemented in hardware such as circuits.
Figure 11 is a schematic structural diagram of an electronic terminal device provided by an embodiment of the present invention. The electronic terminal device 1100 includes a processor 1110, a memory 1111, and a computer program stored in the memory 1111 and runnable on the processor 1110; when the processor 1110 executes the program, the steps in the foregoing method embodiments are implemented. The electronic terminal device provided by this embodiment can be used to implement the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may further be made to the related descriptions in the method embodiments, which are not repeated here.
Figure 12 is a schematic structural diagram of a terminal device provided by an embodiment of this specification. As shown in Figure 12, the terminal device may include at least one processor and at least one memory communicatively connected to the processor, where the memory stores program instructions executable by the processor, and the processor can invoke the program instructions to perform the motion video generation method provided by the embodiments shown in Figures 1 to 9 of this specification.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the terminal device 100. In other embodiments of the present invention, the terminal device 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
As shown in Figure 12, the terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a mobile communication module 150, a wireless communication module 160, an indicator 192, a camera 193, a display screen 194, and so on.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. The different processing units may be independent devices or may be integrated in one or more processors.
The controller may generate operation control signals according to instruction operation codes and timing signals, controlling instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs the instructions or data again, it can call them directly from this memory, avoiding repeated accesses, reducing the waiting time of the processor 110, and thus improving the efficiency of the system.
By running the programs stored in the internal memory 121, the processor 110 executes various functional applications and data processing, for example implementing the motion video generation method provided by the embodiments shown in Figures 1 to 9 of the present invention.
The wireless communication function of the terminal device 100 may be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.
Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the terminal device 100 may be used to cover a single communication band or multiple communication bands. Different antennas may also be multiplexed to improve antenna utilization; for example, antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The terminal device 100 implements the display function through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing that connects the display screen 194 and the application processor; it performs mathematical and geometric computations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel, which may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light emitting diodes (QLED), and so on. In some embodiments, the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The terminal device 100 may implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter opens, light is transmitted through the lens onto the camera's photosensitive element, the optical signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP for processing, converting it into an image visible to the naked eye. The ISP can also run algorithmic optimizations on image noise, brightness, and skin tone, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and passes it to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing, and the DSP converts it into an image signal in a standard format such as RGB or YUV. In some embodiments, the terminal device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
The digital signal processor is used to process digital signals; besides digital image signals, it can process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor performs a Fourier transform or similar operation on the frequency-point energy.
The video codec is used to compress or decompress digital video. The terminal device 100 may support one or more video codecs, so that it can play or record video in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The internal memory 121 may be used to store computer-executable program code, the executable program code including instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and so on. The data storage area may store data created during the use of the terminal device 100 (such as audio data and a phone book). In addition, the internal memory 121 may include a high-speed random access memory and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS). By running the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor, the processor 110 executes the various functional applications and data processing of the terminal device 100.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the motion video generation method provided by the embodiments shown in Figures 1 to 9 of this specification. A non-transitory computer-readable storage medium may be a non-volatile computer storage medium.
The above non-transitory computer-readable storage medium may use any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more conductors, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), and so on, or any suitable combination of the above.
Computer program code for performing the operations of this specification may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
In the description of the embodiments of the present invention, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of this specification. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, a person skilled in the art may combine the different embodiments or examples described in this specification and the features of the different embodiments or examples, provided they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of this specification, "plurality" means at least two, such as two or three, unless otherwise clearly and specifically limited.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing custom logical functions or steps of a process, and the scope of the preferred implementations of this specification includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this specification belong.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)".
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, and so on.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division of the units is only a logical function division, and there may be other division methods in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the various embodiments of this specification may be integrated in one processing unit, or the units may exist alone physically, or two or more units may be integrated in one unit. The above integrated unit may be implemented in the form of hardware or in the form of hardware plus a software functional unit.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The above software functional unit is stored in a storage medium and includes several instructions to cause a computer apparatus (which may be a personal computer, a server, a network apparatus, or the like) or a processor to execute some of the steps of the methods described in the various embodiments of this specification. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of this specification and are not intended to limit this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall be included within the scope of protection of this specification.

Claims (18)

  1. A motion video generation method, characterized in that the method comprises:
    marking at least one visual target in a key frame of a panoramic video with a target box; the key frame being any image frame in the panoramic video;
    using a neural network model that scores the highlight level of objects in a video, extracting RGB features of the pixels corresponding to each visual target based on the target boxes resized to a uniform size, and evaluating the highlight level of each visual target according to the RGB features corresponding to each visual target;
    selecting at least one visual target as a tracking visual target according to the highlight evaluation result;
    tracking the tracking visual target in every frame of the panoramic video to generate a motion trajectory sequence of the tracking visual target in the panoramic video;
    according to the motion trajectory sequence, projecting the image region occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image to obtain a motion video of the tracking visual target.
  2. The method according to claim 1, characterized in that marking at least one visual target in a key frame of the panoramic video with a target box comprises:
    annotating position coordinates of the at least one visual target;
    and tracking the tracking visual target in every frame of the panoramic video to generate the motion trajectory sequence of the tracking visual target in the panoramic video comprises:
    tracking the tracking visual target in every frame of the panoramic video according to the position coordinates to generate the motion trajectory sequence of the tracking visual target in the panoramic video.
  3. The method according to claim 1, characterized in that the neural network model that scores the highlight level of objects in a video is set up as follows:
    obtaining panoramic images;
    annotating each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions; the multiple dimensions comprising: target category, motion state, person attributes, and saliency;
    training a multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and using the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
  4. The method according to claim 1, characterized in that the method further comprises:
    in response to a clipping instruction specified by a user, obtaining an object to be displayed and a video duration;
    obtaining multiple tracking visual targets matching the object to be displayed;
    selecting corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as videos to be clipped;
    cutting segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
  5. The method according to claim 1, characterized in that selecting at least one visual target as a tracking visual target according to the highlight evaluation result comprises:
    determining, among all visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
  6. The method according to claim 1, characterized in that selecting at least one visual target as a tracking visual target according to the highlight evaluation result comprises:
    selecting corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of the tracking visual targets reaches a preset number.
  7. The method according to claim 1, characterized in that marking at least one visual target in a key frame of the panoramic video with a target box comprises:
    annotating an object type of the at least one visual target;
    and using the neural network model that scores the highlight level of objects in a video, extracting RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluating the highlight level of each visual target according to the RGB features comprises:
    scoring the highlight level of the at least one visual target using the neural network model that scores the highlight level of objects in a video;
    outputting a tracking visual target satisfying a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
  8. The method according to claim 1, characterized in that after the tracking visual target is tracked in every frame of the panoramic video, the method further comprises:
    using the neural network model that scores the highlight level of objects in a video to extract, from image frames in which the tracking visual target is tracked, RGB features of pixels corresponding to visual targets other than the tracking visual target;
    when the highlight score of any such visual target is greater than the highlight score of the tracking visual target, tracking that visual target in every frame of the panoramic video.
  9. A motion video generation apparatus, characterized in that the apparatus comprises:
    a marking module, configured to mark at least one visual target in a key frame of a panoramic video with a target box; the key frame being any image frame in the panoramic video;
    an evaluation module, configured to use a neural network model that scores the highlight level of objects in a video, extract RGB features of the pixels corresponding to each visual target based on the uniformly resized target boxes, and evaluate the highlight level of each visual target according to its RGB features;
    a selection module, configured to select at least one visual target as a tracking visual target according to the highlight evaluation result;
    a trajectory generation module, configured to track the tracking visual target in every frame of the panoramic video and generate a motion trajectory sequence of the tracking visual target in the panoramic video;
    a projection module, configured to project, according to the motion trajectory sequence, the image region occupied by the target box corresponding to the tracking visual target in each frame of the panoramic video into a planar image to obtain a motion video of the tracking visual target.
  10. The apparatus according to claim 9, characterized in that the marking module is specifically configured to annotate position coordinates of the at least one visual target;
    and the trajectory generation module is specifically configured to track the tracking visual target in every frame of the panoramic video according to the position coordinates and generate the motion trajectory sequence of the tracking visual target in the panoramic video.
  11. The apparatus according to claim 9, characterized in that the apparatus further comprises a neural network training module, specifically configured to:
    obtain panoramic images;
    annotate each object in the panoramic images with a composite score according to evaluation criteria for highlight level in multiple dimensions; the multiple dimensions comprising: target category, motion state, person attributes, and saliency;
    train a multi-layer neural network multiple times with the annotated panoramic images until the difference between the highlight score output by the multi-layer neural network for an object and the corresponding annotated composite score is smaller than a preset threshold, and use the trained multi-layer neural network as the neural network model that scores the highlight level of objects in a video.
  12. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    a response module, configured to respond to a clipping instruction specified by a user and obtain an object to be displayed and a video duration;
    an obtaining module, configured to obtain multiple tracking visual targets matching the object to be displayed;
    a selecting module, configured to select corresponding target motion videos, in order of the highlight scores of the motion videos of the multiple tracking visual targets, as videos to be clipped;
    a cutting module, configured to cut segments matching the video duration from the videos to be clipped to obtain the motion video of the user-specified display object.
  13. The apparatus according to claim 9, characterized in that the evaluation module is specifically configured to determine, among all visual targets of the key frame, the visual target with the highest highlight score as the tracking visual target.
  14. The apparatus according to claim 9, characterized in that the evaluation module is specifically configured to select corresponding visual targets in descending order of highlight score as the tracking visual targets until the number of the tracking visual targets reaches a preset number.
  15. The apparatus according to claim 9, characterized in that the marking module is specifically configured to annotate an object type of the at least one visual target;
    and the evaluation module comprises:
    a scoring sub-module, configured to score the highlight level of the at least one visual target using the neural network model that scores the highlight level of objects in a video;
    an output sub-module, configured to output a tracking visual target satisfying a preset condition according to the highlight score of the at least one visual target and the type of the at least one visual target.
  16. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    an extraction module, configured to use the neural network model that scores the highlight level of objects in a video to extract, from image frames in which the tracking visual target is tracked, RGB features of pixels corresponding to different visual targets;
    a tracking module, configured to track any visual target in every frame of the panoramic video when the highlight score of that visual target is greater than the highlight score of the tracking visual target.
  17. A terminal device, comprising:
    at least one processor; and
    at least one memory communicatively connected to the processor, characterized in that
    the memory stores program instructions executable by the processor, and the processor can invoke the program instructions to perform the method according to any one of claims 1 to 8.
  18. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions cause a computer to perform the method according to any one of claims 1 to 8.
PCT/CN2023/083187 2022-03-25 2023-03-22 Motion video generation method and apparatus, terminal device, and storage medium WO2023179692A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210305584.8 2022-03-25
CN202210305584.8A CN116862946A (zh) 2022-03-25 2022-03-25 Motion video generation method and apparatus, terminal device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023179692A1 true WO2023179692A1 (zh) 2023-09-28

Family

ID=88100097

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083187 WO2023179692A1 (zh) 2022-03-25 2023-03-22 Motion video generation method and apparatus, terminal device, and storage medium

Country Status (2)

Country Link
CN (1) CN116862946A (zh)
WO (1) WO2023179692A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886998A (zh) * 2019-01-23 2019-06-14 平安科技(深圳)有限公司 Multi-target tracking method and apparatus, computer device, and computer storage medium
CN111182218A (zh) * 2020-01-07 2020-05-19 影石创新科技股份有限公司 Panoramic video processing method, apparatus, device, and storage medium
CN112241982A (zh) * 2019-07-18 2021-01-19 杭州海康威视数字技术股份有限公司 Image processing method and apparatus, and machine-readable storage medium
US20220059133A1 (en) * 2020-08-19 2022-02-24 Qnap Systems, Inc. Intelligent video editing method and system


Also Published As

Publication number Publication date
CN116862946A (zh) 2023-10-10

Similar Documents

Publication Publication Date Title
CN109635621B (zh) 用于第一人称视角中基于深度学习识别手势的系统和方法
US11995556B2 (en) Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
US11941883B2 (en) Video classification method, model training method, device, and storage medium
WO2020119350A1 (zh) 视频分类方法、装置、计算机设备和存储介质
US9436875B2 (en) Method and apparatus for semantic extraction and video remix creation
WO2019134516A1 (zh) 全景图像生成方法、装置、存储介质及电子设备
CN106575361B (zh) 提供视觉声像的方法和实现该方法的电子设备
CN113395542B (zh) 基于人工智能的视频生成方法、装置、计算机设备及介质
WO2021190078A1 (zh) 短视频的生成方法、装置、相关设备及介质
JP2022523606A (ja) 動画解析のためのゲーティングモデル
WO2023125335A1 (zh) 问答对生成的方法和电子设备
WO2020056903A1 (zh) 用于生成信息的方法和装置
US8457407B2 (en) Electronic apparatus and image display method
CN111491187B (zh) 视频的推荐方法、装置、设备及存储介质
US20120030711A1 (en) Method or system to predict media content preferences
WO2020052062A1 (zh) 检测方法和装置
EP4273684A1 (en) Photographing method and electronic device
WO2023179692A1 (zh) Motion video generation method and apparatus, terminal device, and storage medium
CN113792174B (zh) Picture display method and apparatus, terminal device, and storage medium
WO2022206605A1 (zh) Method for determining a target object, and photographing method and apparatus
CN109640164A (zh) Playback method and apparatus for use among multiple virtual reality devices
KR20140033667A (ko) 객체 기반 동영상 편집 장치 및 방법
WO2020154883A1 (zh) Speech information processing method and apparatus, storage medium, and electronic device
Fukusato et al. Computational cartoonist: A comic-style video summarization system for anime films
CN111507421A (zh) Video-based emotion recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23773939

Country of ref document: EP

Kind code of ref document: A1