CN114979652A - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN114979652A
Authority
CN
China
Prior art keywords
historical
saliency map
video
view
video content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210553339.9A
Other languages
Chinese (zh)
Inventor
孙黎阳
张傲阳
何伟
马茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202210553339.9A
Publication of CN114979652A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167: Position within a video image, e.g. region of interest [ROI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157: Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159: Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of the disclosure provide a video processing method and apparatus, an electronic device, and a storage medium. The method includes: determining a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to the historical view saliency map, wherein the historical time slice is the time slice immediately preceding the predicted time slice; inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice; and determining a target video within the predicted time slice based on the predicted video content saliency map. The technical solution predicts the user's subsequent viewing angle, makes it convenient to provide the corresponding target video for the user according to the prediction result, and improves the user's viewing experience.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of image processing, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
At present, virtual reality technology and its related applications are developing very rapidly, and as an important component of virtual reality technology, the transmission and processing of panoramic video is a critical link in virtual reality applications.
In the prior art, when a panoramic video is transmitted using a conventional video coding approach, every part of the panoramic picture has to be coded at the same or similar quality to generate a single code stream that is then transmitted to the user side. As a result, the existing panoramic video coding and transmission approach has large redundancy, the video transmission speed is slow, and the user's viewing experience is poor.
Disclosure of Invention
The present disclosure provides a video processing method and apparatus, an electronic device, and a storage medium, which predict the user's subsequent viewing angle, make it convenient to provide the corresponding target video for the user according to the prediction result, and improve the user's viewing experience.
In a first aspect, an embodiment of the present disclosure provides a video processing method, including:
determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice immediately preceding the predicted time slice;
inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice;
determining a target video within the predicted time slice based on the predicted video content saliency map.
In a second aspect, an embodiment of the present disclosure further provides a video processing apparatus, including:
a historical view saliency map determination module, used for determining a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice immediately preceding the predicted time slice;
a predicted video content saliency map determination module, used for inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice;
a target video determination module, used for determining a target video within the predicted time slice based on the predicted video content saliency map.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method according to any embodiment of the present disclosure.
In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the video processing method according to any of the disclosed embodiments.
According to the technical solutions of the embodiments of the disclosure, a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to it are determined; the historical view saliency map, the historical video content saliency map and the current view saliency map are then input into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice; finally, the target video within the predicted time slice is determined based on the predicted video content saliency map. The user's subsequent viewing angle is thus predicted with a machine learning model, the corresponding target video is provided for the user according to the prediction result, and the user's viewing experience is improved. At the same time, the influence of the video content on the user's viewing angle is taken into account in the prediction process, which improves prediction accuracy.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a video processing method according to a first embodiment of the disclosure;
fig. 2 is a schematic diagram of determining a predicted view saliency map based on a target view prediction model according to a first embodiment of the present disclosure;
fig. 3 is a schematic view of a video processing flow provided in an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a video processing apparatus according to a second embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units. It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Before introducing the technical solution, an application scenario of the embodiments of the present disclosure is described by way of example. When a user watches a panoramic video transmitted by a server using an Augmented Reality (AR) device or a Virtual Reality (VR) device, only part of the panoramic picture is visible at any time. If the server codes every part of the panoramic video at the same or similar quality, the transmitted panoramic video occupies a large bandwidth and the data transmission speed is slow, so the user may not be able to see the region of interest in the panoramic video in real time during subsequent viewing. Based on the scheme of the embodiments of the present disclosure, the server or the client determines the distribution of the user's viewing angles within the predicted time slice according to the historical video content watched by the user, and then determines the corresponding target video based on the predicted viewing-angle distribution. This can be understood as transmitting the different parts of the panoramic picture to the client at differentiated code rates, so that the user can smoothly watch the part of the panoramic video of interest in the subsequent process, thereby guaranteeing the user's viewing experience.
Example one
Fig. 1 is a schematic flow chart of a video processing method provided in an embodiment of the present disclosure. The embodiment is suitable for predicting, while a user watches a panoramic video, the user's viewing angle in the next time period, so as to provide the corresponding target video for the user. The method may be executed by a video processing apparatus, and the apparatus may be implemented in the form of software and/or hardware, optionally by an electronic device, where the electronic device may be a mobile terminal, a PC, or a server.
As shown in fig. 1, the method includes:
S110, determining a historical view saliency map of the historical time slice and the historical video content saliency map corresponding to the historical view saliency map.
When a user watches a panoramic video with an AR device or a VR device, the time slices corresponding to the panoramic video need to be determined first. A panoramic video is a video obtained by shooting 360 degrees in all directions with a 3D camera; when a user watches it with a corresponding device, the picture of the corresponding region in the panoramic video can be watched by changing the viewing angle. In this embodiment, for the panoramic video transmitted to the user side, the corresponding time slices may be determined according to the length of the prediction time or a preset division rule, and each time slice corresponds to at least one video frame; for example, when the prediction time length is one second, a time slice corresponds to the panoramic video frames within one second of the panoramic video. On this basis, when predicting the user's subsequent viewing angle, the time period to be predicted is the predicted time slice of the panoramic video, and the preceding time period is, relative to the predicted time period, the historical time slice of the panoramic video.
For example, when the prediction time length is one second and the user watches the panoramic video through an AR device or a VR device, the second before the current moment is the historical time slice; correspondingly, the server or the client may predict the user's viewing angle within one second after that moment, and that second is the predicted time slice. When the prediction time length is ten seconds, the ten seconds before the current moment are the historical time slice, and the server or the client may predict the user's viewing angle within the ten seconds after that moment, which are the predicted time slice. Of course, in practical application, the prediction time length can be set according to the actual situation, and the duration of the time slices can be adjusted accordingly, which is not described in detail here; a simple slicing sketch is given below.
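As an illustration only, the following sketch shows how a panoramic video's frames might be grouped into fixed-length time slices whose duration equals the prediction length; the function and parameter names (frames, frame_rate, slice_seconds) are hypothetical and not taken from this disclosure.
```python
# Hypothetical sketch: group panoramic video frames into consecutive time
# slices whose duration equals the prediction length described above.
def split_into_time_slices(frames, frame_rate=30, slice_seconds=1.0):
    frames_per_slice = int(frame_rate * slice_seconds)
    return [frames[i:i + frames_per_slice]
            for i in range(0, len(frames), frames_per_slice)]

# Example: 10 s of 30 fps video with a 1 s prediction length gives 10 slices
# of 30 frames each; the slice before the current moment is the historical
# time slice and the following slice is the predicted time slice.
```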
In this embodiment, after the historical time slice is determined, the historical view saliency map also needs to be determined. In machine vision, saliency is a form of image segmentation, and a saliency map is an image that shows the uniqueness of each pixel; it can be understood that a saliency map aims to simplify or change the representation of a general image into a form that is easier to analyze. For example, for the whole picture contained in a panoramic video, the pixels of one region may be shown in red while the pixels of another region are shown in blue, or the pixels of one region may be assigned one label while the pixels of another region are assigned a different label.
On this basis, each panoramic video frame in the historical time slice has a corresponding historical view saliency map and a historical video content saliency map, where both the historical video content saliency map and the historical view saliency map are panoramas. Specifically, the historical view saliency map is a panorama in which, based on the at least one panoramic video frame corresponding to the historical time slice, the region viewed by the user is displayed in red or another highlight color while the regions not viewed by the user are displayed in blue or a dim color; the historical video content saliency map is a panorama in which, based on the at least one panoramic video frame corresponding to the historical time slice, the regions of the video picture likely to interest the user are displayed in red or another highlight color and the regions unlikely to interest the user are displayed in blue or a dim color.
It can be understood that the historical view saliency map is a fused image. Specifically, since a historical time slice usually corresponds to multiple panoramic video frames, after a view saliency map is determined for each panoramic video frame, the multiple view saliency maps can be fused to obtain the historical view saliency map corresponding to the historical time slice; similarly, the historical video content saliency map can be obtained by fusing the video content saliency maps corresponding to the multiple panoramic video frames, which is not described in detail in the embodiments of the present disclosure. It should be noted that, since the region actually viewed by the user may not coincide with the region the server or the client judges likely to interest the user, the salient region in the historical view saliency map may differ from the salient region in the historical video content saliency map. The process of determining the two saliency maps is explained below, and a small fusion sketch follows this paragraph.
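As a minimal sketch only, one possible fusion of per-frame saliency maps into a slice-level map is simple averaging; the disclosure does not specify the fusion operator, so mean pooling and the renormalization step are assumptions.
```python
import numpy as np

def fuse_saliency_maps(per_frame_maps):
    # per_frame_maps: list of HxW arrays in [0, 1], one per panoramic frame
    stacked = np.stack(per_frame_maps, axis=0)   # (T, H, W)
    fused = stacked.mean(axis=0)                 # average over frames in the slice
    return fused / (fused.max() + 1e-8)          # renormalize to [0, 1]
```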
Optionally, when the last video frame in the historical time slice is acquired, determining a historical view saliency map of the historical time slice and a historical video content saliency map; or when the historical video frames in the historical time slices are acquired, updating the historical view saliency map corresponding to the historical time slices and the historical video content saliency map based on the historical video frames.
Continuing with the above example: the user views part of the picture in each of the 30 video frames corresponding to the first time slice. To predict the region the user may view in the video frames corresponding to the second time slice, the server or the client determines the region the user actually viewed in each of those 30 panoramic video frames, displays the pixels of that region in red and the other regions in blue, and then fuses the 30 generated view saliency maps to obtain the historical view saliency map corresponding to the first time slice. At the same time, the regions likely to interest the user in each frame are determined, the pixels of those regions are displayed in red and the other regions in blue, and the 30 generated video content saliency maps are fused to obtain the historical video content saliency map corresponding to the first time slice.
Or, in the process that the user watches each panoramic video frame corresponding to the first time slice, after one panoramic video frame is displayed to the user, the view angle saliency map and the video content saliency map corresponding to the panoramic video frame are determined in real time according to the display mode, and further, the determined view angle saliency map and the determined video content saliency map are fused into the historical view angle saliency map and the historical video content saliency map corresponding to the time slice, so that the historical view angle saliency map and the historical video content saliency map are updated. Of course, in an actual application process, any one of the two ways of determining the saliency maps may be selected according to actual requirements, and this is not specifically limited by the embodiment of the present disclosure.
Optionally, in the process of determining the historical view saliency map and the historical video content saliency map, the historical view information of each historical video frame in the historical time slice may also be determined, and the historical view saliency map corresponding to each piece of historical view information in the corresponding historical video frame, as well as the historical video content saliency map of each historical video frame, may be determined.
Specifically, when the user watches the panoramic video with an AR device or a VR device, the server or the client may further obtain historical view information. For a panoramic video frame that has been played, the historical view information reflects the user's viewing angle in that video frame, for example the Field of View (FOV) with which the user viewed the panoramic video frame. It can be understood that the size of the FOV determines the user's viewing range, and from historical view information such as the FOV the server or the client can at least determine which region of the panoramic picture the user mainly viewed when watching that panoramic video frame. For example, when a user watches a panoramic video through an AR device or a VR device, the user's viewing angle at each moment may be determined with the Inertial Measurement Unit (IMU) in the device, where the IMU is a sensor for detecting and measuring acceleration and rotational motion. Furthermore, when the model for predicting the user's viewing angle is deployed on the server side, the client can feed back the user's FOV information at each moment to the server after acquiring it from the IMU, so that the server can execute the subsequent processing according to the FOV information.
Further, for each historical video frame, the salient region corresponding to the historical view information is determined according to the historical view information of the current historical video frame, and a to-be-fused view saliency map is determined based on the salient region; the current historical video frame is segmented based on a salient region segmentation model to determine the to-be-fused content saliency map of the current historical video frame; the historical view saliency map is determined based on the to-be-fused view saliency maps of the historical video frames; and the historical video content saliency map is determined based on the to-be-fused content saliency maps of the historical video frames.
The salient region is the region of a panoramic video frame viewed by the user under a specific field of view. Since a time slice can correspond to multiple panoramic video frames, after the salient region in any panoramic video frame is determined, the salient region can be displayed in red or another highlight color and the non-salient region in blue or another dim color. The salient region segmentation model can be a pre-trained neural network model: multiple panoramic video frames are input into the model, and the salient region segmentation model outputs a to-be-fused content saliency map for each by identifying and segmenting the video frames. Similarly, after the model outputs the to-be-fused content saliency map corresponding to each panoramic video frame, these maps can be fused to obtain the historical video content saliency map corresponding to the time slice.
For example, after the panoramic video frames contained in a certain historical time slice have been played, the server may, on the one hand, obtain the 30 fields of view for the 30 frames corresponding to that time slice during the user's viewing, determine from them which region of each panoramic video frame the user mainly viewed, i.e., the salient region in each panoramic video frame, and thus generate 30 corresponding to-be-fused view saliency maps; after these are fused, one historical view saliency map corresponding to the historical time slice is obtained. On the other hand, the 30 panoramic video frames corresponding to the historical time slice may be input into the pre-trained salient region segmentation model to obtain 30 corresponding to-be-fused content saliency maps, which are then fused to obtain the historical video content saliency map corresponding to the historical time slice. A sketch of building a per-frame view saliency map from the FOV is given below.
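A hypothetical sketch of building a per-frame view saliency map from the FOV center (yaw/pitch reported by the IMU) in equirectangular coordinates is shown below; the Gaussian falloff, the FOV parameterization, and the neglect of longitude wrap-around are illustrative assumptions, since the disclosure only requires that the viewed region be marked as salient.
```python
import numpy as np

def view_saliency_from_fov(height, width, yaw, pitch, fov_deg=90.0):
    # yaw/pitch (degrees) give the FOV center; returns an HxW map in [0, 1].
    lon = (np.arange(width) / width) * 360.0 - 180.0      # longitude per column
    lat = 90.0 - (np.arange(height) / height) * 180.0     # latitude per row
    lon_grid, lat_grid = np.meshgrid(lon, lat)             # both (H, W)
    # Angular distance from the FOV center (wrap-around ignored for brevity),
    # with a soft Gaussian falloff at the edge of the field of view.
    dist = np.sqrt((lon_grid - yaw) ** 2 + (lat_grid - pitch) ** 2)
    return np.exp(-0.5 * (dist / (fov_deg / 2.0)) ** 2)

# The per-frame content saliency map would instead come from the pre-trained
# salient region segmentation model, e.g. content_map = seg_model(frame).
```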
S120, inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into the pre-trained target view prediction model to obtain the predicted video content saliency map of the predicted time slice.
In this embodiment, after the historical view saliency map and the historical video content saliency map corresponding to the historical time slice are obtained, the current view saliency map also needs to be determined. The current view saliency map is the view saliency map, to be input into the prediction model, that corresponds to the panoramic video frame the user is currently watching; it can be understood that this saliency map at least reflects which part of the picture in the panoramic video frame the user is currently viewing.
Further, the three saliency maps are input into the pre-trained target view prediction model, and the predicted video content saliency map of the predicted time slice (i.e., the time slice following the historical time slice) can be obtained. The target view prediction model may be a model composed of a Fully Convolutional Network (FCN) and a Long Short-Term Memory (LSTM) network. Those skilled in the art will understand that the FCN can perform pixel-level classification of an image, thereby addressing semantic-level segmentation, and that the FCN can accept an input image of any size and generate a prediction for each pixel; the LSTM is a recurrent neural network designed to address the long-term dependency problem of general recurrent neural networks, which is not described further in the embodiments of the present disclosure. The process of determining the predicted video content saliency map based on this model is described below.
In this embodiment, with respect to the LSTM that processes the three kinds of images, the LSTM may be divided into an encoder and a decoder according to the order of data processing in the neural network. On this basis, optionally, the encoder in the target view prediction model processes the historical view saliency map and the historical video content saliency map to determine encoding parameters; and the encoding parameters and the current view saliency map are processed based on the decoder in the target view prediction model to obtain the predicted view saliency map corresponding to the current time slice.
The encoding parameters are determined by the LSTM encoder from the historical view saliency map and the historical video content saliency map, and these parameters can be passed to the LSTM decoder as its initialization parameters. After the LSTM decoder obtains the parameters, it may process the current view saliency map to obtain the predicted view saliency map, which can be understood to be a panorama. This process is illustrated below with reference to fig. 2.
Referring to fig. 2, the time slice length is one second and each time slice corresponds to multiple panoramic video frames. The server or the client determines the ten historical video content saliency maps corresponding to the time slices and, according to the user's FOV trajectory, the corresponding ten historical view saliency maps. The historical video content saliency maps and historical view saliency maps corresponding to the multiple historical time slices are input in turn to the FCN in the target view prediction model for processing, and the processing result is then input to the LSTM encoder in the target view prediction model to obtain the corresponding encoding parameters. The encoding parameters are then passed to the LSTM decoder in the target view prediction model as its initialization parameters; at this point the current view saliency map may be input to the LSTM decoder for processing, and the processing result is input to the FCN model together with the video content saliency map corresponding to the first predicted time slice, yielding the predicted view saliency map corresponding to that predicted time slice.
With reference to fig. 2, the predicted view saliency map corresponding to the first predicted time slice is then input to the LSTM decoder for processing, and the processing result and the video content saliency map corresponding to the second predicted time slice are input to the FCN model together, so as to obtain the predicted view saliency map corresponding to the second predicted time slice; a minimal sketch of this encoder-decoder rollout follows.
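A minimal PyTorch-style sketch of the rollout described above is given below. The FCN, the LSTM encoder/decoder cells and the output head are placeholder modules (assumed to map an input and the previous state to the new state); layer sizes, the concatenation of view and content maps, and the use of the hidden state are all assumptions, since the disclosure fixes only the overall structure.
```python
import torch
import torch.nn as nn

class ViewPredictor(nn.Module):
    # fcn / encoder_cell / decoder_cell / head are placeholder sub-modules;
    # each cell is assumed to map (features, state) -> state, with state = (h, c).
    def __init__(self, fcn, encoder_cell, decoder_cell, head):
        super().__init__()
        self.fcn, self.encoder, self.decoder, self.head = fcn, encoder_cell, decoder_cell, head

    def forward(self, hist_view_maps, hist_content_maps,
                current_view_map, future_content_maps):
        state = None
        # Encoder: FCN features of each historical (view, content) saliency map pair.
        for v, c in zip(hist_view_maps, hist_content_maps):
            feat = self.fcn(torch.cat([v, c], dim=1))
            state = self.encoder(feat, state)
        # Decoder: initialized with the encoder state; each step consumes the latest
        # view saliency map and the predicted slice's content saliency map.
        preds, x = [], current_view_map
        for content in future_content_maps:
            state = self.decoder(x, state)
            x = self.head(torch.cat([state[0], content], dim=1))  # predicted view map
            preds.append(x)
        return preds
```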
As can be understood from fig. 2, in the process of actually predicting the user view angle, when only one time slice before the current time is used as the historical time slice, only the user view angle in one predicted time slice after the current time can be predicted, and when a plurality of time slices are used as the historical time slices, the user view angles in the corresponding number of predicted time slices can also be predicted.
In the process of determining the predicted view saliency map, both the region the user actually watched in the panoramic video and the regions of the panoramic video likely to arouse the user's interest are considered, and the prediction is made from these multiple factors; this prevents the user's subsequent viewing angle from deviating too far from the predicted viewing angle and enhances the prediction accuracy.
It should be noted that, in the process of training the target view prediction model, a training sample set may be determined first. The training sample set includes multiple training samples, and each training sample includes a to-be-trained view saliency map and a to-be-trained video content saliency map within a to-be-trained time slice, as well as the to-be-trained predicted video content saliency map of the to-be-trained predicted time slice.
Further, for each training sample, the to-be-trained view saliency map and the to-be-trained video content saliency map in the current training sample are input into the to-be-trained view prediction model to obtain an output predicted saliency map; a loss value is determined based on the output predicted saliency map and the to-be-trained predicted video content saliency map in the current training sample, and the model parameters of the to-be-trained view prediction model are corrected based on the loss value; the to-be-trained view prediction model is trained with convergence of its loss function as the training target, yielding the target view prediction model.
In this embodiment, the model parameters of the to-be-trained view prediction model may be corrected based on the loss values. Specifically, after the to-be-trained view saliency map and the to-be-trained video content saliency map in the current training sample are input into the to-be-trained view prediction model, a corresponding predicted saliency map can be obtained; at this point, loss values can be determined based on the predicted saliency map and the corresponding to-be-trained predicted video content saliency map. Further, when the loss values are used to correct the model parameters of the to-be-trained view prediction model, convergence of the loss function can be taken as the training target, for example whether the training error is smaller than a preset error, whether the error change has stabilized, or whether the current number of iterations equals a preset number. If the convergence condition is detected to be reached, for example the training error of the loss function is smaller than the preset error or the error trend has stabilized, the training of the to-be-trained view prediction model is finished and the iterative training may be stopped. If it is not yet reached, other training samples can be obtained to continue training the to-be-trained view prediction model until the training error of the loss function falls within the preset range. When the training error of the loss function has converged, the trained model can be used as the target view prediction model; that is, by inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into the target view prediction model, the predicted video content saliency map of the predicted time slice can be obtained. A hypothetical training-loop sketch is given below.
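The following sketch assumes a model with the interface of the ViewPredictor sketch above, a pixel-wise MSE loss between the predicted map and the target map of the next time slice, and an Adam optimizer; the loss choice, optimizer and stopping test are assumptions, since the disclosure only requires correcting the parameters from a loss value until the loss function converges.
```python
import torch

def train(model, samples, epochs=50, lr=1e-4, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    prev_total = float("inf")
    for _ in range(epochs):
        total = 0.0
        for hist_view, hist_content, cur_view, future_content, target in samples:
            pred = model(hist_view, hist_content, cur_view, future_content)[0]
            loss = loss_fn(pred, target)          # loss against the next slice's map
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(prev_total - total) < tol:         # error change has stabilized
            break
        prev_total = total
    return model
```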
Optionally, in the process of determining each training sample in the training sample set, at least one to-be-used historical video may first be obtained; for each to-be-used historical video, the to-be-trained sub-videos of at least one time slice are determined for the current to-be-used historical video according to a preset time slice division rule; the to-be-trained video content saliency map corresponding to each to-be-trained sub-video is determined, and, from the to-be-used user viewing angle of each to-be-trained sub-video, the to-be-trained view saliency map corresponding to that sub-video is determined; according to the time slice identifier of each to-be-trained sub-video, the to-be-trained video content saliency map of the time slice following the current time slice is taken as the to-be-trained predicted video content saliency map of the current time slice; and at least one training sample is determined from the to-be-trained video content saliency map, the to-be-trained view saliency map and the to-be-trained predicted video content saliency map of each to-be-trained sub-video.
For example, after a user has watched a panoramic video, that panoramic video can be used as a to-be-used historical video, and the video is divided according to the preset time slice rule; for example, when the prediction time length is one second, a ten-second video is divided into 10 sub-videos, these sub-videos are the to-be-trained sub-videos, and each sub-video corresponds to one time slice. Furthermore, following the approach of the embodiments of the present disclosure, a corresponding video content saliency map can be determined for each to-be-trained sub-video as the to-be-trained video content saliency map, and, according to the to-be-used user viewing angle, a corresponding view saliency map can be determined for each to-be-trained sub-video as the to-be-trained view saliency map.
After the input of the to-be-trained view prediction model is determined, the output used for training the model also needs to be determined. Continuing the above example, the next time slice adjacent to a given time slice can be determined according to the time slice identifier of the to-be-trained sub-video, and, following the approach of the embodiments of the present disclosure, the video content saliency map corresponding to that next time slice is determined as the to-be-trained predicted video content saliency map; this data is associated with the data used as the input of the to-be-trained view prediction model, yielding one group of training samples in the training sample set, as sketched below.
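A hypothetical sketch of assembling these training samples from one historical video is shown below: each sample pairs a slice's view and content saliency maps (the model input) with the content saliency map of the following slice (the training target), matched by time slice index; the dictionary keys are illustrative.
```python
def build_samples(view_maps_per_slice, content_maps_per_slice):
    # Both arguments are lists indexed by time slice (T1, T2, ...).
    samples = []
    for t in range(len(content_maps_per_slice) - 1):
        samples.append({
            "input_view_map": view_maps_per_slice[t],
            "input_content_map": content_maps_per_slice[t],
            "target_next_content_map": content_maps_per_slice[t + 1],
        })
    return samples
```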
It can be understood that, when the to-be-trained view saliency map and to-be-trained video content saliency map corresponding to one time slice before a given moment and the to-be-trained predicted video content saliency map corresponding to one time slice after that moment are selected to train the model, then in subsequent application the model can predict, for the panoramic video actually being played, the user's viewing angle within one time slice after the current moment; when the maps corresponding to multiple time slices before a given moment and the to-be-trained predicted video content saliency maps corresponding to a corresponding number of time slices after that moment are selected to train the model, then in subsequent application the model can predict the user's viewing angles within multiple time slices after the current moment.
For example, N time slices can be divided from the to-be-used historical video according to the preset prediction time length and marked T1, T2, ..., TN in order; the to-be-trained view prediction model can then be trained to predict the user's viewing angle in the T2 period from T1, or to predict the user's viewing angles in the T3 and T4 periods from T1 and T2.
Specifically, if the model is trained to predict the user's viewing angle in the T2 period from T1 and is then deployed to the server or the client for application, the saliency maps corresponding to a slice of duration T1 may be input to the model for processing so as to predict the user's viewing angle in the subsequent slice of duration T2; correspondingly, if the model is trained to predict the user's viewing angles in the T3 and T4 periods from T1 and T2 and is deployed to the server or the client, the saliency maps corresponding to slices with the total duration of T1 and T2 can be input to the model so as to predict the user's viewing angles in subsequent slices with the total duration of T3 and T4. It can be understood that the number of time slices used for training may be chosen according to the actual situation, and correspondingly the duration of the slices predicted after training should be consistent with the total duration of the time slices used in training, which the embodiments of the present disclosure do not specifically limit.
S130, determining the target video within the predicted time slice based on the predicted video content saliency map.
In this embodiment, after the target view prediction model outputs the predicted video content saliency map, the server or the client may determine the corresponding target video. The target video may be the video in which the pictures of the different regions of the panoramic video are transmitted to the user terminal at differentiated code rates and displayed to the user through the AR device or the VR device.
Optionally, based on the predicted view saliency map, the predicted viewing angle corresponding to each predicted video frame in the predicted time slice is determined, and the code rate information of each region in the predicted view saliency map is adjusted based on the predicted viewing angle, so that the target video within the predicted time slice is delivered based on the code rate information.
The predicted viewing angle is the viewing angle the user is most likely to adopt when subsequently watching the panoramic video frames contained in the predicted time slice. The code rate is the bit rate, i.e., the number of bits transmitted by the server per unit time, in bps.
For example, after the predicted view saliency map corresponding to the predicted time slice is obtained, the region the user is most likely to watch in the subsequent process can be determined from the red and blue regions of each image. The corresponding viewing angle can be determined from the red region of the predicted view saliency map, and the picture under that viewing angle can be transmitted with high-definition or ultra-high-definition coding; correspondingly, the non-viewed angles can be determined from the blue regions of the predicted view saliency map, and the pictures under those angles can be transmitted with low-definition coding. The user side thus receives a panoramic video transmitted at differentiated code rates, and it can be understood that the received panoramic video corresponding to the predicted time slice is the target video. A sketch of mapping the predicted saliency to per-region code rates is given below.
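As an illustration only, the sketch below turns a predicted view saliency map into per-tile code rate decisions: tiles whose mean predicted saliency exceeds a threshold are marked for the high code rate and the rest for the low code rate; the tile size, threshold and the two rate levels are assumptions not taken from this disclosure.
```python
import numpy as np

def assign_tile_bitrates(pred_saliency, tile=64, threshold=0.5,
                         high_kbps=8000, low_kbps=800):
    # pred_saliency: HxW predicted view saliency map with values in [0, 1].
    h, w = pred_saliency.shape
    plan = {}
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            region = pred_saliency[y:y + tile, x:x + tile]
            plan[(y // tile, x // tile)] = high_kbps if region.mean() > threshold else low_kbps
    return plan
```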
It should be noted that, in the embodiments of the present disclosure, the code rate information of each region of the panoramic video to be transmitted may be determined by the server; alternatively, the model may be integrated into the client, the client determines the code rate information of each region of the panoramic video to be transmitted and feeds the determined code rate information back to the server, and finally the server transmits the panoramic video corresponding to the predicted time slice to the client according to the received code rate information. Those skilled in the art will understand that whether the model is deployed to the server or the client may be chosen according to actual needs, which the embodiments of the present disclosure do not specifically limit.
It should further be noted that, in practical application, the prediction result may turn out to be inconsistent with the user's actual viewing angle. Therefore, a thread may be added in advance to monitor the user's subsequent viewing; on this basis, when the prediction result is detected to be inconsistent with the user's viewing angle, the server may use a fast push/pull stream to deliver the picture of the region the user is actually watching to the user side with high-definition or ultra-high-definition coding, thereby guaranteeing the user's viewing experience.
In practical application, the solution of the embodiment of the present disclosure can also be implemented based on the flowchart shown in fig. 3, which is described below with reference to fig. 2.
Referring to fig. 2, the FOV information during the user's viewing of the historical panoramic video is determined and recorded by the AR device or VR device used by the user; the historical view saliency map is then determined from this FOV information and input, together with the historical video content saliency map, into the pre-trained target view prediction model, where the target view prediction model is composed of a 2D LSTM and an FCN. After the model receives the two saliency maps, they can be processed by the LSTM encoder to determine the encoding parameters; the encoding parameters are then passed to the LSTM decoder in the model as initialization parameters, and the view saliency map at the current moment is processed based on the LSTM decoder to obtain the predicted view saliency map of the time slice following the historical time slice. From the predicted view saliency map, the server can determine the code rate information of each region of the panoramic video and encode and transmit the corresponding data to the user side in a differentiated manner according to the code rate information.
According to the technical solution of this embodiment, a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to it are determined; the historical view saliency map, the historical video content saliency map and the current view saliency map are then input into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice; finally, the target video within the predicted time slice is determined based on the predicted video content saliency map. The user's subsequent viewing angle is thus predicted with a machine learning model, the corresponding target video is provided for the user according to the prediction result, and the user's viewing experience is improved. At the same time, the influence of the video content on the user's viewing angle is taken into account in the prediction process, which improves prediction accuracy.
Example two
Fig. 4 is a schematic structural diagram of a video processing apparatus according to a second embodiment of the disclosure. As shown in fig. 4, the apparatus includes: a historical view saliency map determination module 210, a predicted video content saliency map determination module 220, and a target video determination module 230.
A historical view saliency map determination module 210, configured to determine a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice immediately preceding the predicted time slice.
The predicted video content saliency map determination module 220 is configured to input the historical view saliency map, the historical video content saliency map and the current view saliency map into the pre-trained target view prediction model to obtain the predicted video content saliency map of the predicted time slice.
A target video determination module 230, configured to determine the target video within the predicted time slice based on the predicted video content saliency map.
On the basis of the above technical solutions, the historical view saliency map determination module 210 includes a historical view saliency map determination unit and a historical view determination unit.
The historical view saliency map determination unit is used for determining the historical view saliency map and the historical video content saliency map of the historical time slice when the last video frame in the historical time slice is acquired; or, when a historical video frame in the historical time slice is acquired, updating the historical view saliency map and the historical video content saliency map corresponding to the historical time slice based on that historical video frame.
The historical view determination unit is used for determining the historical view information of each historical video frame in the historical time slice.
Optionally, the historical view saliency map determination unit is further used for determining the historical view saliency map corresponding to each piece of historical view information in the corresponding historical video frame, and the historical video content saliency map of each historical video frame.
Optionally, the historical view saliency map determination unit is further used for: for each historical video frame, determining the salient region corresponding to the historical view information according to the historical view information of the current historical video frame, and determining a to-be-fused view saliency map based on the salient region; segmenting the current historical video frame based on the salient region segmentation model to determine the to-be-fused content saliency map of the current historical video frame; determining the historical view saliency map based on the to-be-fused view saliency maps of the historical video frames; and determining the historical video content saliency map based on the to-be-fused content saliency maps of the historical video frames.
On the basis of the above technical solutions, the predicted video content saliency map determination module 220 includes a coding parameter determination unit and a predicted view saliency map determination unit.
And the encoding parameter determining unit is used for processing the historical view saliency map and the historical video content saliency map based on an encoder in the target view prediction model to determine an encoding parameter.
And the predicted view saliency map determining unit is used for processing the coding parameters and the current view saliency map based on a decoder in the target view prediction model to obtain a predicted view saliency map corresponding to the current time slice.
Optionally, the target video determining module 230 is further configured to determine, based on the predicted view saliency map, a predicted viewing view corresponding to each predicted video frame in the predicted time slice, and adjust, based on the predicted viewing view, code rate information of each region in the predicted view saliency map, so as to issue the target video in the predicted time slice based on the code rate information.
On the basis of the technical schemes, the video processing device also comprises a model training module.
The model training module is used for determining a training sample set, where the training sample set includes multiple training samples and each training sample includes a to-be-trained view saliency map and a to-be-trained video content saliency map within a to-be-trained time slice and the to-be-trained predicted video content saliency map of a to-be-trained predicted time slice; for each training sample, inputting the to-be-trained view saliency map and the to-be-trained video content saliency map in the current training sample into the to-be-trained view prediction model to obtain an output predicted saliency map; determining a loss value based on the output predicted saliency map and the to-be-trained predicted video content saliency map in the current training sample, and correcting the model parameters of the to-be-trained view prediction model based on the loss value; and taking convergence of the loss function of the to-be-trained view prediction model as the training target to obtain the target view prediction model.
Optionally, the model training module is further configured to obtain at least one to-be-used historical video; for each to-be-used historical video, determine the to-be-trained sub-videos of at least one time slice for the current to-be-used historical video according to a preset time slice division rule; determine the to-be-trained video content saliency map corresponding to each to-be-trained sub-video and, from the to-be-used user viewing angle of each to-be-trained sub-video, the to-be-trained view saliency map corresponding to that sub-video; according to the time slice identifier of each to-be-trained sub-video, take the to-be-trained video content saliency map of the time slice following the current time slice as the to-be-trained predicted video content saliency map of the current time slice; and determine at least one training sample from the to-be-trained video content saliency map, the to-be-trained view saliency map and the to-be-trained predicted video content saliency map of each to-be-trained sub-video.
On the basis of the above technical solutions, the historical video content saliency map, the historical view saliency map, and the predicted video content saliency map are panoramas.
According to the technical solution provided by this embodiment, a historical view saliency map of a historical time slice and the historical video content saliency map corresponding to it are determined; the historical view saliency map, the historical video content saliency map and the current view saliency map are then input into a pre-trained target view prediction model to obtain a predicted video content saliency map of the predicted time slice; finally, the target video within the predicted time slice is determined based on the predicted video content saliency map. The user's subsequent viewing angle is thus predicted with a machine learning model, the corresponding target video is provided for the user according to the prediction result, and the user's viewing experience is improved. At the same time, the influence of the video content on the user's viewing angle is taken into account in the prediction process, which improves prediction accuracy.
The video processing device provided by the embodiment of the disclosure can execute the video processing method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.
EXAMPLE III
Fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the disclosure. Referring now to fig. 5, a schematic diagram of an electronic device (e.g., the terminal device or the server in fig. 5) 300 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processing unit, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 306 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic device 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 5 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 309, or installed from the storage means 306, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of embodiments of the present disclosure.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The electronic device provided by the embodiment of the present disclosure and the video processing method provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment has the same beneficial effects as the above embodiment.
Example four
The disclosed embodiments provide a computer storage medium having stored thereon a computer program that, when executed by a processor, implements the video processing method provided by the above-described embodiments.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice preceding the predicted time slice;
inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a target view prediction model obtained by pre-training to obtain a predicted video content saliency map of the predicted time slice;
determining a target video within the predicted time slice based on the predicted video content saliency map.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, [ example one ] there is provided a video processing method, the method comprising:
determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice preceding the predicted time slice;
inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a target view prediction model obtained by pre-training to obtain a predicted video content saliency map of the predicted time slice;
determining a target video within the predicted time slice based on the predicted video content saliency map.
According to one or more embodiments of the present disclosure, [ example two ] there is provided a video processing method, further comprising:
optionally, when the last video frame in the historical time slice is acquired, determining the historical view saliency map and the historical video content saliency map of the historical time slice; or,
when a historical video frame in the historical time slice is acquired, updating the historical view saliency map and the historical video content saliency map corresponding to the historical time slice based on the historical video frame.
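The incremental branch above can be sketched as a running accumulator that folds each newly acquired frame into the slice-level maps; the class name and the use of an incremental mean are assumptions for illustration:

import numpy as np

class SliceSaliencyAccumulator:
    def __init__(self):
        self.view_map = None
        self.content_map = None
        self.count = 0

    def update(self, frame_view_map, frame_content_map):
        # fold one historical frame's maps into the running slice-level maps
        self.count += 1
        if self.view_map is None:
            self.view_map = np.asarray(frame_view_map, dtype=float).copy()
            self.content_map = np.asarray(frame_content_map, dtype=float).copy()
        else:
            # incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k
            self.view_map += (frame_view_map - self.view_map) / self.count
            self.content_map += (frame_content_map - self.content_map) / self.count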
According to one or more embodiments of the present disclosure, [ example three ] there is provided a video processing method, further comprising:
optionally, determining historical view information of each historical video frame in the historical time slice;
and determining a historical visual angle saliency map corresponding to each historical visual angle information in the corresponding historical video frame and a historical video content saliency map of each historical video frame.
According to one or more embodiments of the present disclosure, [ example four ] there is provided a video processing method, further comprising:
optionally, for each historical video frame, a salient region corresponding to the historical view information is determined according to the historical view information of the current historical video frame, and a to-be-fused view saliency map is determined based on the salient region; segmentation processing is performed on the current historical video frame based on a salient region segmentation model, and a to-be-fused content saliency map of the current historical video frame is determined;
determining the historical view saliency map based on the to-be-fused view saliency map of each historical video frame; and determining the historical video content saliency map based on the to-be-fused content saliency map of each historical video frame.
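One possible per-frame realization, assuming an equirectangular panorama, a simple rectangular field-of-view window as the salient region, and a callable segmentation_model that returns a per-frame content map (all of these are assumptions, not details fixed by this disclosure):

import numpy as np

def view_to_saliency(view_yaw_pitch, pano_shape, fov=(90, 90)):
    # mark the region the user was looking at on the panorama
    h, w = pano_shape
    yaw, pitch = view_yaw_pitch            # degrees: yaw in [0, 360), pitch in [-90, 90]
    cx = int(yaw / 360.0 * w)
    cy = int((pitch + 90.0) / 180.0 * h)
    dh = int(fov[1] / 180.0 * h / 2)
    dw = int(fov[0] / 360.0 * w / 2)
    sal = np.zeros(pano_shape, dtype=float)
    sal[max(0, cy - dh):min(h, cy + dh), max(0, cx - dw):min(w, cx + dw)] = 1.0
    return sal                             # to-be-fused view saliency map

def fuse_slice_maps(frames, views, segmentation_model, pano_shape):
    view_maps = [view_to_saliency(v, pano_shape) for v in views]
    content_maps = [segmentation_model(f) for f in frames]   # to-be-fused content maps
    # fuse the per-frame maps into the slice-level historical maps (mean fusion assumed)
    return np.mean(view_maps, axis=0), np.mean(content_maps, axis=0)

The rectangular window here does not wrap around the 0/360 degree seam; a production implementation would handle the seam and use the actual projection of the viewport.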
According to one or more embodiments of the present disclosure, [ example five ] there is provided a video processing method, further comprising:
optionally, processing the historical view saliency map and the historical video content saliency map based on an encoder in the target view prediction model, and determining an encoding parameter;
and processing the coding parameters and the current view saliency map based on a decoder in the target view prediction model to obtain a predicted view saliency map corresponding to the current time slice.
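A minimal PyTorch sketch of such an encoder/decoder structure; the layer sizes, channel counts and use of plain convolutions are assumptions, and the saliency maps are assumed to be single-channel tensors of shape (batch, 1, H, W):

import torch
import torch.nn as nn

class EncoderDecoderViewPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: summarizes the historical view and content saliency maps
        # into an encoding (the "coding parameters")
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        # decoder: combines the encoding with the current view saliency map
        self.decoder = nn.Sequential(
            nn.Conv2d(33, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, hist_view_map, hist_content_map, current_view_map):
        encoding = self.encoder(torch.cat([hist_view_map, hist_content_map], dim=1))
        return self.decoder(torch.cat([encoding, current_view_map], dim=1))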
According to one or more embodiments of the present disclosure, [ example six ] there is provided a video processing method, further comprising:
optionally, based on the predicted view saliency map, determining a predicted view corresponding to each predicted video frame in the predicted time slice, and adjusting code rate information of each region in the predicted view saliency map based on the predicted view, so as to issue the target video in the predicted time slice based on the code rate information.
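A sketch of this rate adjustment under a tile-based delivery assumption: the panorama is split into a grid of tiles, tiles whose predicted saliency exceeds a threshold are served at a high bitrate and the rest at a low bitrate, and the target video for the predicted time slice is then issued with these per-tile rates (grid size, rates and threshold are illustrative assumptions):

import numpy as np

def assign_tile_bitrates(pred_saliency, grid=(4, 8),
                         high_kbps=8000, low_kbps=1500, thresh=0.5):
    h, w = pred_saliency.shape
    th, tw = h // grid[0], w // grid[1]
    rates = np.full(grid, low_kbps)
    for r in range(grid[0]):
        for c in range(grid[1]):
            tile = pred_saliency[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            if tile.mean() >= thresh:       # tile predicted to fall in the viewing angle
                rates[r, c] = high_kbps
    return rates                            # per-tile code rate used to issue the target video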
According to one or more embodiments of the present disclosure, [ example seven ] there is provided a video processing method, further comprising:
optionally, a training sample set is determined, where the training sample set includes a plurality of training samples, and the training samples include a to-be-trained view saliency map in a to-be-trained time slice, a to-be-trained video content saliency map, and a to-be-trained predicted video content saliency map of a to-be-trained predicted time slice;
for each training sample, inputting the to-be-trained view saliency map and the to-be-trained video content saliency map in the current training sample into a to-be-trained view prediction model to obtain an output predicted saliency map;
determining a loss value based on the output predicted saliency map and the to-be-trained predicted video content saliency map in the current training sample, and correcting model parameters of the to-be-trained view prediction model based on the loss value;
and taking convergence of the loss function of the to-be-trained view prediction model as the training target to obtain the target view prediction model.
According to one or more embodiments of the present disclosure, [ example eight ] there is provided a video processing method, further comprising:
optionally, at least one historical video to be used is obtained;
for each historical video to be used, dividing the current historical video to be used into at least one to-be-trained sub-video, each corresponding to one time slice, according to a preset time slice division rule;
determining the to-be-trained video content saliency map corresponding to each to-be-trained sub-video and the to-be-used user view of each to-be-trained sub-video, and determining the to-be-trained view saliency map corresponding to each to-be-trained sub-video;
according to the time slice identification of each to-be-trained sub-video, taking the to-be-trained video content saliency map of the time slice following the current time slice as the to-be-trained predicted video content saliency map of the current time slice;
and determining at least one training sample according to the to-be-trained video content saliency map, the to-be-trained view saliency map and the to-be-trained predicted video content saliency map of each to-be-trained sub-video.
According to one or more embodiments of the present disclosure, [ example nine ] there is provided a video processing method, further comprising:
optionally, the historical video content saliency map, the historical perspective saliency map, and the predicted video content saliency map are panoramas.
According to one or more embodiments of the present disclosure, [ example ten ] there is provided a video processing apparatus comprising:
the historical view saliency map determining module is used for determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is a last time slice of the predicted time slices;
the prediction video content saliency map determining module is used for inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a target view prediction model obtained by pre-training to obtain a prediction video content saliency map of the prediction time slice;
a target video determination module for determining a target video within the predicted temporal slice based on the predicted video content saliency map.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (12)

1. A video processing method, comprising:
determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice preceding the predicted time slice;
inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a target view prediction model obtained through pre-training to obtain a predicted video content saliency map of the predicted time slice;
determining a target video within the predicted time slice based on the predicted video content saliency map.
2. The method of claim 1, wherein determining the historical view saliency map of the historical time slice and the historical video content saliency map corresponding to the historical view saliency map comprises:
when the last video frame in the historical time slice is obtained, determining the historical view saliency map and the historical video content saliency map of the historical time slice; or,
when a historical video frame in the historical time slice is acquired, updating the historical view saliency map and the historical video content saliency map corresponding to the historical time slice based on the historical video frame.
3. The method of claim 1, wherein determining the historical view saliency map of the historical time slice and the historical video content saliency map corresponding to the historical view saliency map comprises:
determining historical view angle information of each historical video frame in the historical time slice;
and determining a historical visual angle saliency map corresponding to each historical visual angle information in the corresponding historical video frame and a historical video content saliency map of each historical video frame.
4. The method of claim 3, wherein the determining the corresponding historical view saliency map of each historical view information in the corresponding historical video frame and the historical video content saliency map of each historical video frame comprises:
for each historical video frame, determining a salient region corresponding to the historical view information according to the historical view information of the current historical video frame, and determining a to-be-fused view saliency map based on the salient region; performing segmentation processing on the current historical video frame based on a salient region segmentation model, and determining a to-be-fused content saliency map of the current historical video frame;
determining a historical view saliency map based on a to-be-fused view saliency map of each historical video frame; and determining the historical video content saliency map based on the content saliency map to be fused of each historical video frame.
5. The method according to claim 1, wherein the inputting the historical view saliency map, the historical video content saliency map, and the current view saliency map into a pre-trained target view prediction model to obtain the predicted video content saliency map of the predicted time slice comprises:
processing the historical view saliency map and the historical video content saliency map based on an encoder in the target view prediction model to determine encoding parameters;
and processing the coding parameters and the current view saliency map based on a decoder in the target view prediction model to obtain a predicted view saliency map corresponding to the current time slice.
6. The method of claim 1, wherein determining the target video within the predicted time slice based on the predicted video content saliency map comprises:
determining a predicted view corresponding to each predicted video frame in the predicted time slice based on the predicted view saliency map, adjusting code rate information of each region in the predicted view saliency map based on the predicted view, and issuing a target video in the predicted time slice based on the code rate information.
7. The method of claim 1, further comprising:
determining a training sample set, wherein the training sample set comprises a plurality of training samples, and the training samples comprise a to-be-trained view saliency map in a to-be-trained time slice, a to-be-trained video content saliency map and a to-be-trained predicted video content saliency map of the to-be-trained predicted time slice;
for each training sample, inputting the to-be-trained view saliency map and the to-be-trained video content saliency map in the current training sample into a to-be-trained view prediction model to obtain an output predicted saliency map;
determining a loss value based on the output predicted saliency map and the to-be-trained predicted video content saliency map in the current training sample, and correcting model parameters of the to-be-trained view prediction model based on the loss value;
and taking convergence of the loss function of the to-be-trained view prediction model as the training target to obtain the target view prediction model.
8. The method of claim 7, wherein the determining a training sample set comprises:
acquiring at least one historical video to be used;
aiming at each historical video to be used, determining a sub video to be trained of at least one time slice according to a preset time slice division rule for the current historical video to be used;
determining a video content saliency map to be trained corresponding to the sub-video to be trained, and a user visual angle to be used of each sub-video to be trained, and determining a visual angle saliency map to be trained corresponding to the sub-video to be trained;
according to the time slice identification of each sub video to be trained, taking the video content saliency map to be trained of the next time slice of the current time slice as the predicted video content saliency map to be trained of the current time slice;
and determining at least one training sample according to the video content saliency map to be trained, the visual angle saliency map to be trained and the predicted video content saliency map to be trained of each sub-video to be trained.
9. The method of claim 1, wherein the historical video content saliency map, the historical view saliency map, and the predicted video content saliency map are panoramas.
10. A video processing apparatus, comprising:
the historical view saliency map determining module is used for determining a historical view saliency map of a historical time slice and a historical video content saliency map corresponding to the historical view saliency map; wherein the historical time slice is the time slice preceding the predicted time slice;
the predicted video content saliency map determining module is used for inputting the historical view saliency map, the historical video content saliency map and the current view saliency map into a target view prediction model obtained by pre-training to obtain a predicted video content saliency map of the predicted time slice;
a target video determination module for determining a target video within the predicted time slice based on the predicted video content saliency map.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video processing method of any one of claims 1-9.
12. A storage medium containing computer-executable instructions for performing the video processing method of any of claims 1-9 when executed by a computer processor.
CN202210553339.9A 2022-05-20 2022-05-20 Video processing method and device, electronic equipment and storage medium Pending CN114979652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210553339.9A CN114979652A (en) 2022-05-20 2022-05-20 Video processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210553339.9A CN114979652A (en) 2022-05-20 2022-05-20 Video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114979652A true CN114979652A (en) 2022-08-30

Family

ID=82984521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210553339.9A Pending CN114979652A (en) 2022-05-20 2022-05-20 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114979652A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106507116A (en) * 2016-10-12 2017-03-15 上海大学 A kind of 3D HEVC coding methods that is predicted based on 3D conspicuousnesses information and View Synthesis
US20190373298A1 (en) * 2018-06-01 2019-12-05 At&T Intellectual Property I, L.P. Field of view prediction in live panoramic video streaming
WO2020238560A1 (en) * 2019-05-27 2020-12-03 腾讯科技(深圳)有限公司 Video target tracking method and apparatus, computer device and storage medium
CN110166850A (en) * 2019-05-30 2019-08-23 上海交通大学 The method and system of multiple CNN neural network forecast panoramic video viewing location
CN110248178A (en) * 2019-06-18 2019-09-17 深圳大学 Utilize the viewport prediction technique and system of object tracking and historical track panoramic video
CN112468806A (en) * 2020-11-12 2021-03-09 中山大学 Panoramic video transmission optimization method for cloud VR platform
CN112468828A (en) * 2020-11-25 2021-03-09 深圳大学 Code rate allocation method and device for panoramic video, mobile terminal and storage medium
CN112800276A (en) * 2021-01-20 2021-05-14 北京有竹居网络技术有限公司 Video cover determination method, device, medium and equipment
CN113365156A (en) * 2021-06-17 2021-09-07 合肥工业大学 Panoramic video multicast stream view angle prediction method based on limited view field feedback
CN114449162A (en) * 2021-12-22 2022-05-06 天翼云科技有限公司 Method and device for playing panoramic video, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024099314A1 (en) * 2022-11-08 2024-05-16 抖音视界有限公司 Viewing angle prediction method and apparatus, device, and storage medium
CN116828265A (en) * 2023-08-28 2023-09-29 湖南快乐阳光互动娱乐传媒有限公司 Video control method, system, electronic equipment and readable storage medium
CN116828265B (en) * 2023-08-28 2023-11-28 湖南快乐阳光互动娱乐传媒有限公司 Video control method, system, electronic equipment and readable storage medium
CN118071969A (en) * 2024-04-25 2024-05-24 山东金东数字创意股份有限公司 Method, medium and system for generating XR environment background in real time based on AI

Similar Documents

Publication Publication Date Title
CN114979652A (en) Video processing method and device, electronic equipment and storage medium
US11785195B2 (en) Method and apparatus for processing three-dimensional video, readable storage medium and electronic device
CN110290398B (en) Video issuing method and device, storage medium and electronic equipment
CN114581566A (en) Animation special effect generation method, device, equipment and medium
CN111726675A (en) Object information display method and device, electronic equipment and computer storage medium
CN115761090A (en) Special effect rendering method, device, equipment, computer readable storage medium and product
CN109862019B (en) Data processing method, device and system
CN111327762A (en) Operation track display method and device, electronic equipment and storage medium
CN114445600A (en) Method, device and equipment for displaying special effect prop and storage medium
CN111818265A (en) Interaction method and device based on augmented reality model, electronic equipment and medium
WO2023088104A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN114979762B (en) Video downloading and transmitting method and device, terminal equipment, server and medium
US20230206575A1 (en) Rendering a virtual object in spatial alignment with a pose of an electronic device
CN113259601A (en) Video processing method and device, readable medium and electronic equipment
CN115937291B (en) Binocular image generation method and device, electronic equipment and storage medium
CN116708892A (en) Sound and picture synchronous detection method, device, equipment and storage medium
CN115756158A (en) Visual angle prediction method, device, equipment and storage medium
CN116248889A (en) Image encoding and decoding method and device and electronic equipment
CN114926326A (en) Image processing method, image processing device, electronic equipment and storage medium
CN116847147A (en) Special effect video determining method and device, electronic equipment and storage medium
CN110769129B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN114501041B (en) Special effect display method, device, equipment and storage medium
CN113794836B (en) Bullet time video generation method, device, system, equipment and medium
EP4202611A1 (en) Rendering a virtual object in spatial alignment with a pose of an electronic device
CN115760887A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room B-0035, 2nd Floor, Building 3, No. 30 Shixing Street, Shijingshan District, Beijing

Applicant after: Douyin Vision Co.,Ltd.

Country or region after: Britain

Applicant after: Face Meng Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.

Country or region before: China

Applicant before: Face Meng Ltd.

Country or region before: Britain

Country or region after: China

Address after: Room B-0035, 2nd Floor, Building 3, No. 30 Shixing Street, Shijingshan District, Beijing

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Country or region after: Britain

Applicant after: Face Meng Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China

Applicant before: Face Meng Ltd.

Country or region before: Britain

CB02 Change of applicant information