CN117880478A - Virtual reality video format detection method, device, equipment and storage medium - Google Patents

Virtual reality video format detection method, device, equipment and storage medium

Info

Publication number
CN117880478A
Authority
CN
China
Prior art keywords
video frame
detection result
determining
video
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410047365.3A
Other languages
Chinese (zh)
Inventor
李洋
张亚彬
廖懿婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd and Lemon Inc Cayman Island
Priority to CN202410047365.3A
Publication of CN117880478A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure provides a virtual reality video format detection method, device, equipment and storage medium. The method comprises the following steps: acquiring a plurality of first video frames in a target virtual reality video; determining a second video frame with complex texture based on the texture complexity corresponding to each first video frame; performing single-binocular detection on the second video frame based on the first classification model, and determining a single-binocular detection result; determining a monocular image corresponding to the second video frame based on the monocular detection result; detecting a projection format and a view angle range of the monocular image based on the second classification model, and determining a projection format detection result and a view angle range detection result; and determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame, thereby realizing automatic detection of the virtual reality video format, ensuring the format detection accuracy and improving the format detection efficiency.

Description

Virtual reality video format detection method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to computer technology, and in particular, to a method, apparatus, device, and storage medium for detecting a virtual reality video format.
Background
With the rapid development of computer technology, virtual reality (VR) video can be used to achieve a three-dimensional display effect. Virtual reality video may exist in a variety of video formats, so format detection of the virtual reality video is required in order to perform operations such as video playback according to the specific format.
Currently, the specific format of a virtual reality video is usually determined by manually watching the video. However, this manual detection method is time-consuming and laborious, which reduces the format detection efficiency of the virtual reality video.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device, and a storage medium for detecting a virtual reality video format, so as to implement automatic detection of the virtual reality video format, and improve format detection efficiency while ensuring format detection accuracy.
In a first aspect, an embodiment of the present disclosure provides a method for detecting a virtual reality video format, including:
acquiring a plurality of first video frames in a target virtual reality video to be detected;
Determining texture complexity corresponding to each first video frame, and determining a second video frame with complex texture from a plurality of first video frames based on the texture complexity;
performing single-binocular detection on the second video frame based on a first classification model, and determining a single-binocular detection result corresponding to the second video frame;
determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
In a second aspect, an embodiment of the present disclosure further provides a virtual reality video format detection apparatus, including:
the first video frame acquisition module is used for acquiring a plurality of first video frames in the target virtual reality video to be detected;
a second video frame determining module, configured to determine a texture complexity corresponding to each first video frame, and determine a second video frame with a complex texture from a plurality of first video frames based on the texture complexity;
The single-binocular detection module is used for carrying out single-binocular detection on the second video frame based on the first classification model and determining a single-binocular detection result corresponding to the second video frame;
the monocular image determining module is used for determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
the projection view angle detection module is used for detecting the projection format and the view angle range of the monocular image based on a second classification model and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and the format detection result determining module is used for determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the virtual reality video format detection method as described in any of the embodiments of the disclosure.
In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer executable instructions, which when executed by a computer processor, are for performing the virtual reality video format detection method as described in any of the disclosed embodiments.
According to the embodiment of the disclosure, the plurality of first video frames in the target virtual reality video to be detected are obtained, and the second video frames with complex textures are determined based on the texture complexity corresponding to each first video frame, so that the video frames with simple textures are filtered, and the accuracy of format detection is further ensured. And carrying out single-binocular detection on the second video frame based on the first classification model, and determining a single-binocular detection result corresponding to the second video frame, thereby realizing the single-binocular automatic detection of the video by utilizing the first classification model. And determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame, detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame, thereby realizing automatic detection of the video projection format and the view angle range by using the second classification model. Based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame, the target format detection result corresponding to the target virtual reality video can be accurately determined, so that automatic detection of the virtual reality video format is realized, manual participation is not needed, and format detection accuracy is ensured and format detection efficiency is improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flowchart of a virtual reality video format detection method according to an embodiment of the present disclosure;
FIG. 2 is an example diagram of a virtual reality video format detection process according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of another method for detecting a virtual reality video format according to an embodiment of the present disclosure;
FIG. 4 is an exemplary diagram of a network architecture of a first classification model according to an embodiment of the disclosure;
fig. 5 is a schematic structural diagram of a virtual reality video format detecting apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are intended to be illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
Fig. 1 is a schematic flow chart of a method for detecting the format of a virtual reality video provided by an embodiment of the present disclosure. The embodiment of the present disclosure is suitable for detecting the format of a virtual reality video uploaded by a user or of an existing virtual reality video in a database. The method may be performed by a virtual reality video format detection apparatus, which may be implemented in the form of software and/or hardware, optionally by an electronic device, where the electronic device may be a mobile terminal, a PC, a server, or the like.
As shown in fig. 1, the method for detecting the virtual reality video format specifically includes the following steps:
s110, acquiring a plurality of first video frames in the target virtual reality video to be detected.
The target virtual reality video may refer to any one virtual reality video in a format that currently needs to be detected. The first video frame may refer to an original video frame in the target virtual reality video.
Specifically, decoding and frame extraction processing can be performed on the target virtual reality video to be detected, and a plurality of extracted first video frames are obtained. For example, one frame is extracted every N frames in the target virtual reality video, and each extracted video frame is used as a first video frame.
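Illustratively (no code is given in the original disclosure), a minimal frame-extraction sketch using OpenCV might look as follows; the sampling interval and the function name are illustrative assumptions.

```python
import cv2

def extract_first_video_frames(video_path, interval=30):
    """Decode the target virtual reality video and keep one frame every `interval` frames."""
    capture = cv2.VideoCapture(video_path)
    first_frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % interval == 0:  # extract one frame every N frames
            first_frames.append(frame)
        index += 1
    capture.release()
    return first_frames
```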
S120, determining texture complexity corresponding to each first video frame, and determining a second video frame with complex texture from the plurality of first video frames based on the texture complexity.
Wherein the texture complexity may be used to characterize the image texture complexity of the first video frame. The second video frame refers to a first video frame having a complex texture; there are usually a plurality of second video frames. A first video frame with a simple texture cannot represent the video format, so no subsequent format detection operation needs to be performed on it, which ensures the accuracy of format detection. A first video frame with a simple texture may be, for example, a black frame without picture content, a video frame with a solid-color background, a video frame with little content, or a frame at the very beginning or end of the video.
Specifically, the texture complexity corresponding to each first video frame may be determined by image edge extraction or the like. And comparing the texture complexity corresponding to each first video frame with a preset complexity threshold, determining the first video frames with the texture complexity smaller than the preset complexity threshold as first video frames with simple textures, and determining the first video frames with the texture complexity larger than or equal to the preset complexity threshold as second video frames with complex textures. Referring to fig. 2, the first video frame with simple texture is filtered, and only the second video frame with complex texture is reserved, so that subsequent format detection is performed based on the second video frame, and accuracy of format detection is guaranteed.
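Illustratively, the threshold comparison could be sketched as follows (an assumption not found in the original disclosure); the threshold value of 10.0 is made up for illustration, and the texture complexity values are assumed to be computed as in the edge-detection sketch given later.

```python
def select_complex_texture_frames(first_frames, complexities, complexity_threshold=10.0):
    """Keep only frames whose texture complexity reaches the preset complexity threshold.

    `complexities[i]` is the texture complexity of `first_frames[i]` (see the
    Sobel-based sketch later); the threshold value 10.0 is an assumption.
    """
    return [frame for frame, complexity in zip(first_frames, complexities)
            if complexity >= complexity_threshold]
```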
S130, performing single-binocular detection on the second video frame based on the first classification model, and determining a single-binocular detection result corresponding to the second video frame.
Wherein, the single-binocular detection result may include: a monocular type (i.e., a 2D type), a binocular left-right type (i.e., a 3D left-right type), or a binocular up-down type (i.e., a 3D up-down type). The first classification model may be a neural network model for detecting any architecture of a single binocular type. For example, the first classification model may be a three-classification network to determine whether the second video frame is of a monocular type, a binocular side-to-side type, or a binocular top-to-bottom type.
Specifically, for each second video frame, if the image size of the second video frame is not the preset image size, downsampling is performed on the second video frame to obtain a second video frame having the preset image size, where the preset image size refers to the input image size of the first classification model. The second video frame with the preset image size is input into the first classification model for single-binocular detection; the first classification model can output probability values of the second video frame being of the monocular type, the binocular left-right type, and the binocular up-down type respectively, and the type corresponding to the maximum probability value is determined as the single-binocular detection result corresponding to the second video frame. Illustratively, grayscale processing may be performed on the second video frame with the preset image size, and the grayscale image corresponding to the second video frame may be input into the first classification model for single-binocular detection, so that the single-binocular detection efficiency for the second video frame can be further improved.
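Illustratively, a minimal PyTorch inference sketch for this step might look as follows, assuming the first classification model is already trained and accepts a single-channel image of an assumed 224x224 preset size; none of this code comes from the original disclosure.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

MONO_BINOCULAR_LABELS = ["monocular", "binocular_left_right", "binocular_top_bottom"]

def detect_mono_binocular(first_model, frame_bgr, input_size=(224, 224)):
    """Classify one second video frame as monocular, binocular left-right, or binocular top-bottom."""
    resized = cv2.resize(frame_bgr, input_size, interpolation=cv2.INTER_AREA)    # downsample to the preset size
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0  # optional grayscale step
    tensor = torch.from_numpy(gray)[None, None]                                  # shape (1, 1, H, W)
    with torch.no_grad():
        probabilities = F.softmax(first_model(tensor), dim=1)[0]
    return MONO_BINOCULAR_LABELS[int(probabilities.argmax())]                    # type with the largest probability
```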
And S140, determining a monocular image corresponding to the second video frame based on the monocular detection result corresponding to the second video frame.
Specifically, referring to fig. 2, based on a single-binocular detection result corresponding to the second video frame, whether the second video frame is a monocular image or a binocular image may be determined, and if the second video frame is a binocular image, clipping and scaling processing are required to be performed on any monocular view content in the second video frame, so as to obtain a monocular image with a preset image size corresponding to the second video frame.
Illustratively, S140 may include: if the single-binocular detection result corresponding to the second video frame is of a binocular left-right type or a binocular up-down type, cutting and scaling the second video frame to determine a monocular image with a preset image size; and if the single-binocular detection result corresponding to the second video frame is of a monocular type, determining the second video frame as a monocular image.
Specifically, when the single-binocular detection result corresponding to the second video frame is of a binocular left-right type, a left half part or a right half part in the second video frame may be cut out, and scaling processing is performed on the left half part or the right half part to obtain a monocular image with a preset image size. When the single-binocular detection result corresponding to the second video frame is of a binocular top-bottom type, the upper half part or the lower half part in the second video frame can be cut out, and scaling processing is carried out on the upper half part or the lower half part to obtain a monocular image with a preset image size. When the monocular detection result corresponding to the second video frame is of a monocular type, the second video frame can be directly determined to be a monocular image with a preset image size.
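Illustratively, the cropping and scaling step could be sketched as follows (an illustration under stated assumptions, not the original implementation); the label strings and the 224x224 preset size are assumed.

```python
import cv2

def to_monocular_image(frame_bgr, mono_binocular_result, input_size=(224, 224)):
    """Crop one view out of a binocular frame and scale it to the preset image size."""
    height, width = frame_bgr.shape[:2]
    if mono_binocular_result == "binocular_left_right":
        view = frame_bgr[:, : width // 2]    # keep the left half (the right half would work equally well)
    elif mono_binocular_result == "binocular_top_bottom":
        view = frame_bgr[: height // 2, :]   # keep the upper half
    else:
        view = frame_bgr                     # monocular frames are used directly
    return cv2.resize(view, input_size, interpolation=cv2.INTER_AREA)
```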
And S150, detecting the projection format and the view angle range of the monocular image based on the second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame.
The projection format may refer to the manner in which information is mapped in the virtual reality video. For example, the projection format may include: plane format, ERP (Equirectangular Projection, equidistant cylindrical projection) format, and EAC (Equi-Angular Cubemap projection, cubic projection) format. The view angle range may refer to the range of viewing angles over which the virtual reality video is captured. For example, the view angle range may include 180 degrees and 360 degrees. After combining the projection format and the view angle range, there may be five types, respectively: plane, ERP 180, ERP 360, EAC 180, and EAC 360. The second classification model may be a neural network model of any architecture for detecting the projection format and the view angle range. For example, the second classification model may be a five-classification network to determine which of the five combined types the second video frame belongs to.
Specifically, the monocular image with the preset image size is input into the second classification model to detect the projection format and the view angle range. The second classification model can output probability values of the second video frame for each projection format and view angle range type, and the type corresponding to the maximum probability value is determined as the projection format detection result and the view angle range detection result corresponding to the second video frame. Illustratively, grayscale processing can be performed on the monocular image with the preset image size, and the grayscale image corresponding to the monocular image can be input into the second classification model to detect the projection format and the view angle range, so that the detection efficiency of the projection format and the view angle range can be further improved.
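Illustratively, inference with the second classification model could be sketched as follows, assuming a trained five-class network and a preprocessed grayscale tensor; the label strings are assumptions, not terms from the original disclosure.

```python
import torch
import torch.nn.functional as F

PROJECTION_FOV_LABELS = ["plane", "erp_180", "erp_360", "eac_180", "eac_360"]

def detect_projection_and_fov(second_model, monocular_tensor):
    """Classify a preprocessed monocular image into one of the five projection/view-angle types.

    `monocular_tensor` is assumed to be a (1, 1, H, W) grayscale tensor prepared
    as in the earlier sketches.
    """
    with torch.no_grad():
        probabilities = F.softmax(second_model(monocular_tensor), dim=1)[0]
    return PROJECTION_FOV_LABELS[int(probabilities.argmax())]
```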
S160, determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
The target format detection result may refer to the final format prediction result of the target virtual reality video. The target format detection result includes: the single-binocular type to which the target virtual reality video belongs, a target projection format, and a target view angle range.
Specifically, based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to each second video frame, statistics is carried out on each format detection result, and a final target format detection result is determined based on the statistics result, for example, the format detection result with the largest occurrence is determined as the target format detection result, so that automatic detection of the virtual reality video format is realized, manual participation is not needed, and format detection accuracy is ensured.
Illustratively, S160 may include: determining the number of second video frames corresponding to each format detection result based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frames; and determining the format detection result with the largest second video frame number as a target format detection result corresponding to the target virtual reality video.
Specifically, comparing a single-binocular detection result, a projection format detection result and a view angle range detection result corresponding to each second video frame, and counting the number of the second video frames with the same single-binocular detection result, the same projection format detection result and the same view angle range detection result to obtain the number of the second video frames corresponding to each format detection result. Each format detection result comprises a single binocular type, a projection format and a view angle range. And determining the format detection result with the largest occurrence number as a target format detection result corresponding to the target virtual reality video, so that the format misjudgment condition caused by detecting only a single video frame can be avoided, and the accuracy of format detection is further improved.
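Illustratively, the statistics over per-frame results could be a simple majority vote, as in the following sketch; the tuple layout of each per-frame result is an assumption.

```python
from collections import Counter

def determine_target_format(per_frame_results):
    """Return the (mono/binocular, projection, view-angle) triple shared by the most second video frames.

    Each element of `per_frame_results` is assumed to be a tuple such as
    ("binocular_left_right", "erp", "180").
    """
    counts = Counter(per_frame_results)
    target_format, _ = counts.most_common(1)[0]
    return target_format
```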
According to the technical scheme, the plurality of first video frames in the target virtual reality video to be detected are obtained, and the second video frames with complex textures are determined based on the texture complexity corresponding to each first video frame, so that the video frames with simple textures are filtered, and the accuracy of format detection is further guaranteed. And carrying out single-binocular detection on the second video frame based on the first classification model, and determining a single-binocular detection result corresponding to the second video frame, thereby realizing the single-binocular automatic detection of the video by utilizing the first classification model. And determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame, detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame, thereby realizing automatic detection of the video projection format and the view angle range by using the second classification model. Based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame, the target format detection result corresponding to the target virtual reality video can be accurately determined, so that automatic detection of the virtual reality video format is realized, manual participation is not needed, and format detection accuracy is ensured and format detection efficiency is improved.
Based on the above technical solution, the "determining the texture complexity corresponding to each first video frame" in S120 may include: determining a gray image corresponding to the first video frame; performing edge detection on the gray level image to determine a horizontal gradient and a vertical gradient corresponding to each pixel point in the gray level image; and determining texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point.
Specifically, for each first video frame, grayscale processing is performed on the first video frame to obtain a grayscale image corresponding to the first video frame. Edge detection can then be performed on the grayscale image using the Sobel operator to obtain the first-order gradient at each pixel position in the grayscale image, namely the horizontal gradient Grad_x and the vertical gradient Grad_y corresponding to each pixel point. By performing statistical analysis on the horizontal gradient Grad_x and the vertical gradient Grad_y corresponding to each pixel point, the texture complexity corresponding to the first video frame can be accurately obtained.
Illustratively, determining the texture complexity for the first video frame based on the horizontal gradient and the vertical gradient for each pixel point may include: determining the gradient intensity corresponding to each pixel point in the gray level image based on the horizontal gradient and the vertical gradient corresponding to each pixel point; and determining texture complexity corresponding to the first video frame based on the gradient strength corresponding to each pixel point.
Specifically, the absolute value of the horizontal gradient and the absolute value of the vertical gradient corresponding to each pixel point may be averaged, and the obtained average value is determined as the gradient intensity corresponding to the corresponding pixel point. And carrying out average processing on the gradient intensities corresponding to all the pixel points, and determining the obtained average value as texture complexity corresponding to the first video frame. The texture complexity corresponding to the first video frame can be more accurately determined by utilizing the edge detection mode, so that the accuracy of format detection is further improved.
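Illustratively, the edge-detection-based texture complexity could be computed as in the following sketch, assuming the Sobel operator with a 3x3 kernel; the original disclosure does not fix the kernel size or the exact normalization.

```python
import cv2
import numpy as np

def texture_complexity(frame_bgr):
    """Mean gradient intensity of a frame, used as its texture complexity."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    grad_x = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)            # horizontal gradient Grad_x
    grad_y = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)            # vertical gradient Grad_y
    gradient_intensity = (np.abs(grad_x) + np.abs(grad_y)) / 2.0   # per-pixel gradient intensity
    return float(gradient_intensity.mean())                        # average over all pixel points
```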
On the basis of the above technical scheme, the first classification model and the second classification model are trained in advance based on a sample data set, so as to ensure the accuracy of model detection. The first classification model and the second classification model may be constructed with the same network architecture or with different network architectures. The sample data can be expanded by converting the format of existing videos and cropping them, so as to obtain a sample data set containing various types of training samples. The first classification model and the second classification model are trained based on the sample data set; for example, the accuracy of the image classification prediction results is quantified through a cross-entropy loss function, and the classification model parameters are updated and optimized accordingly, so that the trained first classification model can accurately perform single-binocular detection and the trained second classification model can accurately detect the projection format and the view angle range.
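Illustratively, training either classification model with a cross-entropy loss could be sketched as follows; the optimizer, learning rate, and epoch count are assumptions not stated in the original disclosure.

```python
import torch
import torch.nn as nn

def train_classifier(model, data_loader, epochs=10, learning_rate=1e-3):
    """Train a classification model with cross-entropy loss over a labelled sample data set."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    for _ in range(epochs):
        for images, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # quantify the prediction error
            loss.backward()
            optimizer.step()                         # update and optimize the model parameters
    return model
```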
Fig. 3 is a schematic flow chart of another method for detecting a virtual reality video format according to an embodiment of the present disclosure, where specific structures of a first classification model and a second classification model are optimized based on the above disclosed embodiment, and a detailed description is made of a virtual reality video format detection process based on the optimized structures. Wherein the same or corresponding terms as those of the above-described embodiments are not explained in detail herein.
As shown in fig. 3, the method for detecting the virtual reality video format specifically includes the following steps:
s310, acquiring a plurality of first video frames in the target virtual reality video to be detected.
S320, determining texture complexity corresponding to each first video frame, and determining a second video frame with complex texture from the plurality of first video frames based on the texture complexity.
S330, inputting the second video frame into a first convolution sub-model of the first classification model to perform feature extraction, and obtaining local feature information corresponding to the second video frame.
Wherein the first classification model may comprise: a first convolution sub-model, a first multi-headed attention sub-model, and a first classification sub-model. The first convolution sub-model may be a network of feature extraction by convolution. For example, the first convolution sub-model may include a plurality of convolution layers connected in series, through which local feature information corresponding to the second video frame may be extracted.
Specifically, for each second video frame, referring to fig. 4, the second video frame having a preset image size is input into the first convolution sub-model of the first classification model. And the first convolution sub-model extracts local features of the second video frame through the convolution layer to obtain local feature information corresponding to the second video frame.
Illustratively, referring to fig. 4, the first convolution sub-model may include a feature fusion layer and at least two convolution layers connected in series. Each convolution layer may comprise a 3 x 3 convolution kernel. Accordingly, S330 may include: inputting a second video frame into the first convolution layer, and sequentially carrying out convolution processing based on the connection sequence of the convolution layers to obtain target characteristic information output by each convolution layer; and inputting the target characteristic information output by each convolution layer into a characteristic fusion layer for fusion processing to obtain local characteristic information corresponding to the second video frame.
Specifically, local characteristic information under different receptive fields can be obtained through a small number of serially connected convolution layers and characteristic fusion layers, so that the convolution processing effect is improved, and further, the model detection accuracy is also improved. For example, the first convolution sub-model may include a feature fusion layer and three convolution layers connected in series. And inputting the second video frame into the first convolution layer to obtain target characteristic information output by the first convolution layer. And inputting the target characteristic information output by the first convolution layer into the second convolution layer to obtain the target characteristic information output by the second convolution layer. And inputting the target characteristic information output by the second convolution layer into a third convolution layer to obtain the target characteristic information output by the third convolution layer. And inputting the target characteristic information output by the first convolution layer, the target characteristic information output by the second convolution layer and the target characteristic information output by the third convolution layer into a characteristic fusion layer for fusion processing, for example, splicing the three output target characteristic information to obtain local characteristic information corresponding to the second video frame.
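Illustratively, the first convolution sub-model with three serial 3x3 convolution layers and a concatenation-based feature fusion layer could be sketched as follows; the channel counts and activation functions are assumptions.

```python
import torch
import torch.nn as nn

class ConvSubModel(nn.Module):
    """Three serial 3x3 convolution layers whose outputs are fused by concatenation."""

    def __init__(self, in_channels=1, channels=32):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())

    def forward(self, x):
        feat1 = self.conv1(x)      # target feature information of the first convolution layer
        feat2 = self.conv2(feat1)  # second convolution layer
        feat3 = self.conv3(feat2)  # third convolution layer
        return torch.cat([feat1, feat2, feat3], dim=1)  # feature fusion layer (concatenation)
```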
S340, inputting the local feature information into the first multi-head attention sub-model to perform attention processing, and obtaining global feature information corresponding to the second video frame.
The first multi-head attention sub-model can repeatedly execute attention processing for a plurality of times in parallel, so that global characteristic information of the second video frame is accurately extracted.
Specifically, referring to fig. 4, the local feature information is input into the first multi-head attention sub-model to perform attention processing for multiple times, so as to obtain global feature information corresponding to the second video frame. The global feature information can be extracted more quickly and accurately by using a small quantity of convolution layers and a multi-head attention mechanism, so that the computational complexity is reduced.
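Illustratively, the multi-head attention processing over the fused local features could be sketched as follows; the embedding width, head count, and the mean pooling used to obtain a single global feature vector are assumptions.

```python
import torch.nn as nn

class AttentionGlobalFeatures(nn.Module):
    """Multi-head self-attention over the fused local features, pooled into a global feature vector."""

    def __init__(self, in_channels=96, embed_dim=128, num_heads=4):
        super().__init__()
        self.project = nn.Conv2d(in_channels, embed_dim, kernel_size=1)  # map fused channels to the embedding width
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, local_features):
        tokens = self.project(local_features)                 # (B, E, H, W)
        tokens = tokens.flatten(2).transpose(1, 2)            # (B, H*W, E) sequence of spatial tokens
        attended, _ = self.attention(tokens, tokens, tokens)  # multi-head self-attention
        return attended.mean(dim=1)                           # (B, E) global feature vector
```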
S350, inputting the global feature information into the first classification sub-model to carry out single-binocular classification, and obtaining a single-binocular detection result corresponding to the second video frame.
Wherein the first classification sub-model may be a network mapping global feature information to classification results. For example, the first classification sub-model may be a full connection layer or a multi-layer perceptron, or the like.
Specifically, the global feature information is input into the first classification sub-model to perform monocular and binocular classification, so that a monocular and binocular detection result corresponding to the second video frame can be accurately obtained.
S360, determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame.
And S370, inputting the monocular image into a second convolution submodel for feature extraction, and obtaining local feature information corresponding to the monocular image.
The second classification model may have the same network architecture as the first classification model. The second classification model may also include: a second convolution sub-model, a second multi-headed attention sub-model, and a second classification sub-model. The second convolution sub-model may be a network of feature extraction by convolution. For example, the second convolution sub-model may include a plurality of convolution layers connected in series, through which local feature information corresponding to the monocular image may be extracted.
Specifically, a monocular image having a preset image size is input into a second convolution sub-model of a second classification model. And the second convolution sub-model performs local feature extraction on the input monocular image through the convolution layer to obtain local feature information corresponding to the monocular image.
Illustratively, the network architecture of the second convolution sub-model is also the same as that of the first convolution sub-model. For example, the second convolution sub-model includes a feature fusion layer and at least two convolution layers connected in series. Each convolution layer may comprise a 3 x 3 convolution kernel. Accordingly, S370 may include: inputting a monocular image into a first convolution layer, and sequentially carrying out convolution processing based on the connection sequence of the convolution layers to obtain target characteristic information output by each convolution layer; and inputting the target characteristic information output by each convolution layer into a characteristic fusion layer for fusion processing to obtain local characteristic information corresponding to the monocular image.
Specifically, local characteristic information under different receptive fields can be obtained through a small number of serially connected convolution layers and characteristic fusion layers, so that the convolution processing effect is improved, and further, the model detection accuracy is also improved. For example, the second convolution sub-model may include a feature fusion layer and three convolution layers connected in series. And inputting the monocular image into the first convolution layer to obtain target characteristic information output by the first convolution layer. And inputting the target characteristic information output by the first convolution layer into the second convolution layer to obtain the target characteristic information output by the second convolution layer. And inputting the target characteristic information output by the second convolution layer into a third convolution layer to obtain the target characteristic information output by the third convolution layer. And inputting the target characteristic information output by the first convolution layer, the target characteristic information output by the second convolution layer and the target characteristic information output by the third convolution layer into a characteristic fusion layer for fusion processing, for example, splicing the three output target characteristic information to obtain local characteristic information corresponding to the monocular image.
And S380, inputting the local characteristic information into a second multi-head attention sub-model to perform attention processing, and obtaining global characteristic information corresponding to the monocular image.
Wherein the second multi-headed attention sub-model may repeatedly perform the attention process a plurality of times in parallel, thereby accurately extracting global feature information of the monocular image.
Specifically, the local characteristic information of the monocular image is input into a second multi-head attention sub-model to perform attention processing for a plurality of times, and global characteristic information corresponding to the monocular image is obtained. The global characteristic information of the monocular image can be extracted more quickly and accurately by using a small quantity of convolution layers and a multi-head attention mechanism, and the calculation complexity is reduced.
S390, inputting the global characteristic information into the second classification sub-model to detect the projection format and the view angle range, and obtaining a projection format detection result and a view angle range detection result corresponding to the second video frame.
Wherein the second classification sub-model may be a network mapping global feature information to classification results. For example, the second classification sub-model may be a full connection layer or a multi-layer perceptron, or the like.
Specifically, the global feature information of the monocular image is input into the second classification sub-model to detect the projection format and the view angle range, so that the projection format detection result and the view angle range detection result corresponding to the second video frame can be accurately obtained.
S391, determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
According to the technical scheme, global characteristic information can be extracted more quickly and accurately by means of combining a small number of convolution layers and a multi-head attention mechanism, and the calculation complexity is reduced, so that the efficiency and accuracy of monocular and binocular detection, the efficiency and accuracy of projection format and visual angle range detection are improved, and the efficiency and accuracy of virtual reality video format detection are further improved.
Fig. 5 is a schematic structural diagram of a virtual reality video format detection apparatus according to an embodiment of the disclosure, where, as shown in fig. 5, the apparatus specifically includes: a first video frame acquisition module 510, a second video frame determination module 520, a single-binocular detection module 530, a monocular image determination module 540, a projection view detection module 550, and a format detection result determination module 560.
The first video frame obtaining module 510 is configured to obtain a plurality of first video frames in a target virtual reality video to be detected; a second video frame determining module 520, configured to determine a texture complexity corresponding to each first video frame, and determine a second video frame having a complex texture from the plurality of first video frames based on the texture complexity; the single-binocular detection module 530 is configured to perform single-binocular detection on the second video frame based on the first classification model, and determine a single-binocular detection result corresponding to the second video frame; a monocular image determining module 540, configured to determine a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame; a projection view angle detection module 550, configured to detect a projection format and a view angle range of the monocular image based on a second classification model, and determine a projection format detection result and a view angle range detection result corresponding to the second video frame; the format detection result determining module 560 is configured to determine a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result, and the view angle range detection result corresponding to the second video frame.
According to the technical scheme provided by the embodiment of the disclosure, the plurality of first video frames in the target virtual reality video to be detected are obtained, and the second video frames with complex textures are determined based on the texture complexity corresponding to each first video frame, so that the video frames with simple textures are filtered, and the accuracy of format detection is further ensured. And carrying out single-binocular detection on the second video frame based on the first classification model, and determining a single-binocular detection result corresponding to the second video frame, thereby realizing the single-binocular automatic detection of the video by utilizing the first classification model. And determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame, detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame, thereby realizing automatic detection of the video projection format and the view angle range by using the second classification model. Based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame, the target format detection result corresponding to the target virtual reality video can be accurately determined, so that automatic detection of the virtual reality video format is realized, manual participation is not needed, and format detection accuracy is ensured and format detection efficiency is improved.
Based on the above technical solution, the second video frame determining module 520 includes:
a gray image determining unit, configured to determine a gray image corresponding to the first video frame;
the edge detection unit is used for carrying out edge detection on the gray level image and determining a horizontal gradient and a vertical gradient corresponding to each pixel point in the gray level image;
and the texture complexity determining unit is used for determining the texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point.
Based on the above technical solutions, the texture complexity determining unit is specifically configured to:
determining the gradient intensity corresponding to each pixel point in the gray level image based on the horizontal gradient and the vertical gradient corresponding to each pixel point; and determining texture complexity corresponding to the first video frame based on the gradient strength corresponding to each pixel point.
On the basis of the above technical solutions, the first classification model includes: a first convolution sub-model, a first multi-headed attention sub-model, and a first classification sub-model;
a single binocular detection module 530 comprising:
the local feature information determining subunit is used for inputting the second video frame into the first convolution submodel to perform feature extraction, so as to obtain local feature information corresponding to the second video frame;
The global feature information determining subunit is used for inputting the local feature information into the first multi-head attention sub-model to perform attention processing and obtain global feature information corresponding to the second video frame;
and the single-binocular detection result determining subunit is used for inputting the global characteristic information into the first classification sub-model to carry out single-binocular classification, so as to obtain a single-binocular detection result corresponding to the second video frame.
On the basis of the technical schemes, the first convolution sub-model comprises a feature fusion layer and at least two convolution layers connected in series;
the local characteristic information determining subunit is specifically configured to: inputting the second video frame into a first convolution layer, and sequentially carrying out convolution processing based on the connection sequence of the convolution layers to obtain target characteristic information output by each convolution layer; and inputting the target characteristic information output by each convolution layer into a characteristic fusion layer for fusion processing to obtain local characteristic information corresponding to the second video frame.
Based on the above technical solutions, the monocular image determining module 540 is specifically configured to:
if the single-binocular detection result corresponding to the second video frame is of a binocular left-right type or a binocular up-down type, cutting and scaling the second video frame to determine a monocular image with a preset image size; and if the single-binocular detection result corresponding to the second video frame is of a monocular type, determining the second video frame as a monocular image.
Based on the above technical solutions, the format detection result determining module 560 is specifically configured to:
determining the number of second video frames corresponding to each format detection result based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frames; and determining the format detection result with the largest second video frame number as a target format detection result corresponding to the target virtual reality video.
The virtual reality video format detection device provided by the embodiment of the disclosure can execute the virtual reality video format detection method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 6, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 6) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The electronic device provided by the embodiment of the present disclosure and the virtual reality video format detection method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the virtual reality video format detection method provided by the above embodiment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a plurality of first video frames in a target virtual reality video to be detected; determine texture complexity corresponding to each first video frame, and determine a second video frame with complex texture from a plurality of first video frames based on the texture complexity; perform single-binocular detection on the second video frame based on a first classification model, and determine a single-binocular detection result corresponding to the second video frame; determine a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame; detect a projection format and a view angle range of the monocular image based on a second classification model, and determine a projection format detection result and a view angle range detection result corresponding to the second video frame; and determine a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of a unit does not in any way constitute a limitation on the unit itself; for example, the first video frame acquisition module may also be described as "a module for acquiring a plurality of first video frames in a target virtual reality video to be detected".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method, including:
acquiring a plurality of first video frames in a target virtual reality video to be detected;
determining texture complexity corresponding to each first video frame, and determining a second video frame with complex texture from a plurality of first video frames based on the texture complexity;
performing single-binocular detection on the second video frame based on a first classification model, and determining a single-binocular detection result corresponding to the second video frame;
determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
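By way of non-limiting illustration, the overall flow of the above method may be sketched in Python roughly as follows. OpenCV is assumed for frame extraction; classify_mono_stereo and classify_projection_fov are hypothetical stand-ins for the first and second classification models, texture_complexity and to_monocular are sketched in later examples, and the frame count and threshold value are arbitrary illustrative choices, not parameters fixed by this disclosure.

import cv2
from collections import Counter

def detect_vr_format(video_path, num_frames=10, texture_threshold=20.0):
    # Illustrative sketch of the detection pipeline summarized above.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    results = []
    for idx in range(0, total, max(1, total // num_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()                      # a "first video frame"
        if not ok:
            continue
        if texture_complexity(frame) < texture_threshold:
            continue                                # keep only texture-rich "second video frames"
        mono_stereo = classify_mono_stereo(frame)   # first classification model (hypothetical helper)
        mono = to_monocular(frame, mono_stereo)     # crop/scale to a monocular image
        projection, fov = classify_projection_fov(mono)  # second classification model (hypothetical helper)
        results.append((mono_stereo, projection, fov))
    cap.release()
    # the format detection result supported by the most second video frames wins
    return Counter(results).most_common(1)[0][0] if results else None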
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example two], further comprising:
Optionally, the determining the texture complexity corresponding to each first video frame includes:
determining a gray image corresponding to the first video frame;
performing edge detection on the gray level image, and determining a horizontal gradient and a vertical gradient corresponding to each pixel point in the gray level image;
and determining texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point.
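One plausible realization of these steps, assuming Sobel filtering as the edge-detection operator (the disclosure does not fix a particular operator), is the following sketch:

import cv2

def frame_gradients(frame):
    # grayscale conversion followed by per-pixel horizontal and vertical gradients
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # vertical gradient
    return gx, gy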
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example three], further comprising:
optionally, the determining the texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point includes:
determining the gradient intensity corresponding to each pixel point in the gray level image based on the horizontal gradient and the vertical gradient corresponding to each pixel point;
and determining texture complexity corresponding to the first video frame based on the gradient strength corresponding to each pixel point.
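Continuing the previous sketch (reusing frame_gradients), one natural but not mandated choice is to take the per-pixel gradient magnitude as the gradient strength and its mean over the frame as the texture complexity:

import numpy as np

def texture_complexity(frame):
    # mean gradient magnitude as a scalar texture-complexity score (illustrative choice)
    gx, gy = frame_gradients(frame)
    strength = np.sqrt(gx ** 2 + gy ** 2)   # per-pixel gradient strength
    return float(strength.mean())

A frame whose score exceeds a chosen threshold would then be treated as a texture-rich second video frame.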
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example four], further comprising:
optionally, the first classification model includes: a first convolution sub-model, a first multi-headed attention sub-model, and a first classification sub-model;
the step of performing monocular and binocular detection on the second video frame, and determining a monocular and binocular detection result corresponding to the second video frame includes:
inputting the second video frame into the first convolution submodel for feature extraction to obtain local feature information corresponding to the second video frame;
inputting the local characteristic information into the first multi-head attention sub-model for attention processing to obtain global characteristic information corresponding to the second video frame;
and inputting the global characteristic information into the first classification sub-model to carry out monocular and binocular classification, and obtaining a monocular and binocular detection result corresponding to the second video frame.
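A minimal PyTorch sketch of such a convolution / multi-head-attention / classification pipeline is given below. The layer sizes and the three output classes (monocular, binocular left-right, binocular up-down) are assumptions made for illustration only, not parameters fixed by this disclosure.

import torch
import torch.nn as nn

class MonoStereoClassifier(nn.Module):
    def __init__(self, num_classes=3, dim=64, num_heads=4):
        super().__init__()
        # first convolution sub-model: extracts local feature information
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # first multi-head attention sub-model: aggregates global feature information
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        # first classification sub-model: monocular / binocular left-right / binocular up-down
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                           # x: (B, 3, H, W)
        feats = self.conv(x)                        # local features (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)   # token sequence (B, H'*W', dim)
        global_feats, _ = self.attn(tokens, tokens, tokens)
        return self.head(global_feats.mean(dim=1))  # logits over the three classes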
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example five], further comprising:
optionally, the first convolution sub-model includes a feature fusion layer and at least two serially connected convolution layers;
the step of inputting the second video frame into the first convolution submodel to perform feature extraction, and obtaining local feature information corresponding to the second video frame includes:
inputting the second video frame into a first convolution layer, and sequentially carrying out convolution processing based on the connection sequence of the convolution layers to obtain target characteristic information output by each convolution layer;
and inputting the target characteristic information output by each convolution layer into a characteristic fusion layer for fusion processing to obtain local characteristic information corresponding to the second video frame.
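Continuing the PyTorch sketch above (torch and torch.nn imported as before), the serially connected convolution layers with a fusion layer over their per-layer outputs could look as follows; the channel counts and the concatenation-plus-1x1-convolution fusion are illustrative choices only:

class ConvSubModel(nn.Module):
    # serially connected convolution layers whose individual outputs are fused
    def __init__(self, in_ch=3, dim=64, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch if i == 0 else dim, dim, 3, padding=1), nn.ReLU())
            for i in range(num_layers)
        ])
        # feature fusion layer: here a 1x1 convolution over the concatenated outputs
        self.fuse = nn.Conv2d(dim * num_layers, dim, kernel_size=1)

    def forward(self, x):
        outputs = []
        for layer in self.layers:                   # convolve in connection order
            x = layer(x)
            outputs.append(x)                       # target feature information of each layer
        return self.fuse(torch.cat(outputs, dim=1)) # fused local feature information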
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example six], further comprising:
optionally, the determining, based on the single-binocular detection result corresponding to the second video frame, a monocular image corresponding to the second video frame includes:
if the single-binocular detection result corresponding to the second video frame is of a binocular left-right type or a binocular up-down type, cropping and scaling the second video frame to determine a monocular image with a preset image size;
and if the single-binocular detection result corresponding to the second video frame is of a monocular type, determining the second video frame as a monocular image.
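For instance, the cropping-and-scaling step could proceed along the following lines; the 224x224 preset size and the string labels for the detection result are illustrative assumptions:

import cv2

def to_monocular(frame, mono_stereo, size=(224, 224)):
    # crop one eye's view out of a stereo frame, then scale to the preset image size
    h, w = frame.shape[:2]
    if mono_stereo == "binocular_left_right":
        frame = frame[:, : w // 2]          # keep the left half
    elif mono_stereo == "binocular_up_down":
        frame = frame[: h // 2, :]          # keep the top half
    # a monocular frame passes through unchanged apart from scaling
    return cv2.resize(frame, size)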
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection method [example seven], further comprising:
optionally, the determining the target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result, and the view angle range detection result corresponding to the second video frame includes:
determining the number of second video frames corresponding to each format detection result based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frames;
and determining the format detection result with the largest second video frame number as a target format detection result corresponding to the target virtual reality video.
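The per-frame results can then be tallied with a simple majority vote, for example:

from collections import Counter

# hypothetical per-frame results: (mono/stereo type, projection format, view angle range)
frame_results = [
    ("binocular_left_right", "equirectangular", "360"),
    ("binocular_left_right", "equirectangular", "360"),
    ("monocular", "equirectangular", "180"),
]
target_format, votes = Counter(frame_results).most_common(1)[0]
print(target_format, votes)   # the format backed by the largest number of second video frames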
According to one or more embodiments of the present disclosure, there is provided a virtual reality video format detection apparatus, including:
the first video frame acquisition module is used for acquiring a plurality of first video frames in the target virtual reality video to be detected;
a second video frame determining module, configured to determine a texture complexity corresponding to each first video frame, and determine a second video frame with a complex texture from a plurality of first video frames based on the texture complexity;
the single-binocular detection module is used for carrying out single-binocular detection on the second video frame based on the first classification model and determining a single-binocular detection result corresponding to the second video frame;
the monocular image determining module is used for determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
the projection view angle detection module is used for detecting the projection format and the view angle range of the monocular image based on a second classification model and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and the format detection result determining module is used for determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A method for detecting a virtual reality video format, comprising:
acquiring a plurality of first video frames in a target virtual reality video to be detected;
determining texture complexity corresponding to each first video frame, and determining a second video frame with complex texture from a plurality of first video frames based on the texture complexity;
performing single-binocular detection on the second video frame based on a first classification model, and determining a single-binocular detection result corresponding to the second video frame;
determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
detecting a projection format and a view angle range of the monocular image based on a second classification model, and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
2. The method for detecting a virtual reality video format according to claim 1, wherein determining a texture complexity corresponding to each first video frame comprises:
determining a gray image corresponding to the first video frame;
performing edge detection on the gray level image, and determining a horizontal gradient and a vertical gradient corresponding to each pixel point in the gray level image;
and determining texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point.
3. The method for detecting a virtual reality video format according to claim 2, wherein determining a texture complexity corresponding to the first video frame based on the horizontal gradient and the vertical gradient corresponding to each pixel point comprises:
determining the gradient intensity corresponding to each pixel point in the gray level image based on the horizontal gradient and the vertical gradient corresponding to each pixel point;
and determining texture complexity corresponding to the first video frame based on the gradient strength corresponding to each pixel point.
4. The method of claim 1, wherein the first classification model comprises: a first convolution sub-model, a first multi-headed attention sub-model, and a first classification sub-model;
the step of performing monocular and binocular detection on the second video frame based on the first classification model, and determining a monocular and binocular detection result corresponding to the second video frame includes:
inputting the second video frame into the first convolution submodel for feature extraction to obtain local feature information corresponding to the second video frame;
inputting the local characteristic information into the first multi-head attention sub-model for attention processing to obtain global characteristic information corresponding to the second video frame;
and inputting the global characteristic information into the first classification sub-model to carry out monocular and binocular classification, and obtaining a monocular and binocular detection result corresponding to the second video frame.
5. The method of claim 4, wherein the first convolution sub-model comprises a feature fusion layer and at least two serially connected convolution layers;
the step of inputting the second video frame into the first convolution submodel to perform feature extraction, and obtaining local feature information corresponding to the second video frame includes:
inputting the second video frame into a first convolution layer, and sequentially carrying out convolution processing based on the connection sequence of the convolution layers to obtain target characteristic information output by each convolution layer;
and inputting the target characteristic information output by each convolution layer into a characteristic fusion layer for fusion processing to obtain local characteristic information corresponding to the second video frame.
6. The method according to claim 1, wherein determining the monocular image corresponding to the second video frame based on the monocular detection result corresponding to the second video frame comprises:
if the single-binocular detection result corresponding to the second video frame is of a binocular left-right type or a binocular up-down type, cropping and scaling the second video frame to determine a monocular image with a preset image size;
and if the single-binocular detection result corresponding to the second video frame is of a monocular type, determining the second video frame as a monocular image.
7. The method according to claim 1, wherein determining the target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result, and the view angle range detection result corresponding to the second video frame comprises:
determining the number of second video frames corresponding to each format detection result based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frames;
and determining the format detection result with the largest second video frame number as a target format detection result corresponding to the target virtual reality video.
8. A virtual reality video format detection apparatus, comprising:
the first video frame acquisition module is used for acquiring a plurality of first video frames in the target virtual reality video to be detected;
a second video frame determining module, configured to determine a texture complexity corresponding to each first video frame, and determine a second video frame with a complex texture from a plurality of first video frames based on the texture complexity;
the single-binocular detection module is used for carrying out single-binocular detection on the second video frame based on the first classification model and determining a single-binocular detection result corresponding to the second video frame;
the monocular image determining module is used for determining a monocular image corresponding to the second video frame based on a monocular detection result corresponding to the second video frame;
the projection view angle detection module is used for detecting the projection format and the view angle range of the monocular image based on a second classification model and determining a projection format detection result and a view angle range detection result corresponding to the second video frame;
and the format detection result determining module is used for determining a target format detection result corresponding to the target virtual reality video based on the single-binocular detection result, the projection format detection result and the view angle range detection result corresponding to the second video frame.
9. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the virtual reality video format detection method of any of claims 1-7.
10. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the virtual reality video format detection method of any of claims 1-7.
CN202410047365.3A 2024-01-11 2024-01-11 Virtual reality video format detection method, device, equipment and storage medium Pending CN117880478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410047365.3A CN117880478A (en) 2024-01-11 2024-01-11 Virtual reality video format detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410047365.3A CN117880478A (en) 2024-01-11 2024-01-11 Virtual reality video format detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117880478A true CN117880478A (en) 2024-04-12

Family

ID=90577090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410047365.3A Pending CN117880478A (en) 2024-01-11 2024-01-11 Virtual reality video format detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117880478A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination