WO2018171234A1 - Video processing method and apparatus - Google Patents

Video processing method and apparatus

Info

Publication number
WO2018171234A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target object
identification information
information
coordinate system
Prior art date
Application number
PCT/CN2017/112342
Other languages
French (fr)
Chinese (zh)
Inventor
徐异凌
张文军
黄巍
胡颖
马展
吴钊
李明
吴平
Original Assignee
上海交通大学
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 上海交通大学 and 中兴通讯股份有限公司
Publication of WO2018171234A1

Definitions

  • the present disclosure relates to the field of communications, and in particular, to a video processing method and apparatus.
  • the core of these application scenarios is often the research and processing of the Region of Interest (ROI).
  • ROI (Region of Interest): the user's area of interest, that is, the video area on which the user's line of sight mainly concentrates while watching the video media.
  • in the related art, the user must locate the video area of interest by inspecting the video content of a large number of received videos, which consumes a large amount of resources and time, and there is no reasonable solution at present.
  • An embodiment of the present disclosure provides a video processing method and apparatus, so as to at least solve the related-art problem that a user must locate a video area of interest by inspecting the video content of a large number of received videos, which consumes a large amount of resources and time.
  • a method for processing a video includes: marking a target object in a video, and generating identification information of the target object according to the marking result, wherein the identification information is set to indicate at least one of the following: a type of the target object, a content of the target object, and spatial location information of the target object in the video; acquiring instruction information, and indexing the specified identification information of the specified target object according to the instruction information; and pushing or displaying the part or all of the video corresponding to the specified identification information.
  • the identification information is set to indicate at least one of: a mark type of the identification information, a mark content type of the identification information, a mark content of the identification information, length information of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in that part or all of the video, time information corresponding to that part or all of the video, and spatial location information of that part or all of the video in the video.
  • the spatial location information of the part or all of the video in the video includes at least one of: a center point coordinate of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video.
  • the coordinate system in which the coordinates are located includes one of the following: a two-dimensional space coordinate system and a three-dimensional space coordinate system.
  • in a two-dimensional space coordinate system, the value of the coordinate includes at least one of the following: a value in a two-dimensional rectangular coordinate system and a value in a two-dimensional spherical coordinate system; in a three-dimensional space coordinate system, the value of the coordinate includes at least one of the following: a value in a three-dimensional rectangular coordinate system and a value in a three-dimensional spherical coordinate system.
  • the target object in the video is marked, and the identification information of the target object is generated according to the marking result, including: marking the target object in the video during video capture or editing, and generating the identification information of the target object according to the marking result; and/or marking the target object in the captured or edited video data, and generating the identification information of the target object according to the marking result.
  • acquiring the instruction information for indicating the at least one specified target object includes: acquiring first instruction information preset by the user; and/or acquiring second instruction information obtained after analyzing the user's video viewing behavior.
  • a video processing apparatus includes: a marking module configured to mark a target object in a video; a generating module configured to generate identification information of the target object according to the marking result, wherein the identification information is set to indicate at least one of: a type of the target object, a content of the target object, and spatial location information of the target object in the video; an obtaining module configured to acquire instruction information; an indexing module configured to index the specified identification information of the specified target object according to the instruction information; and a processing module configured to push or display the part or all of the video corresponding to the specified identification information.
  • the identification information is set to indicate at least one of: a mark type of the identification information, a mark content type of the identification information, length information of the identification information, a mark content of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in that part or all of the video, time information corresponding to that part or all of the video, and spatial location information of that part or all of the video in the video.
  • the spatial location information of the part or all of the video in the video includes at least one of: a center point coordinate of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video.
  • the coordinate system in which the coordinates are located includes one of the following: a two-dimensional space coordinate system and a three-dimensional space coordinate system.
  • in a two-dimensional space coordinate system, the value of the coordinate includes at least one of the following: a value in a two-dimensional rectangular coordinate system and a value in a two-dimensional spherical coordinate system; in a three-dimensional space coordinate system, a value in a three-dimensional rectangular coordinate system and a value in a three-dimensional spherical coordinate system.
  • the marking module includes: a first marking unit configured to mark a target object in the video during video capture or editing; and a second marking unit configured to mark the target object in the captured or edited video data.
  • the obtaining module includes: a first acquiring unit configured to acquire first instruction information preset by the user; and a second acquiring unit configured to acquire second instruction information obtained after analyzing the video viewing behavior of the user.
  • a storage medium including a stored program, wherein when the program runs, the video processing method in the above embodiment is performed.
  • a processor configured to execute a program, wherein when the program is executed, the video processing method in the above embodiment is performed.
  • the target object in the video is marked, and the identification information of the target object is generated according to the marking result, the identification information including at least the spatial position information of the target object in the video; instruction information indicating the specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and the part or all of the video corresponding to the specified identification information is pushed or displayed according to the spatial location information in the identification information, where the part or all of the video is contained in the entire video.
  • the above method solves the related-art problem that the user must detect the video area of interest by inspecting the video content of a large number of received videos, which consumes a large amount of resources and time; by indexing the identification information already present in the video, the user can quickly obtain the video of interest, which greatly saves resources and time in the video retrieval process.
  • FIG. 1 is a schematic diagram of an application environment of an optional video processing method according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method of processing an optional video according to an embodiment of the present disclosure
  • FIG. 3 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram showing the structure of an optional video processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of content of an optional identification information in an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of an optional video positioning method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of an optional video retrieval method according to an embodiment of the present disclosure.
  • the video processing method may be, but is not limited to, applied to the application environment shown in FIG. 1.
  • the terminal 102 is connected to the server 106, wherein the server 106 may push the video file to the terminal 102.
  • An application client 104 that can receive and display video images is run on the terminal 102.
  • the server 106 marks the target object in the video image and generates identification information of the target object according to the marking result, wherein the identification information is set to indicate at least one of: a type of the target object, a content of the target object, and spatial location information of the target object in the video image. The server 106 acquires instruction information, wherein the instruction information is set to indicate at least one specified target object, and obtains the specified identification information of the specified target object according to the instruction information. The server 106 then pushes the video corresponding to the specified identification information, where the video includes part or all of the videos. It should be noted that each of the foregoing steps performed by the server 106 can also be performed at the terminal 102; this embodiment of the disclosure does not limit this.
  • Embodiments of the present disclosure also provide a method of processing a video.
  • FIG. 2 is a flow chart of an optional video processing method according to an embodiment of the present disclosure. As shown in FIG. 2, an optional flow of the video processing method includes:
  • Step S202: marking the target object in the video, and generating identification information of the target object according to the marking result, wherein the identification information is set to indicate at least one of: a type of the target object, a content of the target object, and spatial location information of the target object in the video;
  • Step S204: acquiring instruction information, and indexing the specified identification information of the specified target object according to the instruction information;
  • Step S206: pushing or displaying the part or all of the video corresponding to the specified identification information.
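Steps S202 to S206 can be sketched as a minimal in-memory pipeline. All names here (`Identification`, `mark_targets`, and so on) are illustrative assumptions rather than identifiers from the disclosure, and the "instruction information" is reduced to a requested object type for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical region encoding: (center_x, center_y, width, height), normalized to the frame.
Region = Tuple[float, float, float, float]

@dataclass
class Identification:
    object_type: str   # type of the target object
    content: str       # content of the target object
    region: Region     # spatial location of the target object in the video

def mark_targets(video_id: str, marks: List[Identification]) -> Dict[str, List[Identification]]:
    """S202: store the identification information generated from the marking result."""
    return {video_id: list(marks)}

def index_identifications(store: Dict[str, List[Identification]],
                          video_id: str, wanted_type: str) -> List[Identification]:
    """S204: index the specified identification information according to the instruction."""
    return [m for m in store[video_id] if m.object_type == wanted_type]

def regions_to_push(matches: List[Identification]) -> List[Region]:
    """S206: the spatial regions whose video would be pushed or displayed."""
    return [m.region for m in matches]
```

The point of the sketch is that S204/S206 operate only on the stored identification records, never on the video pixels themselves, which is where the claimed resource saving comes from.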
  • the target object in the video is marked, and the identification information of the target object is generated according to the marking result, the identification information including at least the spatial location information of the target object in the video; instruction information indicating the specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and the part or all of the video corresponding to the specified identification information is pushed or displayed according to the spatial location information in the identification information, where the part or all of the video is contained in the entire video.
  • the above method solves the related-art problem that the user must detect the video area of interest by inspecting the video content of a large number of received videos, which consumes a large amount of resources and time; by indexing the identification information already present in the video, the user can quickly obtain the video of interest, which greatly saves resources and time in the video retrieval process.
  • the identification information is set to indicate at least one of: a tag type of the identification information, a tag content type of the identification information, a tag content of the identification information, length information of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in that part or all of the video, time information corresponding to that part or all of the video, and spatial location information of that part or all of the video in the video.
  • the spatial location information of the part or all of the video in the video includes at least one of the following: the center point coordinates of the part or all of the video, the width of the part or all of the video, and the height of the part or all of the video.
  • the coordinate system in which the coordinates are located includes one of the following: a two-dimensional space coordinate system, and a three-dimensional space coordinate system.
  • the value of the coordinate includes at least one of the following: a value of the two-dimensional rectangular coordinate system, and a value of the two-dimensional spherical coordinate system.
  • the value of the two-dimensional Cartesian coordinate system here can be expressed as (x, y), and the value of the two-dimensional spherical coordinate system can be expressed as (pitch angle coordinate value, yaw angle coordinate value).
  • the value of the coordinates is at least one of the following: the value of the three-dimensional space rectangular coordinate system, and the value of the three-dimensional spherical coordinate system.
  • the value of the three-dimensional rectangular coordinate system can be expressed as (x, y, z)
  • the value of the three-dimensional spherical coordinate system can be expressed as (pitch angle coordinate value, yaw angle coordinate value, roll angle).
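For panoramic video, a two-dimensional spherical coordinate of the form (pitch angle, yaw angle) is commonly mapped onto an equirectangular frame. The helper below is one hedged way to do that mapping; it assumes yaw in [-180, 180) and pitch in [-90, 90] degrees, matching the angle ranges used later in this disclosure, and is not a formula stated in the disclosure itself.

```python
def sphere_to_equirect(yaw_deg: float, pitch_deg: float,
                       frame_w: int, frame_h: int) -> tuple:
    """Map a 2-D spherical coordinate (yaw, pitch), in degrees, to a pixel
    position in an equirectangular panoramic frame.

    Assumes yaw in [-180, 180) maps left-to-right across the frame and
    pitch in [-90, 90] maps bottom-to-top (so +90 pitch is row 0)."""
    x = (yaw_deg + 180.0) / 360.0 * frame_w
    y = (90.0 - pitch_deg) / 180.0 * frame_h
    return (x, y)
```

For example, the sphere center (yaw 0, pitch 0) lands at the middle of the frame, which is the usual convention for equirectangular projections.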
  • the target object in the video is marked, and the identification information of the target object is generated according to the marking result, including: marking the target object in the video during video capture or editing. And generating identification information of the target object according to the marking result; and/or marking the target object in the video in the captured or edited video data, and generating identification information of the target object according to the marking result.
  • acquiring the instruction information set to indicate the at least one specified target object includes: acquiring first instruction information preset by the user; and/or acquiring second instruction information obtained after analyzing the user's video viewing behavior.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 3 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure. As shown in Figure 3, the device comprises:
  • a marking module 302 configured to mark a target object in the video
  • the generating module 304 is configured to generate identification information of the target object according to the marking result, wherein the identification information is at least set to indicate one of: a type of the target object, a content of the target object, and spatial location information of the target object in the video;
  • the obtaining module 306 is configured to acquire instruction information, where the instruction information is set to indicate at least one specified target object;
  • the indexing module 308 is configured to index the specified identification information of the specified target object according to the instruction information;
  • the processing module 310 is configured to push or display part or all of the video corresponding to the specified identification information in the video.
  • In the above apparatus, the marking module marks the target object in the video; the generating module generates the identification information of the target object according to the marking result, where the identification information includes at least the spatial location information of the target object in the video; the obtaining module acquires instruction information indicating the specified target object; the indexing module indexes the specified identification information of the specified target object according to the instruction information; and the processing module pushes or displays the part or all of the video corresponding to the specified identification information according to the spatial location information in the identification information.
  • the above apparatus solves the related-art problem that the user must detect the video area of interest by inspecting the video content of a large number of received videos, which consumes a large amount of resources and time; by indexing the identification information already present in the video, the user can quickly obtain the video push of interest, which greatly saves resources and time in the video retrieval process.
  • the identification information is set to indicate at least one of: a tag type of the identification information, a tag content type of the identification information, length information of the identification information, a tag content of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in that part or all of the video, time information corresponding to that part or all of the video, and spatial location information of that part or all of the video in the video.
  • the spatial location information of the part or all of the video in the video includes at least one of the following: the center point coordinates of the part or all of the video, the width of the part or all of the video, and the height of the part or all of the video.
  • the coordinate system in which the coordinates are located includes one of the following: a two-dimensional space coordinate system, and a three-dimensional space coordinate system.
  • the value of the coordinate includes at least one of the following: a value of the two-dimensional rectangular coordinate system, and a value of the two-dimensional spherical coordinate system.
  • the value of the two-dimensional Cartesian coordinate system here can be expressed as (x, y), and the value of the two-dimensional spherical coordinate system can be expressed as (pitch angle coordinate value, yaw angle coordinate value).
  • the value of the coordinates is at least one of the following: the value of the three-dimensional space rectangular coordinate system, and the value of the three-dimensional spherical coordinate system.
  • the value of the three-dimensional rectangular coordinate system can be expressed as (x, y, z)
  • the value of the three-dimensional spherical coordinate system can be expressed as (pitch angle coordinate value, yaw angle coordinate value, roll angle).
  • the embodiment of the present disclosure also provides an optional video processing device.
  • FIG. 4 is a block diagram showing the structure of an optional video processing apparatus according to an embodiment of the present disclosure.
  • the marking module 302 includes: a first marking unit 3020 configured to mark a target object in the video during video capture or editing; and a second marking unit 3022 configured to mark the target object in the captured or edited video data.
  • the obtaining module 306 includes: a first obtaining unit 3060 configured to acquire first instruction information preset by the user; and a second obtaining unit 3062 configured to acquire second instruction information obtained after analyzing the video viewing behavior of the user.
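The disclosure does not specify how the second obtaining unit analyzes viewing behavior. One plausible sketch is to rank labeled content by accumulated viewing time and treat the top entries as the "second instruction information"; the function name and event format below are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple

def derive_second_instruction(view_events: List[Tuple[str, float]],
                              top_n: int = 1) -> List[str]:
    """Derive hypothetical 'second instruction information' from viewing behavior.

    view_events: (content_label, seconds_watched) pairs collected while the
    user watches. Returns the top_n content labels by total dwell time."""
    dwell = Counter()
    for label, seconds in view_events:
        dwell[label] += seconds
    return [label for label, _ in dwell.most_common(top_n)]
```

A real system would presumably combine this with the first (preset) instruction information, but that policy is left open by the disclosure.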
  • the foregoing apparatus may be applied to any hardware device having the foregoing functional module in the server or the terminal, which is not limited in the embodiment of the present disclosure.
  • FIG. 5 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5, the device includes:
  • the memory 52 is configured to store instructions executable by the processor 50; the processor 50 is configured, based on the instructions stored in the memory 52, to mark a target object in the video and generate identification information of the target object according to the marking result, wherein the identification information is set to indicate at least one of: a type of the target object, a content of the target object, and spatial location information of the target object in the video; to acquire instruction information and index the specified identification information of the specified target object according to the instruction information; and to push or display the part or all of the video corresponding to the specified identification information.
  • the processor 50 described above may also perform an implementation of any of the above-described video processing methods.
  • the processor marks the target object in the video and generates identification information of the target object according to the marking result, where the identification information includes at least the spatial location information of the target object in the video; instruction information indicating the specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and the part or all of the video corresponding to the specified identification information is pushed according to the spatial location information in the identification information, the part or all of the video being contained in the entire video.
  • this solves the related-art problem that the user must detect the video content of interest among a large number of received videos, which consumes a large amount of resources and time; the user can quickly obtain the video push of interest by indexing the information already present in the video, which greatly saves resources and time in the video retrieval process.
  • Embodiments of the present disclosure also provide a storage medium including a stored program, wherein when the program is executed, the video processing method in the above-described embodiments and their alternative examples is performed.
  • this embodiment introduces the technical solutions of the embodiments of the present disclosure based on an exemplary application scenario.
  • An embodiment of the present disclosure provides an identification information marking method based on video content and its spatial location, which can attach corresponding identification information to specific content or to a video area at a specific spatial location in the video medium, so that content of interest to the user can be associated, through the identification information provided in the embodiments of the present disclosure, with the corresponding video area.
  • the video area herein can be understood as a video image of a certain range around the target object to which the information is associated.
  • the size or shape of the area can be customized, which is not limited in this embodiment.
  • Video positioning: according to information acquired in advance about the user's habits, preferences, and so on, the identification information based on specific video content and spatial location is used to directly locate the matching video area, and that video area is pushed to the user.
  • the video positioning application of the present disclosure can realize panoramic-video applications such as selecting an initial viewing angle, or give priority to areas of interest to the user.
  • Video retrieval: directly retrieving the video content required by the user from a large number of videos. For example, in a video surveillance application scenario, an area of interest to the user needs to be processed quickly and centrally.
  • the present disclosure provides a marking method based on identification information of video content and its spatial location; by retrieving this identification information, the corresponding video area can be located quickly, which greatly improves the efficiency of video retrieval.
  • the embodiments of the present disclosure adopt the following technical solutions. It should be noted that the video content attachment tag or label mentioned in the embodiments of the present disclosure may be understood as the identification information based on the video content and its spatial location.
  • An object of the present disclosure is to provide an identification information marking method based on video content and its spatial position. Exemplarily, in the video picture finally presented to the user, a video area containing specific content or located at a specific spatial position is uniquely associated with specific video tag information.
  • the identification information based on the video content and its spatial location to be added may be varied.
  • the following set of information may be used as an example:
  • Information 3: set to indicate the label content of the video content attachment tag of the area;
  • Information 5: set to indicate the spatial location of the area in the overall video.
  • the disclosure attaches identification information to specific content or a specific spatial location of the video medium, and the identification information indicates the content category, content information, content quality, and content location of that portion of the video.
  • the video tag information provided by the present disclosure may, in one embodiment, be set for processing and presentation by a client application or service.
  • the server can analyze the video content through image processing, pattern recognition, and the like during the video capture and acquisition phase, and mark specific content or specific spatial locations of the video media based on the results of the analysis.
  • the server may mark specific content or specific spatial locations of the video media during the video editing process.
  • the server tags the specific content or specific spatial location of the video media in the captured or edited video data.
  • the server may place the tagged specific content or specific spatial location information in a reserved field in the video stream or codestream.
  • the server separately creates the tag data associated with the corresponding video data.
  • the client used by the user can separately create the tag data of the corresponding video according to the user's usage habits and feed it back to the server.
  • After receiving the video media, the user can learn the specific content and the spatial location of the video from the identification information, and then perform further application processing.
  • the server may first obtain the video area that matches the user information by matching preset user information against the identification information marked in the video, and then push the matched area according to the user's preferences or settings.
  • the server dynamically matches the identification information according to the video viewing requirement of the user for the specific content, and pushes the corresponding video area to the user.
  • the server pushes the complete video to the user, and the terminal obtains the video area that matches the user information according to the preset user information and the identification information marked in the video, and performs matching display according to the user's preferences or settings.
  • the server pushes the complete video to the user, and the terminal dynamically matches the identification information according to the user's viewing requirement for specific content and displays the corresponding video area to the user.
  • the user information herein may include, but is not limited to, at least one of: the user's viewing habits, the user's preference for particular content, the user's preferences, and the user's specific usage.
  • the identification information herein may be set to indicate, but is not limited to, at least one of the following: the label type of the area's video content attachment label, the label content type of the area's video content attachment label, the label content information of the area's video content attachment label, the quality level of the video content in the area, and the spatial location of the area in the overall video.
  • the server can directly locate the video area that matches the user information and push it to the user, and the terminal can directly locate the video area that matches the user information and display it to the user.
  • the user information here may be obtained in advance before the video is pushed, or may be obtained by collecting the user's feedback while the user watches the video, which is not limited in this embodiment. If the user information is collected in advance, the matched video area can be pushed to the user at the initial stage of viewing. If the user information is collected while the user watches the video, it can be analyzed and matched against the identification information in the video, and the matched video area is then pushed to the user during subsequent viewing.
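The matching step described above can be sketched as a simple filter: preset user preferences (the first instruction information) are compared against the tag content of each labeled area, and only the matching areas are selected for push or display. The dict keys used here are illustrative, not field names defined by the disclosure.

```python
from typing import Dict, List, Any

def match_video_areas(labels: List[Dict[str, Any]], preferences: List[str]) -> List[Any]:
    """Return the regions of labeled video areas whose tag content matches
    any preset user preference; unmatched labels are ignored.

    labels: list of dicts with at least 'content' (tag content) and
    'region' (spatial location of the area) keys."""
    preferred = set(preferences)
    return [lab["region"] for lab in labels if lab["content"] in preferred]
```

The same filter works for both deployment variants in the text: run on the server (push only matched areas) or on the terminal (receive the full video, display only matched areas).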
  • the above marking process can be implemented by adding new identification information to the video-media-related information. This can be implemented in various ways, for example by the following set of fields:
  • Quality_level indicates the quality level of the video content in the area
  • Label_center_yaw indicates the yaw coordinate value of the center point of the label area
  • Label_center_pitch indicates the pitch coordinate value of the center point of the label area
  • Label_width indicates the width of the label area
  • Label_height indicates the height of the label area
  • Label_type indicates the label type of the attached label of the area's video content
  • Label_info_type indicates the label content type of the attached label of the area's video content
  • Label_info_content_length indicates the content length of the attached label of the area's video content
  • Content_byte indicates the specific byte information of the label content of the attached label of the area's video content
  • the identification information based on the video content and its spatial position (that is, quality_level, label_center_yaw, label_center_pitch, label_width, label_height, label_type, label_info_type, label_info_content_length, and content_byte) is added as appropriate to identify a video area of specific content at a specific spatial position.
  • Label_number Indicates the number of labels included in this video area.
  • Quality_level Indicates the quality level of video content in this area. The higher the value, the higher the video quality.
  • Label_center_yaw Indicates the yaw coordinate value of the center point of the label area, in units of 0.01 degrees, ranging from [-18000, 18000).
  • Label_center_pitch Indicates the pitch coordinate value of the center point of the label area, in units of 0.01 degrees, in the range of [-9000, 9000].
  • Label_width Indicates the width of the label area, in units of 0.01 degrees.
  • Label_height Indicates the height of the label area, in units of 0.01 degrees.
  • Label_type indicates the label type of the attached label of the area's video content. The value and meaning of the label type are shown in Table 1.

  Table 1
  0: the attached label of the video content is a face
  1: the attached label of the video content is a license plate
  2: the attached label of the video content is a general moving target
  3: the attached label of the video content is a general static target
  4: the attached label of the video content is a product
  5: the attached label of the video content is a plant
  6-254: this part is reserved
  255: the attached label of the video content is a user-defined label
  • Label_info_type indicates the label content type of the attached label of the area's video content. The value and meaning of the label content type are shown in Table 2.

  Table 2
  0: the label content is text
  1: the label content is a URL
  2-255: this part is reserved
  • Label_info_content_length indicates the length of the label content of the attached label of the area's video content.
  • Content_byte indicates the specific byte information of the label content of the attached label of the area's video content.
  • the label group LabelBox corresponding to a video area includes label_number label information entries (LabelInfoBox) and the label area information (LabelRegionBox).
  • a label information entry LabelInfoBox contains a label type label_type, a label content type label_info_type, a label content length label_info_content_length, and label_info_content_length bytes of content information content_byte.
  • the label area information LabelRegionBox contains a quality level quality_level and spatial position information: the label area center point (label_center_yaw, label_center_pitch), the label area width label_width, and the label area height label_height.
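The box structure just described can be sketched as plain data classes. The field names follow the text above, but the concrete in-memory layout is an illustrative assumption, not the normative file-format syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LabelInfoBox:
    label_type: int        # label type, per Table 1
    label_info_type: int   # label content type, per Table 2
    content_byte: bytes    # the label content itself

    @property
    def label_info_content_length(self) -> int:
        # length of the label content in bytes
        return len(self.content_byte)

@dataclass
class LabelRegionBox:
    quality_level: int
    label_center_yaw: int    # units of 0.01 degree, range [-18000, 18000)
    label_center_pitch: int  # units of 0.01 degree, range [-9000, 9000]
    label_width: int         # units of 0.01 degree
    label_height: int        # units of 0.01 degree

    def __post_init__(self):
        if not (-18000 <= self.label_center_yaw < 18000):
            raise ValueError("yaw out of range")
        if not (-9000 <= self.label_center_pitch <= 9000):
            raise ValueError("pitch out of range")

@dataclass
class LabelBox:
    labels: List[LabelInfoBox] = field(default_factory=list)
    region: Optional[LabelRegionBox] = None

    @property
    def label_number(self) -> int:
        return len(self.labels)

box = LabelBox(
    labels=[LabelInfoBox(label_type=0, label_info_type=0, content_byte=b"some text")],
    region=LabelRegionBox(quality_level=3, label_center_yaw=1500,
                          label_center_pitch=-200, label_width=3000, label_height=2000),
)
print(box.label_number, box.labels[0].label_info_content_length)  # 1 9
```

The range checks in `__post_init__` mirror the stated coordinate ranges; yaw and pitch values are integers in 0.01-degree units (e.g. 1500 is 15 degrees).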
  • FIG. 6 is a schematic diagram of content of an optional identification information in an embodiment of the present disclosure.
  • Embodiment 1 Video Positioning Application
  • the panoramic video covers a 180-degree or 360-degree viewing angle range, but the human viewing angle is limited, so the entire panoramic video content cannot be viewed at the same time; only a part of the panoramic video is viewed. The user can therefore view different area videos in the panorama in different browsing orders. It is worth noting that the areas a user views in a panoramic video are not completely random: the user switches video areas according to personal preference.
  • the disclosure provides a label associated with a video, which is set to indicate the specific content and specific spatial location information of a partial video area; the corresponding video area can then be located directly according to user preferences, and that part of the video presented to the user. The following is illustrated with several examples.
  • the information of the corresponding video area is marked in the recorded panoramic video content, and during viewing, the video area containing the label is preferentially pushed to the user according to the user's preference for the label type.
  • FIG. 7 is a schematic diagram of an optional video positioning method according to an embodiment of the present disclosure.
  • the tag may indicate that the corresponding area contains a face, a plant, or similar content. If the user likes to pay attention to the plants in the video, the server can, by locating the plant tag and using the corresponding spatial location information and rotation information when the user views the panoramic video, preferentially push the video content of that region to the user.
  • the user's viewing area at any moment is limited, and the entire panoramic video will not be viewed. Therefore, under limited bandwidth, the user's region of interest can be encoded at high quality and the user's non-interest regions at low quality.
  • the area containing the face in which the user is interested is encoded in a high-quality mode, and the other parts are encoded at low quality.
  • the user can set a plurality of tags of interest, and the optimal area video is pushed to the user according to various possible combinations of the tags.
  • the user is interested in a certain person and a certain car, and sets the two types as the tags of interest.
  • the video area containing both the person tag and the car tag is preferentially displayed; when no area contains both at the same time, an area containing either tag can be displayed.
  • the tag type added in the panoramic video can be preset, or it can be a label customized by the user according to his own needs.
  • the user can combine the related tags to define the type of combination that they need.
  • the user sets a custom label for an item in the video, feeds the information of the label to the server, and the server subsequently pushes the relevant video area to the user according to the set label.
  • the content form carried by different tags in the panoramic video may differ, as may the content itself. The tag content may be text, such as a person tag whose text content describes the person's name and resume.
  • the tag content can be a number, such as a product tag, and a digital content describing the price information.
  • the tag content can be a link, such as a plant tag, and the link content gives the URL address of the plant in detail.
  • a tag in the same video area can be associated with multiple types of content information.
  • text information such as the product name, and digital information such as the product price or production date, can be added to a goods tag.
  • the link information for the purchase path can be added.
  • a label set for a panoramic video area can be nested to contain multiple sub-tags.
  • multiple person sub-tags can be nested under the same sports tag for the user to view.
  • Embodiment 2 Virtual Reality Video Application
  • the video area viewed by the user is not a complete virtual reality video area, so the video content of interest can be pushed for the user by adding different labels.
  • a tag is added to multiple view videos, and the user sets a tag for the region of interest, and the best view video can be selected and pushed to the user according to the tag of the user's interest.
  • Embodiment 4 Video Retrieval Application
  • the acquired surveillance video is usually used to track target vehicles, target people, and so on, but because these tracking tasks often require analyzing and processing a large number of surveillance videos through image processing and other technologies in a short time, they bring a heavy workload to video surveillance applications.
  • if video areas of specific content, such as a face or a license plate, are marked while the surveillance video is being shot, the labels in the video can be retrieved directly after the surveillance video is received, greatly reducing the workload of video retrieval.
  • the following is exemplified by several examples.
  • FIG. 8 is a schematic diagram of an optional video retrieval method according to an embodiment of the present disclosure.
  • the specific content in the video is tagged during surveillance video capture, and the user can directly retrieve the tags after receiving the surveillance video. For example, all license plate tags can be retrieved, and from the information associated with the tags, the video information of all license plates contained in the video and the license plate number information are finally obtained.
  • tags can be set for video retrieval, and all relevant video regions are searched based on various combinations of these tags.
  • for example, a combined search on the tags of a certain person and a certain car finally obtains the video information containing both tags.
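The combined search in this example can be sketched as a simple filter over the labels already carried by each video region. The data layout below is hypothetical and stands in for the identification information defined earlier.

```python
def search_by_tags(regions, required_tags):
    """Return the ids of regions whose label set contains every required tag."""
    return [rid for rid, tags in regions if required_tags <= tags]

regions = [
    ("r1", {"person"}),
    ("r2", {"person", "car"}),
    ("r3", {"car", "license_plate"}),
]
# Combined search for a person together with a car.
print(search_by_tags(regions, {"person", "car"}))  # ['r2']
```

Because the tags are stored with the video, this lookup replaces frame-by-frame content recognition over the whole surveillance archive.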
  • video retrieval directly retrieves the video content required by the user from a large number of videos. The present disclosure provides an identification information marking method based on video content and its spatial location, so the corresponding video region can be quickly located by retrieving the identification information, greatly improving the efficiency of video retrieval.
  • Embodiments of the present disclosure also provide a storage medium.
  • the foregoing storage medium may be configured to save the program code executed by the video processing method provided in the first embodiment.
  • the foregoing storage medium may be located in any one of the computer terminal groups in the computer network, or in any one of the mobile terminal groups.
  • the storage medium is arranged to store program code arranged to perform the following steps:
  • the identification information is at least set to indicate one of the following: the type of the target object, the content of the target object, and the spatial location information of the target object in the video;
  • the disclosed technical contents may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present disclosure.
  • each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. The software product includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present disclosure.
  • the foregoing storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or the like.
  • in the video processing method provided by an embodiment of the present disclosure, a target object in a video is marked, identification information of the target object is generated according to the marking result, instruction information set to indicate a specified target object is acquired, the specified identification information of the specified target object is indexed according to the instruction information, and part or all of the video corresponding to the specified identification information is then pushed or displayed according to the spatial location information in the identification information. This solves the problem in the related art that the user needs to detect the video area of interest by identifying the video content in a large number of received videos, which consumes a large amount of resources and time; the user can quickly acquire the video of interest by indexing the identification information already present in the video, greatly saving resources and time in the video retrieval process.

Abstract

A video processing method and apparatus, comprising: labeling a target object within a video, and thereby generating label information of the target object according to a labeling result, wherein the label information is at least configured to indicate one of the following: the type of the target object, the content of the target object, and the spatial position information of the target object in the video (S202); acquiring instruction information, and indexing specified label information of a specified target object according to the instruction information (S204); and pushing or displaying part or all of the video corresponding to said specified label information within said video (S206).

Description

Video processing method and device

Technical field

The present disclosure relates to the field of communications, and in particular, to a video processing method and apparatus.

Background art

With the rapid development of digital media technology, the application scenarios in streaming media consumption are becoming more intelligent, personalized, and diversified. The core of these application scenarios often lies in the research and processing of the user's region of interest (Region of Interest, ROI). The region of interest is the video area on which the user's line of sight is mainly concentrated while watching video media.

In the related art, research on the ROI focuses on recognizing and retrieving video content after the user receives the video. For example, in a video surveillance application, if a user needs to find specific video content, the video content required by the user can be found through region-of-interest detection technology after the user receives the surveillance video; this requires detecting a large number of surveillance videos in a short time to identify the video images corresponding to the user's regions of interest. In addition, in the field of panoramic video, if the user is interested in the video content of certain regions, the required video regions likewise need to be found through region-of-interest detection after the user receives the panoramic video. In the above retrieval process, the user needs to detect the video areas of interest by recognizing the video content in a large number of received videos, which consumes a large amount of resources and time.

For this problem in the related art, that the user needs to detect the video areas of interest by identifying the video content in a large number of received videos, which consumes a large amount of resources and time, no reasonable solution currently exists.

Summary of the invention

The embodiments of the present disclosure provide a video processing method and apparatus, so as to at least solve the problem in the related art that a user needs to detect the video areas of interest by identifying the video content in a large number of received videos, which consumes a large amount of resources and time.

According to one aspect of the present disclosure, a video processing method is provided, including: marking a target object in a video, and generating identification information of the target object according to the marking result, wherein the identification information is at least set to indicate one of the following: the type of the target object, the content of the target object, and the spatial location information of the target object in the video; acquiring instruction information, and indexing the specified identification information of a specified target object according to the instruction information; and pushing or displaying part or all of the video corresponding to the specified identification information in the video.
Preferably, the identification information is at least set to indicate one of the following: the mark type of the identification information, the mark content type of the identification information, the mark content of the identification information, the length information of the identification information, the quality level of the part or all of the video in which the target object is located, the quantity of identification information contained in the part or all of the video in which the target object is located, the time information corresponding to the part or all of the video in which the target object is located, and the spatial location information, in the video, of the part or all of the video in which the target object is located.

Preferably, the spatial location information of the part or all of the video in the video includes at least one of the following: the center point coordinates of the part or all of the video, the width of the part or all of the video, and the height of the part or all of the video, wherein the coordinate system of the coordinates is one of the following: a two-dimensional space coordinate system or a three-dimensional space coordinate system.

Preferably, in a two-dimensional space coordinate system, the coordinate values include at least one of the following: two-dimensional rectangular coordinate values and two-dimensional spherical coordinate values; in a three-dimensional space coordinate system, the coordinate values are at least one of the following: three-dimensional rectangular coordinate values and three-dimensional spherical coordinate values.

Preferably, marking the target object in the video and generating the identification information of the target object according to the marking result includes: during video capture or editing, marking the target object in the video and generating the identification information of the target object according to the marking result; and/or, in captured or edited video data, marking the target object in the video and generating the identification information of the target object according to the marking result.

Preferably, acquiring the instruction information for indicating at least one specified target object includes: acquiring first instruction information preset by the user; and/or acquiring second instruction information obtained after analyzing the user's video viewing behavior.
According to another aspect of the present disclosure, a video processing apparatus is further provided, including: a marking module configured to mark a target object in a video; a generating module configured to generate identification information of the target object according to the marking result, wherein the identification information is at least set to indicate one of the following: the type of the target object, the content of the target object, and the spatial location information of the target object in the video; an acquiring module configured to acquire instruction information; an indexing module configured to index the specified identification information of the specified target object according to the instruction information; and a processing module configured to push or display part or all of the video corresponding to the specified identification information in the video.

Preferably, the identification information is at least set to indicate one of the following: the mark type of the identification information, the mark content type of the identification information, the length information of the identification information, the mark content of the identification information, the quality level of the part or all of the video in which the target object is located, the quantity of identification information contained in the part or all of the video in which the target object is located, the time information corresponding to the part or all of the video, and the spatial location information of the part or all of the video in the video.

Preferably, the spatial location information of the part or all of the video in the video includes at least one of the following: the center point coordinates of the part or all of the video, the width of the part or all of the video, and the height of the part or all of the video, wherein the coordinate system of the coordinates is one of the following: a two-dimensional space coordinate system or a three-dimensional space coordinate system.

Preferably, the coordinate values include at least one of the following: two-dimensional rectangular coordinate values, two-dimensional spherical coordinate values, three-dimensional rectangular coordinate values, and three-dimensional spherical coordinate values.

Preferably, the marking module includes: a first marking unit configured to mark the target object in the video during video capture or editing; and a second marking unit configured to mark the target object in the video in captured or edited video data.

Preferably, the acquiring module includes: a first acquiring unit configured to acquire first instruction information preset by the user; and a second acquiring unit configured to acquire second instruction information obtained after analyzing the user's video viewing behavior.
According to another aspect of the present disclosure, a storage medium is further provided, the storage medium including a stored program, wherein the program, when run, performs the video processing method of the above embodiments.

According to another aspect of the present disclosure, a processor is further provided, the processor being configured to run a program, wherein the program, when run, performs the video processing method of the above embodiments.

Through the present disclosure, a target object in a video is marked, and identification information of the target object is generated according to the marking result, the identification information containing at least the spatial location information of the target object in the video. Instruction information set to indicate a specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and part or all of the video corresponding to the specified identification information is pushed or displayed according to the spatial location information in the identification information, where the part or all of the video is contained in the whole video. The above method solves the problem in the related art that the user needs to detect the video areas of interest by identifying the video content in a large number of received videos, which consumes a large amount of resources and time; the user can quickly acquire the video of interest by indexing the identification information already present in the video, greatly saving resources and time in the video retrieval process.
Brief description of the drawings

The drawings described herein are provided for a further understanding of the present disclosure and form a part of this application; the exemplary embodiments of the present disclosure and their description are used to explain the present disclosure and do not improperly limit it. In the drawings:

FIG. 1 is a schematic diagram of an application environment of an optional video processing method according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of an optional video processing method according to an embodiment of the present disclosure;

FIG. 3 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure;

FIG. 4 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure;

FIG. 5 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of the content of optional identification information in an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of an optional video positioning method according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of an optional video retrieval method according to an embodiment of the present disclosure.
Detailed description

The present disclosure will be described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, without conflict, the embodiments in this application and the features in those embodiments may be combined with each other.

It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product, or device.
Embodiment 1
In an embodiment of the present disclosure, an embodiment of the above video processing method is provided. FIG. 1 is a schematic diagram of an application environment of an optional video processing method according to an embodiment of the present disclosure. As an optional implementation, the video processing method may be, but is not limited to being, applied in the application environment shown in FIG. 1. A terminal 102 is connected to a server 106, where the server 106 may push video files to the terminal 102. An application client 104 capable of receiving and displaying video images runs on the terminal 102. The server 106 marks a target object in a video image and generates identification information of the target object according to the marking result, where the identification information is set to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video image. The server 106 acquires instruction information and acquires, according to the instruction information, specified identification information of a specified target object, where the instruction information is set to indicate at least one specified target object. The server 106 then pushes the video corresponding to the specified identification information, where the video includes part or all of the video. It should be noted that each step performed by the server 106 may also be performed on the terminal 102 side, which is not limited in this embodiment of the present disclosure.
An embodiment of the present disclosure further provides a video processing method. FIG. 2 is a flowchart of an optional video processing method according to an embodiment of the present disclosure. As shown in FIG. 2, an optional flow of the video processing method includes the following steps.
Step S202: mark a target object in a video, and generate identification information of the target object according to the marking result, where the identification information is set to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video.
Step S204: acquire instruction information, and index, according to the instruction information, specified identification information of a specified target object.
Step S206: push or display the part or all of the video corresponding to the specified identification information.
With the method provided by the present disclosure, a target object in a video is marked and identification information of the target object is generated according to the marking result; the identification information contains at least the spatial position information of the target object in the video. Instruction information set to indicate a specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and part or all of the video corresponding to the specified identification information is pushed or displayed according to the spatial position information in the identification information, where this part or all of the video is contained in the whole video. This solves the related-art problem that a user has to detect the video regions of interest by recognizing video content in a large number of received videos, which consumes considerable resources and time: the user can quickly obtain the video of interest by indexing the identification information already present in the video, greatly saving resources and time in the video retrieval process.
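The flow of steps S202 through S206 can be sketched informally in code. The following Python fragment is an illustrative assumption, not part of the disclosed embodiments: the detection results, field names, and region tuples are all hypothetical, and the marking step is stubbed out rather than performed by real image analysis.

```python
# Hypothetical sketch of steps S202-S206: mark targets, generate
# identification information, index it by instruction, push the match.

def mark_targets(video_frames):
    # Step S202 (stubbed): a real system would analyze the frames;
    # here a fixed detection result stands in for the marking step.
    return [
        {"type": "face", "content": "person A", "region": (100, 80, 40, 40)},
        {"type": "license_plate", "content": "XYZ-123", "region": (300, 200, 60, 20)},
    ]

def generate_identification_info(marks):
    # Identification information indicates at least the type, content,
    # or spatial position of each target object.
    return {m["content"]: m for m in marks}

def index_by_instruction(id_info, instruction):
    # Step S204: the instruction information designates target objects.
    return {k: v for k, v in id_info.items()
            if v["type"] in instruction["wanted_types"]}

def push_video_regions(matched):
    # Step S206: push/display only the regions given by the spatial
    # position information of the matched identification info.
    return [v["region"] for v in matched.values()]

id_info = generate_identification_info(mark_targets(video_frames=None))
matched = index_by_instruction(id_info, {"wanted_types": {"face"}})
regions = push_video_regions(matched)
print(regions)  # [(100, 80, 40, 40)]
```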
In an optional example of this embodiment of the present disclosure, the identification information is set to indicate at least one of: a label type of the identification information, a label content type of the identification information, label content of the identification information, length information of the identification information, a quality level of the part or all of the video where the target object is located, the number of pieces of identification information contained in the part or all of the video where the target object is located, time information corresponding to the part or all of the video, and spatial position information of the part or all of the video within the video.
In an optional example of this embodiment of the present disclosure, the spatial position information of the part or all of the video within the video includes at least one of: center point coordinates of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video, where the coordinate system of the coordinates is one of: a two-dimensional space coordinate system and a three-dimensional space coordinate system.
In an optional example of this embodiment of the present disclosure, in a two-dimensional space coordinate system the coordinate values include at least one of: two-dimensional rectangular coordinate values, which may be expressed as (x, y), and two-dimensional spherical coordinate values, which may be expressed as (pitch coordinate value, yaw coordinate value). In a three-dimensional space coordinate system the coordinate values include at least one of: three-dimensional rectangular coordinate values, which may be expressed as (x, y, z), and three-dimensional spherical coordinate values, which may be expressed as (pitch coordinate value, yaw coordinate value, roll angle).
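As an informal illustration of the four coordinate forms (a sketch under assumed names, not a definition from the disclosure), the spatial position of a video region could be carried as a center point plus width and height, with the tuple layout depending on the coordinate system:

```python
# Hypothetical representation of a region's spatial position information.
def make_region(system, center, width, height):
    # 'system' names one of the four coordinate forms described above;
    # the center tuple arity must match that system.
    arity = {"2d_rect": 2,    # (x, y)
             "2d_sphere": 2,  # (pitch, yaw)
             "3d_rect": 3,    # (x, y, z)
             "3d_sphere": 3}  # (pitch, yaw, roll)
    if len(center) != arity[system]:
        raise ValueError("center arity does not match coordinate system")
    return {"system": system, "center": center, "width": width, "height": height}

r_rect = make_region("2d_rect", (320, 240), 100, 80)
r_sphere = make_region("3d_sphere", (-30.0, 45.0, 0.0), 20.0, 10.0)
```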
In an optional example of this embodiment of the present disclosure, marking the target object in the video and generating the identification information of the target object according to the marking result includes: marking the target object in the video during video capture or editing and generating the identification information of the target object according to the marking result; and/or marking the target object in the video in captured or edited video data and generating the identification information of the target object according to the marking result.
In an optional example of this embodiment of the present disclosure, acquiring the instruction information set to indicate the at least one specified target object includes: acquiring first instruction information preset by a user; and/or acquiring second instruction information obtained after analyzing the video viewing behavior of the user.
Example 2
This example further provides an optional video processing apparatus. The apparatus is configured to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware implementing a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
According to an embodiment of the present disclosure, a processing apparatus configured to implement the above video processing method is further provided. FIG. 3 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes:
a marking module 302, configured to mark a target object in a video;
a generating module 304, configured to generate identification information of the target object according to the marking result, where the identification information is set to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video;
an acquiring module 306, configured to acquire instruction information, where the instruction information is set to indicate at least one specified target object;
an indexing module 308, configured to index, according to the instruction information, specified identification information of the specified target object;
a processing module 310, configured to push or display the part or all of the video corresponding to the specified identification information.
With the above apparatus, the marking module marks the target object in the video, and the generating module generates identification information of the target object according to the marking result, the identification information containing at least the spatial position information of the target object in the video. The acquiring module then acquires instruction information set to indicate a specified target object, the indexing module indexes the specified identification information of the specified target object according to the instruction information, and the processing module pushes or displays, according to the spatial position information in the identification information, the part or all of the video corresponding to the specified identification information, where this part or all of the video is contained in the whole video. This solves the related-art problem that a user has to detect the video regions of interest by recognizing video content in a large number of received videos, which consumes considerable resources and time: the user can quickly obtain the video push of interest by indexing the identification information already present in the video, greatly saving resources and time in the video retrieval process.
In an optional example of this embodiment of the present disclosure, the identification information is set to indicate at least one of: a label type of the identification information, a label content type of the identification information, length information of the identification information, label content of the identification information, a quality level of the part or all of the video where the target object is located, the number of pieces of identification information contained in the part or all of the video where the target object is located, time information corresponding to the part or all of the video, and spatial position information of the part or all of the video within the video.
In an optional example of this embodiment of the present disclosure, the spatial position information of the part or all of the video within the video includes at least one of: center point coordinates of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video, where the coordinate system of the coordinates is one of: a two-dimensional space coordinate system and a three-dimensional space coordinate system.
In an optional example of this embodiment of the present disclosure, in a two-dimensional space coordinate system the coordinate values include at least one of: two-dimensional rectangular coordinate values, which may be expressed as (x, y), and two-dimensional spherical coordinate values, which may be expressed as (pitch coordinate value, yaw coordinate value). In a three-dimensional space coordinate system the coordinate values include at least one of: three-dimensional rectangular coordinate values, which may be expressed as (x, y, z), and three-dimensional spherical coordinate values, which may be expressed as (pitch coordinate value, yaw coordinate value, roll angle).
An embodiment of the present disclosure further provides an optional video processing apparatus. FIG. 4 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure.
As shown in FIG. 4, the marking module 302 includes: a first marking unit 3020, configured to mark the target object in the video during video capture or editing; and a second marking unit 3022, configured to mark the target object in the video in captured or edited video data.
The acquiring module 306 includes: a first acquiring unit 3060, configured to acquire first instruction information preset by a user; and a second acquiring unit 3062, configured to acquire second instruction information obtained after analyzing the video viewing behavior of the user.
It should be noted that, in the embodiments of the present disclosure, the above apparatus may be applied to a server, a terminal, or any hardware device having the above functional modules, which is not limited in the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a physical apparatus applying the above functional modules. FIG. 5 is a structural block diagram of an optional video processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 5, the apparatus includes:
a processor 50; and a memory 52, where the memory 52 is configured to store instructions executable by the processor 50, and the processor 50 is configured to perform the following operations according to the instructions stored in the memory 52: mark a target object in a video and generate identification information of the target object according to the marking result, where the identification information is set to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video; acquire instruction information and index, according to the instruction information, specified identification information of a specified target object; and push or display the part or all of the video corresponding to the specified identification information.
The above processor 50 may also perform the implementation of any optional example of the above video processing method.
With the above apparatus, the processor marks the target object in the video and generates identification information of the target object according to the marking result, the identification information containing at least the spatial position information of the target object in the video; acquires instruction information set to indicate a specified target object; indexes the specified identification information of the specified target object according to the instruction information; and pushes, according to the spatial position information in the identification information, the part or all of the video corresponding to the specified identification information, where this part or all of the video is contained in the whole video. This solves the related-art problem that a user has to detect the video regions of interest by recognizing video content in a large number of received videos, which consumes considerable resources and time: the user can quickly obtain the video push of interest by indexing the identification information already present in the video, greatly saving resources and time in the video retrieval process.
An embodiment of the present disclosure further provides a storage medium including a stored program, where, when the program runs, the implementation of the video processing method in the above embodiments and their optional examples is executed.
Example 3
For a better understanding of the technical solutions in the above embodiments, this example introduces, in one embodiment, the technical solutions of the embodiments of the present disclosure based on exemplary application scenarios.
An embodiment of the present disclosure provides a method for marking identification information based on video content and its spatial position, which can attach corresponding information identifiers to video regions of specific content or specific spatial positions in video media, so that the identification information provided in the embodiments of the present disclosure can be associated with the corresponding video regions according to the content a user is interested in. A video region here can be understood as a video image within a certain range around the target object associated with the identification information; the size or shape of the region can be customized, which is not limited in this example.
Illustratively, this can be applied to applications and services such as video positioning and video retrieval. Video positioning matches information acquired in advance, such as user habits and preferences, against the identification information marked in the video itself; since the identification information is based on specific video content and spatial position, the corresponding video region can be located directly and pushed to the user. In particular, in the consumption of panoramic video, since a user cannot watch the entire panoramic video at once but only a part of it, the video positioning application of the present disclosure enables panoramic video applications such as an initial viewing angle, and can also give priority to presenting the regions the user is interested in. Video retrieval directly retrieves the video content a user needs from a large number of videos; for example, in a video surveillance application scenario, the regions of interest to the user need to be processed quickly and in a concentrated manner.
The present disclosure provides a method for marking identification information based on video content and its spatial position; therefore, the corresponding video region can be quickly retrieved by searching the identification information provided by the present disclosure, which greatly improves the efficiency of video retrieval.
To achieve the above object, the embodiments of the present disclosure adopt the following technical solutions. It should be noted that the video content subsidiary labels, or labels, mentioned in the embodiments of the present disclosure can be understood as identification information based on video content and its spatial position.
An object of the present disclosure is to provide a method for marking identification information based on video content and its spatial position, exemplarily: for the video picture finally presented to the user, specific video label information uniquely associated with a video region of specific content or a specific spatial position is attached to that region.
In the present disclosure, the identification information to be added based on video content and its spatial position can take many forms; preferably, it can be implemented by the following set of information, given as an example:
Information 1: set to indicate the label type of the subsidiary label of the video content of the region;
Information 2: set to indicate the label content type of the subsidiary label of the video content of the region;
Information 3: set as exemplary information indicating the label content of the subsidiary label of the video content of the region;
Information 4: set to indicate the quality level of the video content of the region;
Information 5: set to indicate the spatial position of the region in the overall video.
The present disclosure marks information on specific content or specific spatial positions of video media; the identification information indicates the specific content category, content information, content quality, and content position of that part of the video. In exemplary applications such as video conferencing, video surveillance, and video advertising, the video label information provided by the present disclosure may, in one embodiment, be set for the processing and presentation of a client application or service.
The present disclosure is described in detail below in conjunction with exemplary embodiments. The following embodiments will help those skilled in the art understand the present disclosure in one embodiment, but do not limit the present disclosure in any form. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the present disclosure; these all fall within the scope of protection of the present disclosure.
Illustratively, during the video shooting and capture stage, the server can simultaneously analyze the video content through techniques such as image processing and pattern recognition, and mark specific content or specific spatial positions of the video media according to the analysis results.
Alternatively, the server marks specific content or specific spatial positions of the video media during the video editing process.
Alternatively, the server marks specific content or specific spatial positions of the video media in the captured or edited video data.
Illustratively, the server may place the marked specific content or specific spatial position information in a reserved field of the video stream or bitstream.
Alternatively, the server may separately produce marking data associated with the corresponding video data.
Alternatively, the client used by the user may, according to the user's usage habits, separately produce the marking data of the corresponding video and feed it back to the server.
After receiving the video media, the user can learn the specific content in the video and its spatial position from these information identifiers, and thus perform further application processing.
Before pushing a video to the user, the server may first obtain the video region matching the user information by matching preset user information against the identification information marked in the video, and then perform matched pushing according to the user's preferences or settings.
Alternatively, during video pushing, the server dynamically matches the identification information according to the user's video viewing requirements for specific content, and pushes the corresponding video region to the user.
Alternatively, the server pushes the complete video to the user, and the terminal obtains the video region matching the user information according to the preset user information and the identification information marked in the video, and performs matched display according to the user's preferences or settings.
Alternatively, the server pushes the complete video to the user, and during the user's viewing the terminal dynamically matches the identification information according to the user's video viewing requirements for specific content and displays the corresponding video region to the user.
The user information here may include, but is not limited to, at least one of: the user's viewing habits, the user's preferences for specific content, the user's degree of preference, and the user's specific purposes. The identification information here may be set to indicate, but is not limited to, at least one of: the label type of the subsidiary label of the video content of the region, the label content type of the subsidiary label of the video content of the region, information on the label content of the subsidiary label of the video content of the region, the quality level of the video content of the region, and the spatial position of the region in the overall video.
Since the identification information in the video is marked based on specific video content and spatial position, the server can directly locate the video region matching the user information and push that region to the user, and the terminal can directly locate the video region matching the user information and display that region to the user. It should be noted that the user information here may be user information acquired in advance before the video is pushed, or may be obtained by collecting the user's feedback while the user watches the video, which is not limited in this example. For user information collected in advance, the matched video region can be pushed to the user at the initial stage of viewing; for user information collected while the user watches the video, after the user information is analyzed and matched against the identification information in the video, the matched video region is pushed to the user during the user's subsequent viewing.
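The matching described above can be illustrated with a minimal sketch; the label records, preference set, and function names below are hypothetical and only show the idea of filtering regions by the user's preferred label types:

```python
# Hypothetical matching of user preferences against label info.
labels = [
    {"label_type": "face", "content": "speaker", "region": (0, 0, 64, 64)},
    {"label_type": "product", "content": "phone", "region": (128, 0, 64, 64)},
    {"label_type": "plant", "content": "tree", "region": (0, 128, 64, 64)},
]

user_info = {"likes": {"face", "product"}}  # e.g. viewing habits/preferences

def match_regions(labels, user_info):
    # Keep only regions whose label type appears in the user's preferences;
    # these are the regions the server pushes or the terminal displays.
    return [l["region"] for l in labels if l["label_type"] in user_info["likes"]]

print(match_regions(labels, user_info))  # [(0, 0, 64, 64), (128, 0, 64, 64)]
```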
The above marking process can be implemented by adding new identification information to the video media related information. This information can take many forms, preferably implemented by the following set of information, given as an example:
quality_level: indicates the quality level of the video content of the region;
label_center_yaw: indicates the yaw coordinate value of the center point of the label region;
label_center_pitch: indicates the pitch coordinate value of the center point of the label region;
label_width: indicates the width of the label region;
label_height: indicates the height of the label region;
label_type: indicates the label type of the subsidiary label of the video content of the region;
label_info_type: indicates the label content type of the subsidiary label of the video content of the region;
label_info_content_length: indicates the content length of the subsidiary label of the video content of the region;
content_byte: indicates the specific byte information of the label content of the subsidiary label of the video content of the region.
For convenience of description, the following embodiments refer to the set of identification information described above, but in other embodiments other information may also, or instead, be used.
Taking the base media file format ISOBMFF as an example, the identification information based on video content and its spatial position, namely quality_level, label_center_yaw, label_center_pitch, label_width, label_height, label_type, label_info_type, label_info_content_length, and content_byte, is added appropriately to form the marking of video regions of specific content and specific spatial positions.
For the present disclosure, the following fields can be added as needed:
label_number: indicates the number of labels contained in the video region.
quality_level: indicates the quality level of the video content of the region; the higher the value, the higher the video quality.
label_center_yaw: indicates the yaw coordinate value of the center point of the label region, in units of 0.01 degrees, with a value range of [-18000, 18000).
label_center_pitch: indicates the pitch coordinate value of the center point of the label region, in units of 0.01 degrees, with a value range of [-9000, 9000].
label_width: indicates the width of the label region, in units of 0.01 degrees.
label_height: indicates the height of the label region, in units of 0.01 degrees.
label_type: indicates the label type of the subsidiary label of the video content of the region, where the values and meanings of the label type are shown in Table 1.
Table 1

Value    Description
0        The tag attached to the video content is a face
1        The tag attached to the video content is a license plate
2        The tag attached to the video content is a general moving target
3        The tag attached to the video content is a general static target
4        The tag attached to the video content is a commodity
5        The tag attached to the video content is a plant
6-254    Reserved
255      The tag attached to the video content is a user-defined tag
label_info_type: indicates the content type of the tag attached to the video content of the region; the values and meanings of the tag content type are listed in Table 2.
Table 2

Value    Description
0        The tag content is text
1        The tag content is a URL
2-255    Reserved
label_info_content_length: indicates the length of the content of the tag attached to the video content of the region.
content_byte: carries the specific byte information of the tag content.
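As a sketch of how the coordinate fields above could be consumed (the function name and the returned dict layout are illustrative, not part of the disclosure), the 0.01-degree integer units convert to degrees as follows:

```python
def decode_label_region(label_center_yaw, label_center_pitch,
                        label_width, label_height):
    """Decode a tag region whose fields are stored in 0.01-degree units."""
    # Enforce the value ranges stated in the field definitions above.
    if not (-18000 <= label_center_yaw < 18000):
        raise ValueError("label_center_yaw out of range [-18000, 18000)")
    if not (-9000 <= label_center_pitch <= 9000):
        raise ValueError("label_center_pitch out of range [-9000, 9000]")
    # Convert 0.01-degree units to degrees.
    return {
        "center_yaw_deg": label_center_yaw / 100.0,
        "center_pitch_deg": label_center_pitch / 100.0,
        "width_deg": label_width / 100.0,
        "height_deg": label_height / 100.0,
    }

region = decode_label_region(4500, -3000, 2000, 1500)
# region["center_yaw_deg"] == 45.0, region["center_pitch_deg"] == -30.0
```

Note the asymmetry of the ranges: yaw is half-open at +180 degrees (so that +180 and -180 are not both representable), while pitch is closed at both poles.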
Based on the above information, and taking ISOBMFF as an example, one possible organization of this information is given below. The tag group LabelBox corresponding to a video region contains label_number tag information boxes (LabelInfoBox) and one tag region information box (LabelRegionBox).

A LabelInfoBox contains a tag type label_type, a tag content type label_info_type, a tag content length label_info_content_length, and label_info_content_length bytes of content information content_byte.

A LabelRegionBox contains the quality level quality_level and the spatial position information: the center point of the tag region (label_center_yaw, label_center_pitch), the tag region width label_width, and the tag region height label_height.
[Figures PCTCN2017112342-appb-000001 and PCTCN2017112342-appb-000002: box syntax definitions for the structures described above; images not reproduced here.]
The meaning of each of the above fields has been explained above.

Note that the above fields and their sizes are only an example used to describe the tag attached to the video content; the disclosure is not limited to them. For a better understanding of these fields, see the application example shown in FIG. 6. FIG. 6 is a schematic diagram of the content of optional identification information in an embodiment of the present disclosure.
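As an illustration only (the normative box syntax appears in the figures of the original filing, and the Python class names here merely mirror the box names above), the described structure, one LabelRegionBox plus label_number LabelInfoBox entries per LabelBox, might be modeled as:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LabelInfoBox:
    label_type: int       # e.g. 0 = face, 1 = license plate (Table 1)
    label_info_type: int  # e.g. 0 = text, 1 = URL (Table 2)
    content_byte: bytes   # tag content payload

    @property
    def label_info_content_length(self) -> int:
        # Derived from the payload rather than stored separately.
        return len(self.content_byte)

@dataclass
class LabelRegionBox:
    quality_level: int
    label_center_yaw: int    # all four position fields in 0.01-degree units
    label_center_pitch: int
    label_width: int
    label_height: int

@dataclass
class LabelBox:
    region: LabelRegionBox
    labels: List[LabelInfoBox] = field(default_factory=list)

    @property
    def label_number(self) -> int:
        return len(self.labels)

box = LabelBox(
    region=LabelRegionBox(3, 4500, -3000, 2000, 1500),
    labels=[LabelInfoBox(5, 0, "tulip".encode("utf-8"))],  # 5 = plant
)
# box.label_number == 1
```

Deriving label_number and label_info_content_length from the container sizes, as the properties above do, keeps the in-memory model free of the redundancy that the serialized form carries.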
Embodiment 4

For a better understanding of the technical solutions in the foregoing embodiments, this embodiment introduces them through the following preferred embodiments.

Preferred Embodiment 1: video positioning applications

A panoramic video covers a 180-degree or 360-degree field of view, but the human field of view is limited: a viewer cannot watch the whole panoramic video at once, only part of it. Users therefore watch different regions of the panorama in different browsing orders. Notably, which regions a user watches is not completely random; the user switches video regions according to personal preference. The present disclosure provides tags associated with the video that indicate the specific content and the specific spatial position of partial video regions, so that the video region matching the user's preferences can be located directly and presented to the user. The following examples illustrate this.
Example 1

According to preset tag types, the corresponding video regions are marked in the recorded panoramic video content; during viewing, the video regions carrying the tag types the user prefers are pushed to the user first.

Alternatively, based on the tag types already present in the video, the system can dynamically collect information on what the user watches, analyze the user's preferences, and push the video of the user's region of interest to the user.

See FIG. 7 for an example. FIG. 7 is a schematic diagram of an optional video positioning method according to an embodiment of the present disclosure. As shown in FIG. 7, a tag may indicate that the corresponding region contains a face, a plant, or other content. If the user likes to watch the plants in the video, then while the user watches the panoramic video, the plant tag can be located and, based on its spatial position information and rotation information, the video content of that region can be pushed to the user first.
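A minimal sketch of this tag-based selection (the dict layout is an assumption for illustration; label_type 5 is the plant type from Table 1):

```python
def regions_for_preference(label_boxes, preferred_type):
    """Return the regions that carry at least one tag of the preferred type."""
    return [box for box in label_boxes
            if any(label["label_type"] == preferred_type
                   for label in box["labels"])]

boxes = [
    {"labels": [{"label_type": 0}], "center": (0, 0)},       # face region
    {"labels": [{"label_type": 5}], "center": (4500, 100)},  # plant region
]
plant_regions = regions_for_preference(boxes, preferred_type=5)
# plant_regions contains only the second box; its spatial position
# information then drives which viewport is pushed to the user.
```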
Example 2

Because the user's field of view is limited at any moment, the user never watches the entire panoramic video at once. Under limited bandwidth, the user's regions of interest can therefore be encoded at high quality and the remaining regions at low quality.

For example, the region containing a face the user is interested in is encoded at high quality, and the other regions are encoded at low quality.
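One way this per-region decision could be driven by the tags (a sketch under assumed data structures; the quality_level values are arbitrary placeholders):

```python
HIGH_QUALITY, LOW_QUALITY = 5, 1

def assign_quality(label_boxes, interesting_types):
    """Give regions carrying an interesting tag a high quality level."""
    for box in label_boxes:
        types = {label["label_type"] for label in box["labels"]}
        box["quality_level"] = (HIGH_QUALITY if types & interesting_types
                                else LOW_QUALITY)
    return label_boxes

boxes = assign_quality(
    [{"labels": [{"label_type": 0}]},   # face: region of interest
     {"labels": [{"label_type": 3}]}],  # static target: not of interest
    interesting_types={0},
)
# boxes[0]["quality_level"] == 5, boxes[1]["quality_level"] == 1
```

The resulting quality_level per region would then be handed to the encoder as its rate-allocation hint.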
Example 3

While watching a panoramic video, the user can set multiple tags of interest; according to the possible combinations of these tags, the optimal video region is pushed to the user.

Alternatively, the user's viewing habits can be collected dynamically, the user's preferences analyzed, and multiple preferences combined into various forms.

For example, a user interested in a particular person and a particular kind of car sets both as tags of interest. When video is pushed, a region containing both the person tag and the car tag is displayed first; when no region contains both, a region containing the person or the car alone is selected and displayed.
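The "both tags beat either alone" rule falls out of scoring each region by how many preferred tag types it matches (a sketch; label_type 255 stands in for a user-defined car tag, per Table 1):

```python
def best_region(label_boxes, preferred_types):
    """Pick the region matching the largest number of preferred tag types."""
    def score(box):
        return len({label["label_type"] for label in box["labels"]}
                   & preferred_types)
    return max(label_boxes, key=score)

boxes = [
    {"name": "person_only",
     "labels": [{"label_type": 0}]},
    {"name": "person_and_car",
     "labels": [{"label_type": 0}, {"label_type": 255}]},
]
chosen = best_region(boxes, preferred_types={0, 255})
# chosen["name"] == "person_and_car": matching both tags scores 2,
# matching only the person tag scores 1.
```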
Example 4

The tag types added to a panoramic video may be preset, or may be tags the user defines according to their own needs,

or the user may combine related tags to define the combination types they need.

For example, the user sets a custom tag on an object in the video and feeds the tag information back to the server; the server then pushes the relevant video regions to the user according to the set tag.
Example 5

Different tags in a panoramic video can differ in both the form of the content they carry and the content itself. Tag content can be text, such as a person tag whose text describes the person's name and biography. Tag content can be a number, such as a commodity tag whose numeric content describes price information. Tag content can be a link, such as a plant tag whose link content gives the URL of a detailed introduction to the plant.
Example 6

A single tag in a video region can be associated with several types of content information. For example, to describe a product logo in a video, one can add text information for the product name, numeric information for the price or the production date, and link information for the purchase path.
Example 7

A tag set on a panoramic video region can nest multiple sub-tags. For example, in a sports panoramic video there are many athletes on the field, and the user cares not about any single athlete but about the whole scene and the interplay between the players; multiple person sub-tags can be nested under the same sports tag so that the user can watch them together.
Preferred Embodiment 2: virtual reality video applications

As with panoramic video, in virtual reality video applications the region the user watches is not the complete virtual reality video, so by adding different tags the content of interest can be pushed to the user.
Preferred Embodiment 3: multi-view video applications

Tags are added to the videos of multiple viewpoints, and the user sets tags on regions of interest; the best viewpoint video can then be selected and pushed to the user according to the user's tags of interest.
Preferred Embodiment 4: video retrieval applications

In video surveillance scenarios, captured surveillance video is typically used to track target vehicles, target persons, and so on. Because such tracking often requires analyzing and processing large amounts of surveillance video with image processing and similar techniques in a short time, it imposes a heavy workload on video surveillance applications. With the indication information provided by the present disclosure, video regions of specific content such as faces and license plates can be tagged while the surveillance video is being shot, so after the video is received its tags can be searched directly, greatly reducing the video retrieval workload. The following examples illustrate this.
Example 1

FIG. 8 is a schematic diagram of an optional video retrieval method according to an embodiment of the present disclosure. As shown in FIG. 8, specific content in the video is tagged while the surveillance video is captured. After receiving the surveillance video, the user can search these tags directly, for example retrieving the tags of all license plates and the information associated with those tags, finally obtaining the video information of every license plate contained in the video together with the plate numbers.
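A sketch of such a retrieval pass (the field names follow the disclosure; the dict layout is an assumption): scan every tagged region and collect the content of all license-plate tags, here label_type 1 with label_info_type 0 (text) per Tables 1 and 2.

```python
def find_license_plates(label_boxes):
    """Collect (plate text, region position) for every license-plate tag."""
    plates = []
    for box in label_boxes:
        for label in box["labels"]:
            if label["label_type"] == 1 and label["label_info_type"] == 0:
                plates.append((label["content_byte"].decode("utf-8"),
                               box["region"]))
    return plates

boxes = [
    {"region": (4500, -100, 500, 200),
     "labels": [{"label_type": 1, "label_info_type": 0,
                 "content_byte": b"A12345"}]},
    {"region": (0, 0, 300, 300),
     "labels": [{"label_type": 0, "label_info_type": 0,
                 "content_byte": b"someone"}]},
]
# find_license_plates(boxes) -> [("A12345", (4500, -100, 500, 200))]
```

Because the tags were written at capture time, this pass never decodes or analyzes pixels, which is exactly the workload saving the paragraph above describes.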
Example 2

Multiple tags can be set for video retrieval, and all relevant video regions are found according to the various combinations of these tags.

For example, a particular person and a particular kind of car are searched as a combined tag, finally obtaining the video information that contains both tags.

Video retrieval means finding the video content a user needs directly within a large number of videos. The present disclosure provides a method of marking identification information based on video content and its spatial position, so the corresponding video region can be retrieved quickly by searching that identification information, greatly improving the efficiency of video retrieval.
This disclosure uses ISOBMFF as an example to explain the proposed solutions, but the solutions can equally be used in other file encapsulation formats, transport systems, and protocols.
Embodiment 5

Embodiments of the present disclosure further provide a storage medium. Optionally, in an embodiment of the present disclosure, the storage medium may be configured to store the program code executed by the video processing method provided in the foregoing embodiments.

Optionally, in an embodiment of the present disclosure, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.

Optionally, in an embodiment of the present disclosure, the storage medium is configured to store program code for performing the following steps:
S1: mark a target object in a video, and generate identification information of the target object according to the marking result, where the identification information is configured to indicate at least one of: the type of the target object, the content of the target object, and the spatial position information of the target object in the video;

S2: acquire instruction information, and index the specified identification information of a specified target object according to the instruction information;

S3: push or display part or all of the video corresponding to the specified identification information.
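An end-to-end sketch of steps S1 to S3 (the storage layer and the video I/O are stubbed out; all function and field names are illustrative, not from the disclosure):

```python
def mark_targets(video_regions):                       # S1
    """Attach identification information to each detected target."""
    return [{"id_info": {"label_type": r["type"],
                         "content": r["content"],
                         "position": r["position"]}}
            for r in video_regions]

def index_by_instruction(marked, wanted_type):         # S2
    """Index the identification information matching the instruction."""
    return [m["id_info"] for m in marked
            if m["id_info"]["label_type"] == wanted_type]

def push(selected):                                    # S3
    """Push/display the video areas for the selected identification info."""
    return [s["position"] for s in selected]

marked = mark_targets([{"type": 1, "content": "A12345",
                        "position": (10, 20)}])
positions = push(index_by_instruction(marked, wanted_type=1))
# positions == [(10, 20)]
```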
The serial numbers of the above embodiments of the present disclosure are for description only and do not imply any ranking of the embodiments.

In the above embodiments of the present disclosure, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or of other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present disclosure.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above description covers only preferred embodiments of the present disclosure and is not intended to limit the disclosure; various changes and modifications will occur to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
Industrial applicability

In the video processing method provided by the embodiments of the present disclosure, a target object in a video is marked and identification information of the target object is generated from the marking result; instruction information indicating a specified target object is then acquired, the specified identification information of the specified target object is indexed according to the instruction information, and part or all of the video corresponding to the specified identification information is pushed or displayed according to the spatial position information in the identification information. This solves the problem in the related art that a user has to detect the video regions of interest by recognizing video content in a large number of received videos, which consumes considerable resources and time: the user can quickly obtain the video of interest by indexing the identification information already present in the video, greatly saving resources and time in video retrieval.

Claims (14)

  1. A video processing method, comprising:
    marking a target object in a video, and generating identification information of the target object according to the marking result, wherein the identification information is configured to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video;
    acquiring instruction information, and indexing, according to the instruction information, specified identification information of a specified target object;
    pushing or displaying part or all of the video corresponding to the specified identification information.
  2. The method according to claim 1, wherein the identification information is further configured to indicate at least one of: a tag type of the identification information, a tag content type of the identification information, tag content of the identification information, length information of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in the part or all of the video in which the target object is located, time information corresponding to the part or all of the video in which the target object is located, and spatial position information, within the video, of the part or all of the video in which the target object is located.
  3. The method according to claim 2, wherein the spatial position information of the part or all of the video within the video comprises at least one of: center point coordinates of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video; wherein the coordinate system of the coordinates is one of: a two-dimensional spatial coordinate system and a three-dimensional spatial coordinate system.
  4. The method according to claim 3, wherein
    in the two-dimensional spatial coordinate system, the coordinates take at least one of: values in a two-dimensional rectangular coordinate system and values in a two-dimensional spherical coordinate system;
    in the three-dimensional spatial coordinate system, the coordinates take at least one of: values in a three-dimensional rectangular coordinate system and values in a three-dimensional spherical coordinate system.
  5. The method according to claim 1, wherein marking the target object in the video and generating the identification information of the target object according to the marking result comprises:
    during video capture or editing, marking the target object in the video and generating the identification information of the target object according to the marking result; and/or
    in captured or edited video data, marking the target object in the video and generating the identification information of the target object according to the marking result.
  6. The method according to claim 1, wherein acquiring the instruction information comprises:
    acquiring first instruction information preset by a user; and/or
    acquiring second instruction information obtained by analyzing the user's video viewing behavior.
  7. A video processing apparatus, comprising:
    a marking module, configured to mark a target object in a video;
    a generating module, configured to generate identification information of the target object according to the marking result, wherein the identification information is configured to indicate at least one of: a type of the target object, content of the target object, and spatial position information of the target object in the video;
    an acquiring module, configured to acquire instruction information;
    an indexing module, configured to index, according to the instruction information, specified identification information of a specified target object;
    a processing module, configured to push or display part or all of the video corresponding to the specified identification information.
  8. The apparatus according to claim 7, wherein the identification information is configured to indicate at least one of: a tag type of the identification information, a tag content type of the identification information, length information of the identification information, tag content of the identification information, a quality level of the part or all of the video in which the target object is located, the number of pieces of identification information contained in the part or all of the video in which the target object is located, time information corresponding to the part or all of the video, and spatial position information of the part or all of the video within the video.
  9. The apparatus according to claim 8, wherein the spatial position information of the part or all of the video within the video comprises at least one of: center point coordinates of the part or all of the video, a width of the part or all of the video, and a height of the part or all of the video; wherein the coordinate system of the coordinates is one of: a two-dimensional spatial coordinate system and a three-dimensional spatial coordinate system.
  10. The apparatus according to claim 9, wherein
    in the two-dimensional spatial coordinate system, the coordinates take at least one of: values in a two-dimensional rectangular coordinate system and values in a two-dimensional spherical coordinate system;
    in the three-dimensional spatial coordinate system, the coordinates take at least one of: values in a three-dimensional rectangular coordinate system and values in a three-dimensional spherical coordinate system.
  11. The apparatus according to claim 7, wherein the marking module comprises:
    a first marking unit, configured to mark the target object in the video during video capture or editing;
    a second marking unit, configured to mark the target object in the video in captured or edited video data.
  12. The apparatus according to claim 7, wherein the acquiring module comprises:
    a first acquiring unit, configured to acquire first instruction information preset by a user;
    a second acquiring unit, configured to acquire second instruction information obtained by analyzing the user's video viewing behavior.
  13. A storage medium comprising a stored program, wherein the program, when run, performs the method according to any one of claims 1 to 6.
  14. A processor configured to run a program, wherein the program, when run, performs the method according to any one of claims 1 to 6.
PCT/CN2017/112342 ("Video processing method and apparatus"), filed 2017-11-22, was published as WO2018171234A1 on 2018-09-27 and claims priority to Chinese application CN201710186180.0, filed 2017-03-24.

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110798736B (en) * 2019-11-28 2021-04-20 百度在线网络技术(北京)有限公司 Video playing method, device, equipment and medium
CN111487889A (en) * 2020-05-08 2020-08-04 北京金山云网络技术有限公司 Method, device and equipment for controlling intelligent equipment, control system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068342A (en) * 2007-06-05 2007-11-07 西安理工大学 Video frequency motion target close-up trace monitoring method based on double-camera head linkage structure
CN101207807A (en) * 2007-12-18 2008-06-25 孟智平 Method for processing video and system thereof
US20080252723A1 (en) * 2007-02-23 2008-10-16 Johnson Controls Technology Company Video processing systems and methods
CN101420595A (en) * 2007-10-23 2009-04-29 华为技术有限公司 Method and equipment for describing and capturing video object

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003079952A (en) * 2001-09-14 2003-03-18 Square Co Ltd Computer readable record medium in which video game program is recorded, video game program, and method and device for processing video game
CN101930779B (en) * 2010-07-29 2012-02-29 Huawei Device Co., Ltd. Video commenting method and video player
CN104602128A (en) * 2014-12-31 2015-05-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Video processing method and device
CN104837034B (en) * 2015-03-09 2019-04-12 Tencent Technology (Beijing) Co., Ltd. A kind of information processing method, client and server
CN106303401B (en) * 2015-05-12 2019-12-06 Hangzhou Hikvision Digital Technology Co., Ltd. Video monitoring method, equipment and system thereof and video monitoring method based on shopping mall
CN105843541A (en) * 2016-03-22 2016-08-10 Leshi Internet Information & Technology Corp. (Beijing) Target tracking and displaying method and device in panoramic video
CN105847998A (en) * 2016-03-28 2016-08-10 Le Holdings (Beijing) Co., Ltd. Video playing method, playing terminal, and media server
CN105933650A (en) * 2016-04-25 2016-09-07 Beijing Megvii Technology Co., Ltd. Video monitoring system and method
CN106023261B (en) * 2016-06-01 2019-11-29 Wuxi Tianmai Juyuan Media Technology Co., Ltd. A kind of method and device of television video target following
CN106254925A (en) * 2016-08-01 2016-12-21 Le Holdings (Beijing) Co., Ltd. Destination object extracting method based on video identification, equipment and system
CN106303726B (en) * 2016-08-30 2021-04-16 Beijing QIYI Century Science & Technology Co., Ltd. Video tag adding method and device
CN106504187A (en) * 2016-11-17 2017-03-15 Le Holdings (Beijing) Co., Ltd. Video frequency identifying method and device
CN106534944B (en) * 2016-11-30 2020-01-14 Beijing ByteDance Network Technology Co., Ltd. Video display method and device

Also Published As

Publication number Publication date
CN108628913B (en) 2024-06-25
CN108628913A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
US10380170B2 (en) Integrated image searching system and service method thereof
US9560323B2 (en) Method and system for metadata extraction from master-slave cameras tracking system
US20230351709A1 (en) Interaction analysis systems and methods
CN103577788A (en) Augmented reality realizing method and augmented reality realizing device
KR20130105542A (en) Object identification in images or image sequences
US9007396B2 (en) Methods and systems for analyzing parts of an electronic file
WO2012064494A1 (en) Aligning and annotating different photo streams
WO2018113659A1 (en) Method of displaying streaming medium data, device, process, and medium
CN111586474A (en) Live video processing method and device
CN105022773B (en) Image processing system including picture priority
CN105183739B (en) Image processing method
CN104520848A (en) Searching for events by attendants
CN110881131B (en) Classification method of live review videos and related device thereof
Kim et al. Key frame selection algorithms for automatic generation of panoramic images from crowdsourced geo-tagged videos
WO2018171234A1 (en) Video processing method and apparatus
US9538209B1 (en) Identifying items in a content stream
WO2018133321A1 (en) Method and apparatus for generating shot information
CN108833964B (en) Real-time continuous frame information implantation identification system
US9727890B2 (en) Systems and methods for registering advertisement viewing
US20130100296A1 (en) Media content distribution
JP6422259B2 (en) Information provision system
US9569440B2 (en) Method and apparatus for content manipulation
JP6740598B2 (en) Program, user terminal, recording device, and information processing system
JP5983745B2 (en) Image information providing apparatus, image information providing system, and image information providing method
CN112187851A (en) Multi-screen information pushing method and device based on 5G and edge calculation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17901791

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/01/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17901791

Country of ref document: EP

Kind code of ref document: A1