CN114501165A - Video structured representation method and device and electronic equipment

Info

Publication number
CN114501165A
Authority
CN
China
Prior art keywords
video
frame set
video frame
obtaining
video object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011149590.6A
Other languages
Chinese (zh)
Inventor
薛子育
王磊
刘庆同
郭沛宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Institute Of Radio And Television Science State Administration Of Radio And Television
Original Assignee
Research Institute Of Radio And Television Science State Administration Of Radio And Television
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Institute Of Radio And Television Science, State Administration Of Radio And Television
Priority to CN202011149590.6A
Publication of CN114501165A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H04N21/8405 Generation or processing of descriptive data, e.g. content descriptors represented by keywords
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The application discloses a video structured representation method, which includes: receiving a video object to be processed; obtaining a first video frame set and a second video frame set according to the video object, wherein the first video frame set is obtained by splitting the video object using a preset rule, and the video frames in the second video frame set are key frames in the video object; and obtaining structured representation information of the video object according to the first video frame set and the second video frame set. The method obtains the structured representation information of a video object conveniently and in a uniform format, thereby increasing the reusability of the video object.

Description

Video structured representation method and device and electronic equipment
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular to a video structured representation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
At present, when video resources, such as those of broadcast stations, Internet Protocol Television (IPTV), new media, and online audiovisual services, are annotated to obtain a structured representation, a large amount of manpower and material resources is generally consumed by manual annotation. In addition, the differing styles of different annotators lead to low reusability of the video resources and to wasted resources.
Disclosure of Invention
It is an object of embodiments of the present disclosure to provide a new technical solution for video structured representation.
In a first aspect of the present disclosure, a video structured representation method is provided, which includes:
receiving a video object to be processed;
according to the video object, a first video frame set and a second video frame set are obtained, wherein the first video frame set is obtained by splitting the video object by using a preset rule, and video frames in the second video frame set are key frames in the video object;
and obtaining the structural representation information of the video object according to the first video frame set and the second video frame set.
Optionally, the obtaining the structured representation information of the video object according to the first video frame set and the second video frame set includes:
obtaining a first semantic information set according to the first video frame set, wherein the first semantic information set comprises semantic information respectively corresponding to video frames in the first video frame set;
according to the second video frame set, obtaining video semantic information used for describing the video object;
and obtaining the structural representation information according to the first semantic information set and the video semantic information.
Optionally, the obtaining a first semantic information set according to the first video frame set includes:
obtaining a first feature information set according to the first video frame set, wherein the first feature information set comprises feature information respectively corresponding to video frames in the first video frame set;
according to the first feature information set, performing semantic analysis processing on the video frames in the first video frame set to obtain the first semantic information set.
Optionally, the obtaining video semantic information for describing the video object according to the second video frame set includes:
obtaining a second feature information set according to the second video frame set, wherein the second feature information set comprises feature information respectively corresponding to video frames in the second video frame set;
according to the second feature information set, performing semantic analysis processing on the video frames in the second video frame set to obtain a second semantic information set;
and obtaining the video semantic information according to the second semantic information set.
Optionally, after obtaining the structured representation information, the method further comprises:
and correspondingly storing the video object, the video semantic information, the first video frame set and the first semantic information set into a video resource library.
Optionally, the first set of video frames is obtained by:
and splitting the video object according to the preset splitting duration as a step length to obtain the first video frame set.
Optionally, the first set of video frames is obtained by: acquiring the total duration of the video object;
obtaining a target splitting duration for splitting the video object according to the total duration;
and splitting the video object according to the target splitting duration as a step length to obtain the first video frame set.
In a second aspect of the present disclosure, there is also provided a video structured representation apparatus, including:
the video object receiving module is used for receiving a video object to be processed;
the video object splitting module is used for obtaining a first video frame set and a second video frame set according to the video object, wherein the first video frame set is obtained by splitting the video object by using a preset rule, and video frames in the second video frame set are key frames in the video object;
and the structural representation information obtaining module is used for obtaining the structural representation information of the video object according to the first video frame set and the second video frame set.
In a third aspect of the present disclosure, there is also provided an electronic device, which includes the apparatus of the second aspect of the present disclosure; alternatively, the electronic device includes: a memory for storing executable instructions; and a processor configured to execute the instructions to control the electronic device to perform the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, there is also provided a computer-readable storage medium storing a computer program that can be read and executed by a computer and that, when read and executed by the computer, performs the method according to the first aspect of the present disclosure.
A beneficial effect of the embodiments of the present disclosure is that, after the electronic device receives the video object to be processed, the video object is split according to a preset rule and the key frames in the video object are extracted to obtain a first video frame set and a second video frame set; then, according to the first video frame set and the second video frame set, the electronic device can conveniently obtain structured representation information of the video object in a uniform format. Since the video object is represented structurally by the electronic device rather than annotated manually, manpower and material resources are saved, the problem of non-uniform formats of video structured representation information caused by differing annotation styles is avoided, and the reusability of the video object is improved.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic block diagram of a hardware configuration of a video structured representation system that can be used to implement a video structured representation method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a video structured representation method according to an embodiment of the present disclosure.
Fig. 3 is a schematic view of an application scenario of a video structured representation method provided by an embodiment of the present disclosure.
Fig. 4 is a schematic block diagram of a video structured representation apparatus provided by an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 is a schematic block diagram of a hardware configuration of a video structured representation system that can be used to implement a video structured representation method according to an embodiment of the present disclosure.
As shown in fig. 1, the video structured representation system 1000 includes a server 1100, a terminal device 1200, and a communication network 1300.
The server 1100 may be, for example, a blade server, a rack server, or the like, and the server 1100 may also be a server cluster deployed in a cloud, which is not limited herein.
As shown in FIG. 1, server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160. The processor 1110 may be, for example, a central processing unit CPU or the like. The memory 1120 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1130 includes, for example, a USB interface, a serial interface, and the like. The communication device 1140 is capable of wired or wireless communication, for example. The display device 1150 is, for example, a liquid crystal display panel. Input devices 1160 may include, for example, a touch screen, a keyboard, and the like.
In this embodiment, the server 1100 may be used to participate in implementing a method according to any embodiment of the present disclosure.
When applied to any embodiment of the present disclosure, the memory 1120 of the server 1100 is configured to store instructions for controlling the processor 1110 to operate so as to implement a method according to any embodiment of the present disclosure. A skilled person can design the instructions according to the disclosed solution. How instructions control the operation of a processor is well known in the art and is not described in detail here.
Those skilled in the art will appreciate that although a number of devices are shown in FIG. 1 for the server 1100, the server 1100 of embodiments of the present disclosure may refer to only some of the devices therein, e.g., only the processor 1110 and the memory 1120.
As shown in fig. 1, the terminal apparatus 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, an audio output device 1270, an audio input device 1280, and the like. The processor 1210 may be a central processing unit (CPU), a microprocessor (MCU), or the like. The memory 1220 includes, for example, a ROM (read-only memory), a RAM (random access memory), and a nonvolatile memory such as a hard disk. The interface device 1230 includes, for example, a USB interface and a headphone interface. The communication device 1240 can perform, for example, wired or wireless communication. The display device 1250 is, for example, a liquid crystal display or a touch display. The input device 1260 may include, for example, a touch screen and a keyboard. The terminal apparatus 1200 may output audio information through the audio output device 1270, which includes, for example, a speaker. The terminal apparatus 1200 may pick up voice information input by the user through the audio input device 1280, which includes, for example, a microphone.
The terminal device 1200 may be a smart phone, a laptop computer, a desktop computer, a tablet computer, etc., and is not limited herein.
It should be understood by those skilled in the art that although a plurality of means of the terminal device 1200 are shown in fig. 1, the terminal device 1200 of the embodiments of the present disclosure may refer to only some of the means therein, for example, only the processor 1210, the memory 1220, and the like.
The communication network 1300 may be a wireless network or a wired network, and may be a local area network or a wide area network. The terminal apparatus 1200 can communicate with the server 1100 through the communication network 1300.
It should be noted that the video structured representation system 1000 shown in fig. 1 is merely illustrative and is in no way intended to limit the present disclosure, its applications, or uses. For example, although fig. 1 shows only one server 1100 and one terminal apparatus 1200, it is not meant to limit the respective numbers, and multiple servers 1100 and/or multiple terminal apparatuses 1200 may be included in the system 1000.
< method examples >
Fig. 2 is a flow diagram of a video structured representation method according to an embodiment of the present disclosure. The method may be implemented by an electronic device, which may be a server, for example the server 1100 shown in fig. 1; alternatively, the electronic device may be a terminal device, which is not particularly limited here.
Referring to FIG. 2, the method of the present embodiment may include the following steps S2100-S2300, which will be described in detail below.
In step S2100, a video object to be processed is received.
The video object may be any video object from a broadcast station, Internet Protocol Television, new media, or a network audiovisual service; attributes of the video object such as format and resolution are not particularly limited in this embodiment.
Specifically, in order to solve the prior-art problems of wasted manpower and material resources and low resource reusability when a video object is structurally represented through manual annotation, the present embodiment provides a video structured representation method: the structured representation information of a video object is obtained by sending the video object to be processed to an electronic device that performs the video structured representation processing, for example the server 1100 shown in fig. 1, so that no manual annotation is needed.
In a specific implementation, a user may send the video object to be processed to the server 1100 through a terminal device, for example the terminal device 1200 shown in fig. 1; alternatively, the server 1100 may automatically obtain the video objects to be processed from a preset location at preset time intervals, for example from an agreed video storage address every 60 minutes, which is not described here again.
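For illustration only, a minimal Python sketch of such periodic fetching is given below; the directory-based storage location, the .mp4 filter, and the function name poll_for_videos are assumptions made for this example, not part of this disclosure.

```python
import time
from pathlib import Path

def poll_for_videos(watch_dir: str, interval_seconds: int = 3600):
    """Every `interval_seconds` (e.g. 60 minutes), yield video files that
    newly appeared in the agreed storage location, modeled here as a
    directory of .mp4 files."""
    seen = set()
    while True:
        for path in sorted(Path(watch_dir).glob("*.mp4")):
            if path not in seen:
                seen.add(path)
                yield path  # hand the video object to the structuring pipeline
        time.sleep(interval_seconds)
```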
After step S2100, step S2200 is executed to obtain a first video frame set and a second video frame set according to the video object, where the first video frame set is obtained by splitting the video object using a preset rule, and a video frame in the second video frame set is a key frame in the video object.
After an electronic device, for example the server 1100 shown in fig. 1, receives a video object to be processed, in order to accurately obtain the structured representation information of the video object, the present embodiment describes the video object along two dimensions, image and video, to achieve its structured representation, as detailed below.
Specifically, after the electronic device receives the video object, the video object may be split using a preset rule to obtain a first video frame set composed of a plurality of video frames, and description information describing the video object from the image dimension is obtained by performing semantic analysis processing on each video frame, i.e., each image, in the first video frame set. In addition, the key frames in the video object are extracted according to the temporal information in the video object to obtain a second video frame set composed of at least one key frame, and description information describing the video object from the video dimension is obtained by performing semantic analysis processing on those key frames.
In the following, how to obtain the first set of video frames will first be described.
In a specific implementation, the first video frame set may be obtained by splitting the video object according to the preset splitting duration as a step length.
For example, with a preset splitting duration of 10 s, for a video object with a total duration of 100 s, a first video frame set consisting of the video frame at 0 s, the video frame at 10 s, …, and the video frame at 100 s may be obtained. Note that, in practice, the preset splitting duration may be set as needed and is not particularly limited here.
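As a non-limiting sketch of this splitting step, the following Python example samples one frame per preset splitting duration using OpenCV; the function name split_by_step and the 10 s default are assumptions made for illustration.

```python
import cv2

def split_by_step(video_path: str, step_seconds: float = 10.0):
    """Sample one video frame every `step_seconds` (the preset splitting
    duration used as the step length) and return [(timestamp, frame), ...]."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back if fps unknown
    total_seconds = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    first_set, t = [], 0.0
    while t <= total_seconds:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)   # seek to timestamp t
        ok, frame = cap.read()
        if not ok:
            break
        first_set.append((t, frame))
        t += step_seconds
    cap.release()
    return first_set
```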
It should be noted that, in order to reduce the computation load of the electronic device and increase the processing speed, after the electronic device receives the video object, the splitting duration may also be derived automatically from the total duration of the video object; that is, the first video frame set may also be obtained through the following steps: acquiring the total duration of the video object; obtaining a target splitting duration for splitting the video object according to the total duration; and splitting the video object according to the target splitting duration as a step length to obtain the first video frame set.
In a specific implementation, the obtaining, according to the total duration, a target splitting duration for splitting the video object may be: setting a time length threshold value set, wherein the time length threshold value set comprises at least one time length threshold value, and each time length threshold value corresponds to different splitting time lengths; and obtaining the target splitting duration according to the total duration and the duration threshold value set.
For example, the duration threshold set may be set to {(0 seconds to 100 seconds: 10 seconds), (100 seconds to 60 minutes: 5 minutes), (60 minutes and above: 10 minutes)}; that is, when the total duration of the video object to be processed is less than 100 seconds, the target splitting duration is 10 seconds; when the total duration is greater than or equal to 100 seconds and less than 60 minutes, the target splitting duration is 5 minutes; and so on. Of course, this is merely an example; in a specific implementation, the duration thresholds and the corresponding splitting durations in the duration threshold set may be set as needed and are not particularly limited here.
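The duration-threshold lookup described above can be sketched as follows, using the example thresholds just given; the table layout and the names are illustrative assumptions.

```python
# (upper bound of total duration in seconds, target splitting duration in seconds)
DURATION_THRESHOLDS = [
    (100,     10),    # total duration < 100 s        -> split every 10 s
    (60 * 60, 300),   # 100 s <= total < 60 min       -> split every 5 min
]
FALLBACK_SPLIT = 600  # total duration >= 60 min      -> split every 10 min

def target_split_duration(total_seconds: float) -> int:
    """Map the video object's total duration to a target splitting duration."""
    for upper_bound, split in DURATION_THRESHOLDS:
        if total_seconds < upper_bound:
            return split
    return FALLBACK_SPLIT
```

With these values, a 100-second video maps to a 5-minute splitting duration, matching the example mapping above.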
It should be noted that, in specific implementation, the two manners may be combined or other rules may be used to split the video object according to needs to obtain the first video frame set, which is not described herein again.
How to obtain the first set of video frames is explained above, and how to obtain the second set of video frames is explained below.
In this embodiment, the second video frame set is a set consisting of key frames in the video object to be processed, where a key frame is a frame that carries a complete image in the video object (an intra-coded frame).
In a specific implementation, when the key frames of a video object need to be extracted, a shot-based algorithm may be used: each shot in the video object is first obtained, and the first frame and the last frame of each shot are then used as key frames of the video object. Alternatively, an algorithm based on motion analysis may be used: by analyzing the optical flow of object motion within a video shot, the video frame at which the optical-flow motion is minimal is selected as a key frame of the video object. Alternatively, a video-distance method may be used: cluster analysis is performed on the video frames of the video object to obtain a plurality of clusters, and the video frame corresponding to the center of each cluster is then used as a key frame of the video object.
In this embodiment, a convolutional neural network model for extracting a key frame of the video object may be trained in advance, and a second set of video frames composed of at least one key frame of the video object may be obtained by inputting the video object into the convolutional neural network model.
Of course, in specific implementation, other key frame extraction algorithms may also be used to extract key frames of the video object as needed to obtain the second video frame set, which is not particularly limited herein.
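To make the video-distance (clustering) variant above concrete, here is a rough Python sketch; it assumes the frames have already been decoded as (timestamp, image) pairs, and it substitutes simple colour histograms for the richer features a convolutional neural network would provide.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def cluster_key_frames(frames, k: int = 5):
    """Video-distance-style extraction: cluster per-frame colour histograms
    and keep, for each cluster, the frame nearest the cluster centre."""
    hists = []
    for _, frame in frames:
        h = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                         [0, 256, 0, 256, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    features = np.asarray(hists)
    km = KMeans(n_clusters=min(k, len(features)), n_init=10).fit(features)
    key_frames = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        key_frames.append(frames[members[int(np.argmin(dists))]])
    return sorted(key_frames, key=lambda tf: tf[0])  # the second video frame set
```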
After step S2200, step S2300 is executed to obtain the structured representation information of the video object according to the first video frame set and the second video frame set.
After the first video frame set and the second video frame set are obtained through the steps, the structured representation information of the video object can be obtained according to the first video frame set and the second video frame set.
Namely, the obtaining the structured representation information of the video object according to the first video frame set and the second video frame set comprises: obtaining a first semantic information set according to the first video frame set, wherein the first semantic information set comprises semantic information respectively corresponding to video frames in the first video frame set; according to the second video frame set, obtaining video semantic information used for describing the video object; and obtaining the structural representation information according to the first semantic information set and the video semantic information.
Specifically, said obtaining a first set of semantic information from said first set of video frames comprises: obtaining a first feature information set according to the first video frame set, wherein the first feature information set comprises feature information respectively corresponding to video frames in the first video frame set; according to the first feature information set, performing semantic analysis processing on the video frames in the first video frame set to obtain the first semantic information set.
That is, in this embodiment, a video object may be split to obtain a plurality of video frames, and semantic information of each video frame is obtained, so that the video object is described hierarchically from the dimension of an image.
In a specific implementation, the first feature information set can be obtained by inputting the video frames in the first video frame set into a convolutional neural network model for extracting image feature information; then, semantic analysis processing is performed on each video frame according to the first feature information set to obtain the first semantic information set.
In this embodiment, when performing semantic analysis processing on a video frame according to feature information of the video frame, an object in the video frame may be identified according to the feature information, and the identified object may be used as semantic information corresponding to the video frame.
For example, for a video frame including an airplane and an airport, semantic information of the video frame may be identified as 'airplane and airport'.
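A hedged sketch of this feature-extraction-and-recognition step follows; the embodiment does not fix a particular model, so a pretrained torchvision classifier is assumed here purely for illustration.

```python
import torch
from torchvision import models

# Pretrained classifier: its activations serve as the frame's feature
# information, its top predictions as the frame's semantic information.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]

@torch.no_grad()
def frame_semantics(image, top_k: int = 3):
    """Return the top-k recognised objects for one video frame (a PIL image)."""
    batch = preprocess(image).unsqueeze(0)
    probabilities = model(batch).softmax(dim=1)[0]
    top = probabilities.topk(top_k)
    return [categories[int(i)] for i in top.indices]
```

For the airplane-and-airport frame above, such an off-the-shelf model would return labels from its own vocabulary (e.g. "airliner"); mapping model labels onto the desired semantic vocabulary is left open by the disclosure.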
In a specific implementation, the obtaining, according to the second video frame set, video semantic information for describing the video object includes: obtaining a second feature information set according to the second video frame set, wherein the second feature information set comprises feature information respectively corresponding to the video frames in the second video frame set; according to the second feature information set, performing semantic analysis processing on the video frames in the second video frame set to obtain a second semantic information set; and obtaining the video semantic information according to the second semantic information set.
That is, after the key frames of the video object are obtained, the feature information of each key frame is obtained, semantic analysis processing is performed on the key frames according to the feature information to obtain a second semantic information set, and then the video semantic information can be obtained by sorting the content in the second semantic information set.
For example, for a video object whose video content is mainly an airplane taking off and landing, the video semantic information of the video object can be obtained by extracting a key frame during takeoff and a key frame during landing, e.g., "the airplane takes off from Pudong Airport and lands at Capital Airport".
In the above way, the video frames in the first video frame set and the second video frame set are respectively subjected to semantic analysis processing, so that the video object can be described from two dimensions of the image and the video, and the structured representation information of the video object can be conveniently and quickly obtained.
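As a deliberately naive sketch of turning the second semantic information set into video semantic information, the example below simply merges per-key-frame labels in temporal order; a production system might instead use a captioning or sequence model, which the disclosure leaves open.

```python
def video_semantics(key_frame_label_lists):
    """Merge per-key-frame label lists, keeping first occurrences in
    temporal order, into a single video-level description string."""
    seen, ordered = set(), []
    for labels in key_frame_label_lists:   # one list per key frame, in order
        for label in labels:
            if label not in seen:
                seen.add(label)
                ordered.append(label)
    return ", ".join(ordered)
```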
Please refer to fig. 3, which is a schematic view of an application scenario of the video structured representation method according to an embodiment of the present disclosure. As shown in fig. 3, in a specific implementation, for a video object to be processed, a first video frame set is obtained by performing "video frame splitting" processing on the video object, a second video frame set is obtained by performing key frame extraction processing, and the feature information of the video frames and key frames in the first and second video frame sets is extracted to obtain the first semantic information set and the video semantic information. Then, in order to manage the video object and to facilitate its later retrieval and multiplexing, the method provided in this embodiment further includes: correspondingly storing the video object, the video semantic information, the first video frame set, and the first semantic information set in a video resource library.
As shown in fig. 3, in a specific implementation, when a video object and its structured description information are stored in the video repository, they may be stored in the correspondence {(video sequence number, video object identification information, video semantic information), (frame 1, frame description information 1), (frame 2, frame description information 2), …, (frame n, frame description information n)}, where frame 1, frame 2, …, frame n are the video frames in the first video frame set obtained by splitting the video object, frame description information 1, frame description information 2, …, frame description information n are the pieces of semantic description information in the first semantic information set respectively corresponding to those video frames, and n is a positive integer. In addition, the video repository may be a database in a database server for storing video objects, so that video objects can be managed, retrieved, and multiplexed.
For example, for a video object whose video content is mainly an airplane taking off and landing, the video structured representation information finally stored in the video repository may be: {(1, ….mp4, "the airplane takes off from Pudong Airport and lands at Capital Airport"), (frame 1, 00:00:00:01, "airplane, airport, taxiing, pilot"), …}.
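For illustration, the correspondence above could be assembled as the following record before being written to the video repository; all field names here are assumptions, since the disclosure fixes only the logical grouping.

```python
def build_repository_record(sequence_number, video_id, video_semantics,
                            first_set, first_semantics):
    """Mirror the {(video sequence number, video object identification
    information, video semantic information), (frame i, frame description
    information i), ...} correspondence described above."""
    return {
        "video": {
            "sequence_number": sequence_number,
            "identifier": video_id,            # e.g. a file name or URI
            "semantics": video_semantics,
        },
        "frames": [
            {"timestamp": t, "description": desc}
            for (t, _frame), desc in zip(first_set, first_semantics)
        ],
    }
```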
In summary, in the video structured representation method provided by this embodiment, after the electronic device receives the video object to be processed, the video object is split according to a preset rule and the key frames in the video object are extracted, so that a first video frame set and a second video frame set are obtained; then, according to the first video frame set and the second video frame set, the electronic device can conveniently obtain the structured representation information of the video object in a uniform format. Since the method does not require manual annotation of the video object but represents it structurally and intelligently with the electronic device, manpower and material resources are saved, the problem of non-uniform formats of video structured representation information caused by differing annotation styles is avoided, and the reusability of the video object is improved.
< apparatus embodiment >
Corresponding to the above method embodiments, in this embodiment, a video structured representation apparatus is further provided, as shown in fig. 4, the apparatus 4000 may include a video object receiving module 4100, a video object splitting module 4200, and a structured representation information obtaining module 4300.
The video object receiving module 4100 is configured to receive a video object to be processed.
The video object splitting module 4200 is configured to obtain a first video frame set and a second video frame set according to the video object, where the first video frame set is obtained by splitting the video object using a preset rule, and a video frame in the second video frame set is a key frame in the video object.
The structural representation information obtaining module 4300 is configured to obtain the structural representation information of the video object according to the first video frame set and the second video frame set.
In one embodiment, the structured representation information obtaining module 4300, when obtaining the structured representation information of the video object according to the first video frame set and the second video frame set, may be configured to: obtaining a first semantic information set according to the first video frame set, wherein the first semantic information set comprises semantic information respectively corresponding to video frames in the first video frame set; according to the second video frame set, obtaining video semantic information used for describing the video object; and obtaining the structural representation information according to the first semantic information set and the video semantic information.
In one embodiment, the structured representation information obtaining module 4300, when obtaining the first set of semantic information from the first set of video frames, may be configured to: obtaining a first feature information set according to the first video frame set, wherein the first feature information set comprises feature information respectively corresponding to video frames in the first video frame set; according to the first feature information set, performing semantic analysis processing on the video frames in the first video frame set to obtain the first semantic information set.
In one embodiment, when the structured representation information obtaining module 4300 obtains video semantic information describing the video object according to the second video frame set, it may be configured to: obtaining a second feature information set according to the second video frame set, wherein the second feature information set comprises feature information respectively corresponding to the video frames in the second video frame set; according to the second feature information set, performing semantic analysis processing on the video frames in the second video frame set to obtain a second semantic information set; and obtaining the video semantic information according to the second semantic information set.
In one embodiment, after obtaining the structured representation information, the apparatus 4000 further includes a storage module configured to store the video object, the video semantic information, the first video frame set, and the first semantic information set in a video repository.
< apparatus embodiment >
Corresponding to the above method embodiments, in this embodiment, an electronic device is further provided, which may include the video structured representation apparatus 4000 according to any embodiment of the present disclosure, for implementing the method according to any embodiment of the present disclosure.
As shown in fig. 5, the electronic device 5000 may further include a processor 5200 and a memory 5100, the memory 5100 being configured to store executable instructions; the processor 5200 is configured to operate the electronic device to perform a method according to any embodiment of the present disclosure, as controlled by the instructions.
The various modules of apparatus 4000 above may be implemented by processor 5200 executing the instructions to perform a method according to any of the embodiments of the present disclosure.
< media examples >
Corresponding to the above method embodiments, in this embodiment there is further provided a computer-readable storage medium storing a computer program that can be read and executed by a computer and that, when read and executed by the computer, performs the method according to any of the above embodiments of the present disclosure.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized using state information of the computer-readable program instructions and can execute the computer-readable program instructions, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A method for structured representation of video, comprising:
receiving a video object to be processed;
according to the video object, obtaining a first video frame set and a second video frame set, wherein the first video frame set is obtained by splitting the video object by using a preset rule, and video frames in the second video frame set are key frames in the video object;
and obtaining the structural representation information of the video object according to the first video frame set and the second video frame set.
2. The method according to claim 1, wherein obtaining the structured representation information of the video object according to the first set of video frames and the second set of video frames comprises:
obtaining a first semantic information set according to the first video frame set, wherein the first semantic information set comprises semantic information respectively corresponding to video frames in the first video frame set;
according to the second video frame set, obtaining video semantic information used for describing the video object;
and obtaining the structural representation information according to the first semantic information set and the video semantic information.
3. The method of claim 2, wherein obtaining a first set of semantic information from the first set of video frames comprises:
obtaining a first feature information set according to the first video frame set, wherein the first feature information set comprises feature information respectively corresponding to video frames in the first video frame set;
according to the first feature information set, performing semantic analysis processing on the video frames in the first video frame set to obtain the first semantic information set.
4. The method according to claim 2, wherein said obtaining video semantic information describing the video object according to the second set of video frames comprises:
obtaining a second feature information set according to the second video frame set, wherein the second feature information set comprises feature information respectively corresponding to video frames in the second video frame set;
according to the second feature information set, performing semantic analysis processing on the video frames in the second video frame set to obtain a second semantic information set;
and obtaining the video semantic information according to the second semantic information set.
5. The method of claim 2, wherein after obtaining the structured representation information, the method further comprises:
and correspondingly storing the video object, the video semantic information, the first video frame set and the first semantic information set into a video resource library.
6. The method of claim 1, wherein the first set of video frames is obtained by:
and splitting the video object according to the preset splitting duration as a step length to obtain the first video frame set.
7. The method of claim 1, wherein the first set of video frames is obtained by: acquiring the total duration of the video object;
obtaining a target splitting duration for splitting the video object according to the total duration;
and splitting the video object according to the target splitting duration as a step length to obtain the first video frame set.
8. A video structured representation apparatus, comprising:
the video object receiving module is used for receiving a video object to be processed;
the video object splitting module is used for obtaining a first video frame set and a second video frame set according to the video object, wherein the first video frame set is obtained by splitting the video object by using a preset rule, and video frames in the second video frame set are key frames in the video object;
and the structural representation information obtaining module is used for obtaining the structural representation information of the video object according to the first video frame set and the second video frame set.
9. An electronic device comprising the apparatus of claim 8; alternatively,
the electronic device includes:
a memory for storing executable instructions;
a processor, configured to execute the instructions to control the electronic device to perform the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which is readable and executable by a computer, the computer program being adapted to perform the method according to any one of claims 1-7 when read and executed by the computer.
CN202011149590.6A (CN114501165A, pending), priority date 2020-10-23, filing date 2020-10-23: Video structured representation method and device and electronic equipment

Priority Applications (1)

Application Number: CN202011149590.6A (CN114501165A)
Priority Date / Filing Date: 2020-10-23
Title: Video structured representation method and device and electronic equipment


Publications (1)

Publication Number: CN114501165A (en)
Publication Date: 2022-05-13

Family

ID=81471211

Family Applications (1)

Application Number: CN202011149590.6A (CN114501165A, pending)
Title: Video structured representation method and device and electronic equipment
Priority Date / Filing Date: 2020-10-23

Country Status (1)

Country Link
CN (1) CN114501165A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030058268A1 (en) * 2001-08-09 2003-03-27 Eastman Kodak Company Video structuring by probabilistic merging of video segments
CN101872346A (en) * 2009-04-22 2010-10-27 中国科学院自动化研究所 Method for generating video navigation system automatically
US20170337271A1 (en) * 2016-05-17 2017-11-23 Intel Corporation Visual search and retrieval using semantic information
CN108200390A (en) * 2017-12-28 2018-06-22 北京陌上花科技有限公司 Video structure analyzing method and device
CN109359214A (en) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 Video presentation generation method, storage medium and terminal device neural network based
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
US20200125600A1 (en) * 2018-10-19 2020-04-23 Geun Sik Jo Automatic creation of metadata for video contents by in cooperating video and script data


Similar Documents

Publication Publication Date Title
US11023716B2 (en) Method and device for generating stickers
WO2019042341A1 (en) Video editing method and device
CN104994401A (en) Barrage processing method, device and system
CN111263186A (en) Video generation, playing, searching and processing method, device and storage medium
CN109815448B (en) Slide generation method and device
CN108449255B (en) Comment interaction method and equipment, client device and electronic equipment
CN110245334B (en) Method and device for outputting information
CN111381819B (en) List creation method and device, electronic equipment and computer-readable storage medium
CN111177462A (en) Method and device for determining video distribution timeliness
CN110910178A (en) Method and device for generating advertisement
CN110688844A (en) Text labeling method and device
CN110414625B (en) Method and device for determining similar data, electronic equipment and storage medium
CN114501165A (en) Video structured representation method and device and electronic equipment
US11490170B2 (en) Method for processing video, electronic device, and storage medium
CN115801980A (en) Video generation method and device
CN112672202B (en) Bullet screen processing method, equipment and storage medium
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN111027332B (en) Method and device for generating translation model
CN110909185B (en) Intelligent broadcast television program production method and device
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113221554A (en) Text processing method and device, electronic equipment and storage medium
CN112395844A (en) Pinyin generation method and device and electronic equipment
CN111723177A (en) Modeling method and device of information extraction model and electronic equipment
CN111914120A (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN111753177A (en) Personalized recommendation method and device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220513)