CN112163560A - Video information processing method and device, electronic equipment and storage medium

Info

Publication number
CN112163560A
Authority
CN
China
Prior art keywords
video
target
information
text
label
Prior art date
Legal status
Granted
Application number
CN202011141537.1A
Other languages
Chinese (zh)
Other versions
CN112163560B (en)
Inventor
俞一鹏
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011141537.1A
Publication of CN112163560A
Application granted
Publication of CN112163560B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video information processing method, which comprises the following steps: acquiring a target video and processing it to form a multi-level video tag matched with the target video; determining a corresponding text generation framework in response to the target video and the multi-level video tag matched with it; and generating, through the text generation framework, text description information matched with the target video. The invention also provides a video information processing apparatus, an electronic device, and a storage medium. The method and apparatus can describe the content of the target video in natural language, enabling non-technical personnel to convert video information content into corresponding natural-language text promptly and accurately, without configuring the high-quality training samples that a neural network model requires in the traditional technology, thereby effectively improving the sharing speed of video information content.

Description

Video information processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to information processing technologies, and in particular, to a video information processing method and apparatus, an electronic device, and a storage medium.
Background
With the explosive growth of demand for multimedia information, traditional information processing technology cannot meet the requirements of tasks such as labeling and describing multimedia data. In the related art, before a video is shown to a user, a description text generally has to be authored in advance to summarize the events occurring in the video and then added to it; for example, in an e-sports video playing scenario, an added title (e.g., "a virtual character scores consecutive kills") helps the user grasp the video content promptly and accurately. However, the related art generates the text description information of a video through an artificial-intelligence neural network model, which not only requires configuring high-quality training samples for the model but also requires specialized technical personnel to supervise its training and deployment, which is inconvenient for ordinary users. Meanwhile, the text descriptions generated by the neural network model tend to be rigid, which harms the user experience and slows the sharing of video information content.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video information processing method and apparatus, an electronic device, and a storage medium, which can generate text information matched with a target video and describe the content of the target video in natural language, saving the time of manually processing video information and arousing users' interest in watching. At the same time, there is no need to configure the high-quality training samples that a neural network model requires in the traditional technology, which effectively improves the sharing speed of video information content.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a video information processing method, which comprises the following steps:
acquiring a target video, and processing the target video to form a multi-level video tag matched with the target video;
determining a corresponding text generation framework in response to the target video and the multi-level video label matched with the target video, wherein the text generation framework is composed of a conditional probability combination and a pseudo-terminator set;
and generating text description information matched with the target video through the text generation framework.
An embodiment of the present invention further provides a video information processing apparatus, where the apparatus includes:
the information transmission module is used for acquiring a target video and processing the target video to form a multi-level video label matched with the target video;
the information processing module is used for determining a corresponding text generation framework in response to the target video and the multi-level video label matched with the target video, wherein the text generation framework is composed of a conditional probability combination and a pseudo-terminator set;
and the information processing module is used for generating text description information matched with the target video through the text generation framework.
In the above scheme:
the information processing module is used for acquiring an original video and determining the time sequence information of the original video;
the information processing module is used for determining a playing time parameter and a storage position parameter corresponding to the original video according to the time sequence information of the original video;
and the information processing module is used for extracting a target video from the original video based on the playing time parameter and the storage position parameter corresponding to the original video.
In the above scheme:
the information processing module is used for carrying out optical character recognition on text information in a video frame of the target video based on a feature matching network in a video processing model to obtain an optical character recognition text corresponding to the video frame;
the information processing module is used for determining an event label matched with the target video according to the optical character recognition text;
the information processing module is used for carrying out object recognition processing on image information in a video frame of the target video based on an object recognition network in the video processing model and determining a target object label matched with the target video;
the information processing module is used for classifying the image information in the video frame of the target video based on the image classification network in the video processing model and determining the picture category label matched with the target video;
and the information processing module is used for determining a multi-level video label matched with the target video based on the association relations among the event label, the target object label, and the picture category label.
In the above scheme:
the information processing module is used for determining a video time parameter matched with the event label in the original video based on the event label matched with the target video;
the information processing module is used for extracting corresponding video frames from the original video again based on the video time parameters;
and the information processing module is used for forming a target video matched with the event label based on the extracted video frame.
In the above scheme:
the information processing module is used for determining that the event label matched with the target video is a first type label;
the information processing module is used for determining that a target object label and a picture type label which are matched with the target video are second type labels;
and the information processing module is used for hierarchically combining the first type label and the second type labels, based on the association relations among the event label, the target object label, and the picture category label, to form a multi-level video label matched with the target video.
In the above scheme:
the information processing module is used for configuring a corresponding start symbol for the text generation framework;
the information processing module is used for determining a conditional probability parameter corresponding to the video label based on the video label matched with the target video;
the information processing module is used for configuring a corresponding finite set of non-terminators for the text generation framework;
the information processing module is used for determining a finite set of pseudo-terminators in the text generation framework based on the conditional probability parameter;
the information processing module is configured to determine a finite set of derivation non-terminator subsets in the text generation framework based on the conditional probability parameter.
In the above scheme:
the information processing module is used for determining the input video labels matched with the text generation framework;
the information processing module is used for performing first conditional probability matching processing on the input video labels based on the finite set of derivation non-terminator subsets in the text generation framework to form first candidate text information;
the information processing module is used for performing second conditional probability matching processing on the input video labels based on the finite set of pseudo-terminators in the text generation framework to form second candidate text information;
and the information processing module is used for generating text description information matched with the target video by combining the first candidate text information and the second candidate text information.
In the above scheme:
the information processing module is used for determining a corresponding text selection probability based on the input video labels;
the information processing module is used for determining a first conditional Boolean value based on the label of the target video;
and the information processing module is used for performing first conditional probability matching processing on a first type label among the input video labels, based on the text selection probability and the first conditional Boolean value, to form first candidate text information.
In the above scheme:
the information processing module is used for determining a second conditional Boolean value based on the label of the target video and the first candidate text information;
and the information processing module is used for performing second conditional probability matching processing on a second type label among the input video labels according to the second conditional Boolean value, through the finite set of pseudo-terminators in the text generation framework, to form second candidate text information.
In the above scheme:
the information processing module is used for determining the portrait information of the target object when the target video is a game-type video;
the information processing module is configured to determine, based on the portrait information of the target object, the input video labels matched with the text generation framework, wherein the portrait information of the target object includes at least one of the skill, weapon, action, gender, character, lines, and skin of the target object in the game-type video.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and a processor, configured to implement the aforementioned video information processing method when running the executable instructions stored in the memory.
The embodiment of the invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the aforementioned video information processing method.
The embodiment of the invention has the following beneficial effects:
According to the embodiments of the invention, a target video is acquired and processed to form a multi-level video label matched with it; a corresponding text generation framework, composed of a conditional probability combination and a pseudo-terminator set, is determined in response to the target video and its matched multi-level video label; and text description information matched with the target video is generated through the text generation framework. The content of the target video can thus be described in natural language, enabling non-technical personnel to convert video information content into corresponding natural-language text promptly and accurately, without configuring the high-quality training samples that a neural network model requires in the traditional technology. This effectively improves the sharing speed of video information content, enlarges the sharing scenarios of video information content, and improves the user experience.
Drawings
Fig. 1 is a schematic view of a usage scenario of a video information processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating a configuration of a video information processing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an alternative video information processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of video tag generation according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of an alternative video information processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a data structure of a video information processing method according to the present invention;
fig. 7 is a schematic diagram of an optional effect of the video information processing method according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an optional effect of the video information processing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by persons skilled in the art without inventive work shall fall within the scope of protection of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) "In response to" indicates the condition or state on which a performed operation depends; when the condition or state it depends on is satisfied, one or more of the performed operations may occur in real time or with a set delay. Unless otherwise specified, there is no restriction on the execution order of the operations performed.
2) Original video: video information in various forms available on the Internet, such as video files and multimedia information presented in a client or an intelligent device.
Target video: a video clip with corresponding labels, extracted from the original video.
3) Hidden Markov Model (HMM): a statistical model that describes a Markov process with hidden, unknown parameters. In a hidden Markov model, the states are not directly visible, but some variables affected by the states are visible. The states are the basic components of the HMM; the transition probabilities of the HMM represent the probabilities of transitions occurring between its states; and each state has a probability distribution over the symbols that may be output, i.e., the output probabilities of the HMM. A Markov process is a memoryless stochastic process: given the current state and all past states, the conditional probability distribution of its future states depends only on the current state.
4) Gaussian Mixture Model (GMM): a model that accurately quantizes an object using Gaussian probability density functions (normal distribution curves) and decomposes the object into several components, each formed by a Gaussian probability density function.
5) Convolutional Neural Network (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning. A convolutional neural network has representation learning capability and can perform shift-invariant classification of input information according to its hierarchical structure.
6) Optical Character Recognition (OCR): a technology that converts the characters of bills, newspapers, books, manuscripts, and other printed matter into image information by optical input such as scanning, and then converts the image information into computer-usable text by character recognition.
Fig. 1 is a schematic view of a usage scenario of the video information processing method according to an embodiment of the present invention. Referring to fig. 1, the terminals (including a terminal 10-1 and a terminal 10-2) are provided with a client capable of displaying the corresponding target video, such as a video playing client or plug-in; a user can obtain and display a target video (which may be a short game video from different live platforms) through the corresponding client. The terminals are connected to the server 200 through a network 300, which may be a wide area network, a local area network, or a combination of the two, and which uses wireless links for data transmission.
As an example, the server 200 is used for deploying the video information processing apparatus to implement the video information processing method provided by the present invention: a target video is acquired and processed to form a multi-level video tag matched with it; a corresponding text generation framework, composed of a conditional probability combination and a pseudo-terminator set, is determined in response to the target video and its matched multi-level video tag; and text description information matched with the target video is generated through the text generation framework. The text description information describes the content of the target video in natural language and is displayed and output by the terminals (terminal 10-1 and/or terminal 10-2).
Of course, the information processing apparatus provided by the present invention can be applied to video playing, where target videos from different data sources are usually processed and the text information matched with each target video is finally presented on a User Interface (UI). The background database for video playing receives a large amount of video data from different sources every day, and the obtained text description information matched with a target video can be invoked by other application programs, for example in the video recommendation processes of web pages, applets, or short-video clients.
The structure of the video information processing apparatus according to the embodiment of the present invention is described in detail below. The apparatus can be implemented in various forms, such as a dedicated terminal with video information processing functions, or a server provided with such functions, for example the server 200 in fig. 1. Fig. 2 is a schematic diagram of the composition structure of the video information processing apparatus according to an embodiment of the present invention. It should be understood that fig. 2 only shows an exemplary structure, not the whole structure, and that part or all of the structure shown in fig. 2 may be implemented as needed.
The video information processing device provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the video information processing apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components of the connection. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key press, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the video information processing apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software. By way of example, it may be a processor in the form of a hardware decoding processor programmed to execute the video information processing method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of the video information processing apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the video information processing apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and completes the video information processing method provided by the embodiment of the present invention in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205).
By way of example, the processor 201 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, discrete gate or transistor logic, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
As an example of the video information processing apparatus provided by the embodiment of the present invention being implemented purely in hardware, the apparatus may be implemented directly with the processor 201 in the form of a hardware decoding processor, for example executed by one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components, to implement the video information processing method provided by the embodiment of the present invention.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the video information processing apparatus. Examples of such data include any executable instructions for operating on the video information processing apparatus, such as a program implementing the video information processing method of the embodiment of the present invention.
In other embodiments, the video information processing apparatus provided by the embodiment of the present invention may be implemented in software. Fig. 2 shows the video information processing apparatus stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, comprising a series of modules. As an example of the program stored in the memory 202, the apparatus may include the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the video information processing apparatus are read into RAM by the processor 201 and executed, the video information processing method provided by the embodiment of the present invention is implemented. The functions of each software module in the apparatus include:
the information transmission module 2081, which is used for obtaining a target video;
the information processing module 2082 is used for processing the target video to form a multi-level video tag matched with the target video;
the information processing module 2082 is configured to determine a corresponding text generation framework in response to the target video and a multi-level video tag matching with the target video, where the text generation framework is composed of a conditional probability combination and a pseudo terminator set;
the information processing module 2082 is configured to generate text description information matched with the target video through the text generation framework.
According to the electronic device shown in fig. 2, in one aspect of the present application, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute different embodiments and combinations of embodiments provided in various alternative implementations of the video information processing method.
The video information processing method provided by the embodiment of the present invention is described with reference to the video information processing apparatus shown in fig. 2, taking a game video as an example. To make the game video more attractive and improve the user experience, the game video may be further processed (for example, by dubbing, inserting clips, adding special effects, or configuring subtitles).
The dedicated terminal provided with the video information processing apparatus may be the electronic device of the embodiment shown in fig. 2. The following describes the steps shown in fig. 3.
Step 301: the video information processing apparatus acquires an original video, and acquires a target video based on the acquired original video.
In some embodiments of the present invention, taking a game video as an example: because the playing time of the original video is long and it contains footage of different target objects' operations at different stages of the game process, a playing duration parameter and a storage location parameter corresponding to the target video may be determined according to the timing information of the target video, and the target video corresponding to that timing information then extracted from the original video based on these two parameters. Specifically, suppose the operation of the target object in the first stage of the game is recorded in the target video whose timing information runs from 2 minutes 05 seconds to 2 minutes 35 seconds (a total duration of 30 seconds). The frame numbers of the video frames to be extracted can be determined from the playing duration, and the corresponding storage location numbers from those frame numbers. For example, if the original video contains video frames 1 through 10, the frame numbers to extract are determined from the playing duration and the corresponding storage locations are located; video frame 1, video frame 2, video frame 3, and video frame 4 can then be extracted in sequence from those storage locations, and the extracted frames 1 to 4 form the target video corresponding to the timing information.
Because the sources of the original video are diverse (it may be a video resource on the Internet or a local video file stored on the electronic device), the corresponding target video can be accurately acquired from the corresponding cloud server network by obtaining the playing duration parameter and storage location parameter corresponding to the target video.
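As an illustration only, the following minimal sketch shows how a clip could be cut out of an original video by its timing information. It assumes OpenCV is available, and the 2:05 to 2:35 boundaries mirror the example above rather than anything fixed by the patent.

```python
# Minimal sketch: extract the target video from the original video using the
# playing-duration parameter (start/end seconds) and the frame positions
# derived from it. Assumes OpenCV; path and times are illustrative.
import cv2

def extract_target_clip(original_path: str, start_s: float, end_s: float) -> list:
    cap = cv2.VideoCapture(original_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    first, last = int(start_s * fps), int(end_s * fps)  # frame numbers to extract
    cap.set(cv2.CAP_PROP_POS_FRAMES, first)             # jump to the storage position
    frames = []
    for _ in range(first, last):
        ok, frame = cap.read()
        if not ok:                                      # original video ended early
            break
        frames.append(frame)
    cap.release()
    return frames

clip = extract_target_clip("original_game_video.mp4", 125.0, 155.0)  # 2:05 to 2:35
```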
Step 302: the video information processing apparatus processes the target video to obtain a multi-level video label matched with the target video.
In some embodiments of the present invention, processing the extracted different target videos based on the video processing model to form video tags matching the target videos may be implemented by:
carrying out optical character recognition on the text information in the video frames of the target video to obtain recognition text corresponding to the video frames; determining an event label matched with the target video according to the recognition text; and carrying out object recognition processing on the image information in the video frames of the target video to determine a target object label matched with the target video. Because there are many target objects in a game video, tens of target objects (for example, different game characters) can appear in the same video frame at the same time; the target objects appearing in a video frame can therefore be recognized through the object recognition network, and the target object labels matched with the target video acquired. The deep neural network used for object position recognition can be the region-based Faster R-CNN, the Single Shot MultiBox Detector (SSD), a YOLO-series neural network, or the like, and the deep neural network used for object category recognition can include VGG networks, ResNet-series networks, and Inception-series networks, for example to recognize skill-type and intelligence-type target objects in the game process, or to recognize male and female target objects in a role-playing game. Furthermore, the recognition of target objects appearing in a video frame can be adjusted according to the game process; for example, a pre-trained VGG network can be selected to recognize video frames with fewer than ten target objects, adapting to the number of target objects in the game process and improving the generation efficiency of the text information.
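A minimal sketch of the object-recognition step follows, using torchvision's region-based Faster R-CNN purely as an illustrative stand-in for the detector families named above; the score threshold and the downstream mapping from label ids to character names are assumptions.

```python
# Sketch: detect target objects (e.g. game characters) in a frame with a
# region-based detector. torchvision's Faster R-CNN stands in for the detector;
# mapping label ids to character names is left to a downstream lookup table.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_target_objects(frame_tensor: torch.Tensor, score_thresh: float = 0.7):
    """frame_tensor: 3xHxW float image scaled to [0, 1]."""
    out = detector([frame_tensor])[0]            # dict with boxes, labels, scores
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep]
```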
Classifying the image information in the video frames of the target video based on the image classification network in the video processing model, and determining the picture category label matched with the target video: the picture category label may describe the picture content of the current game event in the frames of an e-sports video. For example, the event label "double kill" may further contain picture labels of the next level, which may include the name of the killing character, the name of the killed character, the kill skill name, the kill time, and the like.
And determining the multi-level video label matched with the target video according to the association relations among the event label, the target object label, and the picture category label.
Specifically, because the events in game videos or animation videos differ, the corresponding target objects also differ; videos of different types of games or animations can therefore be processed through the pre-trained video processing model, which is convenient for non-technical personnel and simplifies the process of generating text description information. The pre-trained video processing model may include a feature matching network, an object recognition network, and an image classification network. During the game process, different user operations form different target videos, and text information is presented in the video frames of the target video, so the text information in the video frames can be accurately recognized by optical character recognition through the feature matching network to obtain the recognition text corresponding to the target video frames.
In some embodiments of the present invention, because the number of target objects in a game video is large, the extracted target video may have too few frames or missing frames, which affects the accuracy of the generated text description information and the viewing experience. Therefore, the video time parameter matching the event tag in the original video may be determined based on the event tag matched with the target video, and video frames re-extracted from the original video based on that time parameter to form a new target video. For example: suppose video frames 1 to 4 have been extracted and combined into the target video corresponding to the timing information, and the event tag is determined to be a "triple kill" event of a role-playing game, but the video time parameter matched with that event tag in the original video covers video frames 1 to 6. Video frames 5 and 6 can then be extracted from the original video again and combined with the already-extracted frames 1 to 4 to form the target video, which both ensures the accuracy of the text description information and preserves the integrity of the game video the user watches.
Referring to fig. 4, fig. 4 is a schematic diagram of video tag generation in an embodiment of the present invention, taking as an example the processing of information in a game video to form a corresponding text description, where the operation effect of a game player is conveyed by the broadcast information of the in-game interface. As shown in fig. 4, different game operations by the user produce different in-game effects presented on the game interface, and these effects are also presented as character images within the video frames. The characters in the broadcast can therefore be matched through the feature matching network: using OCR, the text information presented in image form in the video frames of the game video is converted into recognizable text feature vectors, from which the corresponding event labels are extracted. The OCR processing includes: first, performing text analysis on the optical character recognition text to obtain the single-line texts it contains; respectively obtaining the text identifications corresponding to the single-line texts, where a text identification identifies its corresponding single-line text; and acquiring, based on the text identifications, the single-line target text corresponding to each single-line text in the target-language text, taking each acquired single-line target text as the target text, and extracting the corresponding event label by processing the target text. For example, the event labels that can be extracted from a game video include "five kill", "four kill", "three kill", "double kill", "single kill", and the like; these can equally be obtained by passing the game video through the feature matching network in the pre-trained video processing model.
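A sketch of that single-line post-processing follows, with the OCR engine mocked out as a hypothetical run_ocr helper (the patent does not fix an OCR interface), and the English event words standing in for the original label set.

```python
# Sketch: split OCR output into single-line texts, key each line by a text
# identification, and extract event labels by matching known event words.
# run_ocr is a hypothetical stand-in for the feature-matching / OCR step.
import re

EVENT_WORDS = ["five kill", "four kill", "three kill", "double kill", "single kill"]

def extract_event_labels(frame) -> list:
    text = run_ocr(frame)                                 # hypothetical OCR call
    lines = {f"line_{i}": ln for i, ln in enumerate(text.splitlines())}
    labels = []
    for line_id, line in lines.items():                   # one pass per single-line text
        for word in EVENT_WORDS:
            if re.search(word, line, re.IGNORECASE):
                labels.append(word)
    return labels
```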
Taking fig. 4 as an example: in processing the game video, the health-bar display in fig. 4 can be matched by template matching to obtain the position of the target object (the main game character). The picture of the area below the health bar is then extracted as the target area and combined with the image of the skill area, and classification and recognition are performed using a convolutional neural network as the object recognition network. This determines which game character the main character (target object) is, yielding the target object label, i.e. the game character name label in the current game video, such as "Hou Yi", "Sun Shangxiang", "Li Bai", or "Dian Wei".
Furthermore, classification judgment can be performed on game pictures through the convolutional neural network in the image classification network. Processing the video frames of the target video through the image classification network yields the image feature vectors to be recognized, from which the position of the target object in the video frame image can be determined, for example whether the target object in the game video is under a tower or in a bush; picture category labels such as "tower-dive kill" or "bush ambush" can thus be obtained. From the various detected labels and the time information of the detected pictures within the video, various highlight video clips can be quickly extracted and labeled. When determining the picture category label, the image classification network can use a MobileNetV3 network, which achieves an image classification network suitable for terminal-device deployment with a small number of model parameters. Its architecture is based on network structure search and balances runtime performance and classification precision, so the final architecture can meet the performance and precision requirements of terminal deployment. Elements such as scenes, characters, objects, and identifiers in the images of different game videos (corresponding to different game processes) can thereby be recognized, so that the health value, bullet count, and weapon types and positions of target objects in the game video can be determined, and the picture category label determined from the recognition results of the different elements.
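A minimal sketch of such a picture-category classifier follows, using torchvision's MobileNetV3 as the text suggests; the class list and the fine-tuning that would produce sensible weights are assumptions.

```python
# Sketch: MobileNetV3-based picture-category classifier. The four classes
# below are illustrative placeholders, not the patent's label set.
import torch
from torchvision import models

CLASSES = ["under_tower", "in_bush", "tower_dive_kill", "other"]

model = models.mobilenet_v3_small(weights="DEFAULT")
model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, len(CLASSES))
model.eval()  # a real system would first fine-tune on labelled game frames

@torch.no_grad()
def picture_category(frame_tensor: torch.Tensor) -> str:
    """frame_tensor: 1x3xHxW, normalised as the backbone expects."""
    return CLASSES[int(model(frame_tensor).argmax(dim=1))]
```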
In some embodiments of the present invention, the sources of game videos are various (they may include long or short videos), and the target objects and picture types contained in different game types also vary; however, the different target object labels and picture types can all establish an association relation with the event label. Accordingly, the event label matched with the target video is first determined to be the first type label; the target object label and picture category label matched with the target video are then determined to be second type labels; and, based on the association relations among the event label, target object label, and picture category label, the first type label and second type labels are combined hierarchically. The hierarchical combination may represent the part-whole hierarchy of objects as a tree structure (i.e., the first type label and second type labels are combined according to their association relations into a tree), forming the multi-level video label matched with the target video. It should be noted that the number of first type labels is 1 and the number of second type labels is at least 1. For example, for a WeChat-applet game, since the game logic is simple, the corresponding game video contains only the event label as the first type label and the picture category label as the second type label.
When the game video is of a role-playing game with complex logic, the event tag can be used as the first type label and the other tags as second type labels; that is, the event tag is the main label, each extracted video segment (short video) in a game video has exactly 1 event tag, and 1 event tag corresponds to multiple attribute tags. For example: if the event label is "five kill", the object labels are "game character A" and "game character B", and the picture label is "tower-dive kill", then "five kill" as the first-level label can correspond to multiple attribute labels, including "game character A", "game character B", and "tower-dive kill". A sketch of this combination follows.
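A minimal sketch of the hierarchical combination under the constraints just stated (exactly one first type label, at least one second type label); the dictionary shape is an assumption, not the patent's data format.

```python
# Sketch: combine exactly one first type label (the event) with at least one
# second type label (target objects and picture categories) into a tree.
def combine_labels(event_label: str, object_labels: list, picture_labels: list) -> dict:
    attributes = object_labels + picture_labels
    if not attributes:
        raise ValueError("at least one second type label is required")
    return {"event": event_label, "attributes": attributes}

multilevel_tag = combine_labels(
    "five kill",
    ["game character A", "game character B"],
    ["tower-dive kill"],
)
# {'event': 'five kill',
#  'attributes': ['game character A', 'game character B', 'tower-dive kill']}
```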
Step 303: determining a corresponding text generation framework in response to the target video and the multi-level video tag matched with the target video.
In some embodiments of the present invention, determining a text generation framework corresponding to the video processing model according to the video tag matched with the target video may be implemented by:
configuring a corresponding start symbol for the text generation framework; determining a conditional probability parameter corresponding to the video tag based on the video tag matched with the target video; configuring a corresponding finite set of non-terminators for the text generation framework; determining a finite set of pseudo-terminators in the text generation framework based on the conditional probability parameter; and determining a finite set of derivation non-terminator subsets in the text generation framework based on the conditional probability parameter. Specifically, the text description information generated by the neural network model of the traditional technology is often too mechanical to arouse users' interest in watching the target video, and specialized technical personnel are required to supervise the model's training and deployment, which is inconvenient for ordinary users; text description information can instead be generated efficiently, and the target video described vividly, by configuring a corresponding text generation framework. The text generation framework provided by the present application can be represented by a quadruple G = (N, Σ, R, S), whose components, from right to left, are:
S ∈ N is the unique start symbol; R is the finite set of derivations α; Σ is the finite set of pseudo-terminators; and N is the finite set of non-terminators. These are described separately below.
The quadruple G = (N, Σ, R, S) is a framework structure formed by pseudo-instruction information and is used to form text description information from the corresponding pseudo-instruction information and the different label information. A pseudo-instruction is an instruction used to tell an assembler how to assemble: pseudo-instructions neither control the operation of the machine nor are assembled into machine code; they are only recognized by the assembler and direct how assembly should proceed. A finite set is a set consisting of finitely many elements. R is the finite set of derivations; Σ is the finite set of pseudo-terminators; N is the finite set of non-terminators; the sets of different pseudo-instruction information in the framework structure therefore each contain finitely many elements, adapted to the environment in which the text description information is generated.
Based on this framework, derivation starts from the start symbol S, and text descriptions of the video are generated probabilistically conditioned on the input multi-level video labels to form the text description information. Through different combinations of information, a different text description of the same target video can be generated each time, providing more optional text description information to recommend as the title of the target video and attract different users to watch it. Furthermore, the contents of the middle two components of the quadruple can be flexibly adjusted according to the game environment or the content characteristics of the video: for an applet game, whose content is simple, a simple finite set of pseudo-terminators λ can be configured, while for role-playing games or other game types with multiple event tags, a complex finite set of pseudo-terminators λ can be configured to form more accurate text description information.
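As an illustration of how the quadruple G = (N, Σ, R, S) might be held in memory, a minimal sketch follows; the field names and the use of Python callables for the conditional Boolean values are assumptions, not the patent's notation.

```python
# Sketch: in-memory form of the text generation framework G = (N, Sigma, R, S).
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

Context = dict  # input video tags plus any text generated so far

@dataclass
class Derivation:                      # one beta sequence on the right of alpha
    rhs: List[str]                     # symbols: non-terminators or pseudo-terminators
    q: float                           # selection probability
    d: Optional[Callable[[Context], bool]] = None  # first conditional Boolean value

@dataclass
class PseudoTerminal:                  # one lambda_i candidate word/phrase/sentence
    text: str
    p: float                           # selection probability
    c: Optional[Callable[[Context], bool]] = None  # second conditional Boolean value

@dataclass
class Framework:
    N: Set[str]                                  # finite set of non-terminators
    Sigma: Dict[str, List[PseudoTerminal]]       # finite set of pseudo-terminators
    R: Dict[str, List[Derivation]]               # finite set of derivations
    S: str = "S"                                 # unique start symbol, S in N
```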
In some embodiments of the present invention, generating the text description information matched with the target video through the text generation framework may be implemented by:
determining the input video tags matched with the text generation framework; performing first conditional probability matching processing on the input video tags, based on the finite set of derivation non-terminator subsets in the text generation framework, to form first candidate text information; performing second conditional probability matching processing on the input video tags, based on the finite set of pseudo-terminators in the text generation framework, to form second candidate text information; and generating the text description information matched with the target video by combining the first and second candidate text information. The input video tags matched with the text generation framework can be selected from the multi-level video tags; since target videos differ in duration, the number of multi-level video tags obtained also differs. While the game process is running, the complete game video can be captured by screen recording and used as the original video, from which different short game videos are extracted. For a short game video, whose duration is limited to within 30 seconds, at least two video tags of different levels can be selected from the obtained multi-level video tags as the corresponding input video tags. In some embodiments of the present invention, when processing a long game video whose duration exceeds five minutes, all the multi-level video tags matched with the target video can be used as the input video tags, so that the generated text description information lets a viewer determine the content of the currently playing game video and obtain a better viewing experience. Continuing with the quadruple G = (N, Σ, R, S) structure: R is the finite set of derivations α,
α → β1[q1] | β2[q2] | ... | βn[qn], n ≥ 1, t ≥ 1, where α ∈ N and β ∈ (N ∪ Σ); qi is the probability that each β sequence is chosen, with q1 + q2 + ... + qn = 1. The larger qi is, the higher the possibility that the corresponding β sequence is selected. d is a first conditional Boolean value, optionally context-aware, determined according to the input video tag: if di exists and di is False, the corresponding β sequence is deleted and its probability value qi is evenly redistributed to the other β sequences. Each β sequence must conform to a syntactic structure of the Chinese text description; the condition d ensures that the sentence structures and the combinations between sentences are more diverse.
Second candidate text information is then determined according to the second type labels among the input video tags. It can be formed by performing, through the finite set of pseudo-terminators in the text generation framework, second conditional probability matching processing on the second type labels according to the second conditional Boolean value. Continuing with the quadruple structure as an example, Σ is the finite set of pseudo-terminators λ:
λ → [c1]λ1[p1] | [c2]λ2[p2] | ... | [cm]λm[pm], m ≥ 1.
λi is a specific text, which may be a word, a phrase, or a sentence, such as the labels "laser sight", "red dot sight", and "holographic sight" among the second type labels, which indicate the way the target object aims in the game. pi is the probability that each λi is selected, with p1 + p2 + ... + pm = 1; the larger pi is, the greater the likelihood that λi is selected. c is an optional context-aware conditional Boolean value, i.e. the second conditional Boolean value, which depends on the input video tag and the previously generated first candidate text: if ci exists and ci is False, λi is deleted and its probability value pi is evenly redistributed to the other candidates. The condition c guarantees the diversity of the text and the accuracy of the text collocation.
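A minimal sketch of this selection rule, which is shared by the derivations (β, q, d) and the pseudo-terminators (λ, p, c): candidates whose condition is False are dropped and their probability mass is evenly redistributed before sampling. The triple layout and the example tags are assumptions for illustration.

```python
# Sketch: conditional-probability matching. Each candidate is
# (text, probability, condition); a False condition removes the candidate
# and its probability is shared evenly among the survivors.
import random
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float, Optional[Callable[[dict], bool]]]

def conditional_choice(candidates: List[Candidate], context: dict) -> str:
    alive, removed_mass = [], 0.0
    for text, prob, cond in candidates:
        if cond is not None and not cond(context):
            removed_mass += prob              # condition False: delete candidate
        else:
            alive.append((text, prob))
    if not alive:
        raise ValueError("every candidate was deleted by its condition")
    share = removed_mass / len(alive)         # even redistribution of the mass
    texts = [t for t, _ in alive]
    weights = [p + share for _, p in alive]
    return random.choices(texts, weights=weights, k=1)[0]

# Usage: choose a sight word, one option allowed only when the tags mention a rifle.
tags = {"event": "five kill", "weapon": "rifle"}
sights: List[Candidate] = [
    ("laser sight", 0.3, None),
    ("red dot sight", 0.4, None),
    ("holographic sight", 0.3, lambda ctx: ctx.get("weapon") == "rifle"),
]
print(conditional_choice(sights, tags))
```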
In some embodiments of the present invention, in order to process game videos of different game types with the video information processing method provided by the present application, the word choices for λ may be selected according to the portraits of the description objects of the corresponding game type, so that the generated text description information follows the same language habits as the target objects (game characters) of that game type, and the generated text descriptions of the game videos are also diverse. In some embodiments of the invention, the primary target objects in the images of a game video are the different game characters, and the invention also builds portraits of the game characters. The portrait of a target object may include information related to its skills, weapons, actions, gender, character, lines, and skin.
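A small sketch of such a character portrait, with the fields taken from the list above; the concrete values are invented placeholders.

```python
# Sketch: a game-character portrait used to bias the word choices for lambda.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CharacterPortrait:
    skills: List[str] = field(default_factory=list)
    weapons: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    gender: str = ""
    character: str = ""                               # personality traits
    lines: List[str] = field(default_factory=list)    # signature speech
    skins: List[str] = field(default_factory=list)

portrait = CharacterPortrait(
    skills=["piercing shot"], weapons=["bow"], gender="female",
    lines=["Target locked!"],
)
```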
Step 304: the video information processing device generates, through the text generation framework, text description information matched with the target video.
Step 305: the video information processing device outputs the text description information matched with the target video.
Taking the processing of a game video as an example, the description of the video information processing method provided by the present application continues below. Referring to fig. 5, fig. 5 is an optional flow diagram of the video information processing method provided by an embodiment of the present invention, which specifically includes the following steps:
Step 501: extracting video clips from the target game video and configuring corresponding tags.
Step 502: determining the conditional probability parameters in the video information processing process based on the label information of the game video clips.
Step 503: determining the portraits matched with the objects described in the game video.
Step 504: determining a description object in the video clip, and generating, through the quadruple, a text description matched with the target object based on the conditional probability parameters and the corresponding portrait.
Referring to table 1, in some embodiments of the present invention, different conditional probabilities may be configured for the video information processing process, so that it adapts to different game types. The present invention provides a conditional-probability-based framework for automatically generating video description text, as shown in table 1. For a role-playing game in which the game characters can kill one another, the kill events can be divided into major types such as penta kill, quadra kill and triple kill, which serve as first type labels; under each major type, the kill events can be further divided into several subtypes, for example the penta-kill type contains subtypes such as an extreme-reversal penta kill or a "five consecutive unparalleled kills" streak, which serve as second type labels. The game characters in the game, including game character A, game character B and game character C, form target object labels as second type labels, and the corresponding picture category labels, also second type labels, include "tower-diving kill" ("Yuetaqiangsha") and "ambushing in the grass".
TABLE 1
[Table 1 is reproduced in the original as two images. It lists, for each game type, the first type labels (e.g. penta kill, quadra kill, single kill), the candidate derivations (β sequences) with their selection probabilities q and conditions d, and the candidate pseudo-terminator texts (λᵢ) with their selection probabilities p and conditions c.]
As can be seen from table 1, the more candidate derivations (β sequences) and the more candidate texts (λᵢ) for the pseudo terminators there are to select from, the more diverse the generated text description information. The probabilities (p and q) give different weights or preferences to the candidate derivations and candidate words. The first and second conditional Boolean values allow the candidate derivations and candidate text information to adapt to the input video tag: more specifically, the first conditional Boolean value makes the selection of a derivation more accurate, while the second conditional Boolean value makes the selection of candidate words more accurate and their collocation more accurate and fluent.
Fig. 6 is a schematic diagram of the data structure of the video information processing method provided by the present invention. Referring to table 1: (1) S is the start symbol, and at the start the sequence contains only the single symbol S; (2) an English symbol in single quotation marks represents a pseudo terminator (i.e. λ in the framework), meaning that it cannot be derived, i.e. replaced by other symbols; here 'start' represents the opening text of the description and 'end' represents the closing text of the description; (3) English symbols without quotation marks (i.e. α and β in the framework) can be derived, i.e. further substituted by other symbols, and are ultimately represented by a sequence of pseudo terminators; (4) the number in brackets on the right side of a derivation indicates the probability of that derivation being selected (i.e. q in the framework); in the example, the right side of S has only one derivation, so the probability is 1.0; if several derivations exist, they are separated by the symbol |, and the sum of the probabilities of all derivations of a rule is 1.0. Deriving S then yields the sequence 'start' video 'end'. Specifically, the case where S has only one derivation on its right side can be read as: under the penta-kill condition, the text description "game character A scores a penta kill" is selected with probability 1.0. The right side of S may also be expressed by several derivations, for example: penta kill with probability 0.2, single kill with probability 0.5, and quadra kill with probability 0.3.
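Written as data, the example grammar of fig. 6 and table 1 might look as follows. This reuses the Alternative class from the earlier sketch; the rule and symbol names mirror the figure, while the exact condition tests and the right-hand sides of kill and kill4 are assumptions of this sketch.

    # Quoted names ('start', 'hero', ...) are pseudo terminators; unquoted
    # names (S, video, kill5, ...) are derivable non-terminators.
    GRAMMAR = {
        "S": [Alternative(["'start'", "video", "'end'"], 1.0)],
        "video": [
            Alternative(["kill5"], 0.2,
                        condition=lambda ctx: "penta kill" in ctx["tags"]),
            Alternative(["kill"], 0.5),
            Alternative(["kill4"], 0.3,
                        condition=lambda ctx: "quadra kill" in ctx["tags"]),
        ],
        "kill5": [Alternative(["'hero'", "'penta_kill'", "'pun'"], 1.0)],
        "kill": [Alternative(["'hero'", "'single_kill'", "'pun'"], 1.0)],
        "kill4": [Alternative(["'hero'", "'quadra_kill'", "'pun'"], 1.0)],
    }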
When the text description matched with the target object is generated through the quadruple based on the conditional probability parameters and the corresponding portrait, assume the first type label of the input target video is "penta kill" and the second type label is the game character "Hoyi", with d in the quadruple framework being the first conditional Boolean value. Then the quadra-kill condition d_kill4 on the right side of the derivation is not satisfied and takes the value False, while the penta-kill condition d_kill5 is satisfied and takes the value True. Since the condition d_kill4 is False, the kill4 derivation is deleted during generation of the text description information, and its probability is evenly redistributed to the two derivations kill5 and kill: the probability that kill5 (penta kill, shown in table 1) is selected changes from 0.2 to 0.2 + 0.3/2 = 0.35, and the probability that kill (single kill) is selected changes from 0.5 to 0.5 + 0.3/2 = 0.65. Assuming kill5 is selected, the sequence 'start' kill5 'end' is obtained.
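This redistribution can be checked against the sketches above; the context key "tags" and its values are assumptions of the sketch.

    # d_kill4 evaluates to False for a penta-kill video, so its probability
    # 0.3 is split evenly over the two surviving derivations of rule "video".
    ctx = {"tags": ["penta kill"]}
    alternatives = GRAMMAR["video"]
    alive = [a for a in alternatives
             if a.condition is None or a.condition(ctx)]
    share = sum(a.prob for a in alternatives if a not in alive) / len(alive)
    print([round(a.prob + share, 2) for a in alive])  # prints [0.35, 0.65]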
In forming the first candidate text information, the specific word represented by a pseudo terminator may be selected from the candidate words (i.e. λᵢ in the framework). For example, 'start' indicates the beginning of the description; since the input event tag is penta kill, one candidate is chosen with a certain probability distribution (i.e. p in the framework) from a set such as ["Let everybody see this penta-kill play,", "Do you know what a penta kill is?", ""], assuming "Do you know what a penta kill is?" is chosen. If the event tag were quadra kill, one candidate would likewise be selected, with a certain probability distribution, from the corresponding set ["Let everybody see this quadra-kill play,", "Do you know what a quadra kill is?", ""] as the first candidate text.
Since the sequence 'start' kill5 'end' does not consist entirely of pseudo terminators, derivation through the quadruple structure must continue: substituting the right side of kill5 yields the sequence 'start' 'hero' 'penta_kill' 'pun' 'end'. Since all symbols of this sequence are now pseudo terminators, the derivation of the text description information can end.
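The derivation as a whole can be sketched as a loop that expands the leftmost derivable symbol until only pseudo terminators remain; derive is an invented name, and the sketch again reuses pick_alternative and GRAMMAR from above.

    def derive(grammar, context, start="S"):
        # Repeatedly replace the leftmost derivable (unquoted) symbol until
        # the sequence consists only of pseudo terminators.
        seq = [start]
        while any(sym in grammar for sym in seq):
            i = next(i for i, sym in enumerate(seq) if sym in grammar)
            alt = pick_alternative(grammar[seq[i]], context)
            seq = seq[:i] + list(alt.symbols) + seq[i + 1:]
        return seq

    # derive(GRAMMAR, ctx) may, for example, yield
    # ["'start'", "'hero'", "'penta_kill'", "'pun'", "'end'"]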
Specifically, the target object label corresponding to the game character of the input video is "Hoyi", so 'hero' selects one candidate from ["the heaven-defying Hoyi", "Hoyi"] with the corresponding probability, based on the second conditional Boolean value (i.e. c in the framework); assume "the heaven-defying Hoyi" is selected. Similarly, assume 'penta_kill' selects "wipes out the enemy team in one penta kill". Meanwhile, 'pun' represents a punctuation mark in the generation of the text description information: since 'hero' in the second candidate text information selected "the heaven-defying Hoyi", 'pun' can take the value "!" to strengthen the tone and make the collocation between the words more accurate and fluent; had 'hero' selected another candidate, 'pun' would take the value ".". Of course, since the text description information can serve as the title of the video, 'end' can also select the empty character to meet the video title requirements of the video recommendation system. The text description information thus generated, which describes the target video in natural language, may be "Do you know what a penta kill is? The heaven-defying Hoyi wipes out the enemy team in one penta kill!", allowing the user to determine the content of the target video from the text description information.
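The final step, realizing each pseudo terminator as surface text, might be sketched as below. The candidate strings are illustrative English renderings of this example only, and the handling of 'pun' shows how a second conditional Boolean value can perceive the text generated so far.

    LEXICON = {
        "'start'": [("Do you know what a penta kill is? ", 0.5),
                    ("Let everybody see this penta-kill play: ", 0.5)],
        "'hero'": [("The heaven-defying Hoyi", 0.5), ("Hoyi", 0.5)],
        "'penta_kill'": [(" wipes out the enemy team in one penta kill", 1.0)],
        "'end'": [("", 1.0)],  # empty text, to fit video-title requirements
    }

    def realize(seq):
        out = []
        for sym in seq:
            if sym == "'pun'":
                # condition c: an exclamation mark after the emphatic hero
                # wording, a full stop otherwise
                out.append("!" if "heaven-defying" in "".join(out) else ".")
            else:
                words, probs = zip(*LEXICON[sym])
                out.append(random.choices(words, weights=probs, k=1)[0])
        return "".join(out)

    # realize(derive(GRAMMAR, ctx)) may return, for example:
    # "Do you know what a penta kill is? The heaven-defying Hoyi wipes out
    # the enemy team in one penta kill!"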
Therefore, referring to fig. 7 and fig. 8, which are optional effect diagrams of the video information processing method provided by the embodiment of the present invention: with the video information processing method provided by the present application, text description information matched with the target video is generated through the text generation framework, so that the content of the target video can be described in natural language, and non-technical personnel can timely and accurately convert video content into corresponding natural-language text for output. No high-quality training samples need to be configured for a neural network model as in the conventional technology, which effectively increases the sharing speed of video content, expands the sharing scenarios of video content, and improves the user experience. Meanwhile, the text description information generated through the text generation framework is more diverse: for example, as shown in fig. 8, two different text descriptions of the same target video can be generated to attract different users to watch, and the framework can also be migrated and applied to different game videos, saving development cost for operators.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. A method for processing video information, the method comprising:
acquiring a target video, and processing the target video to obtain a multi-level video tag matched with the target video;
in response to the target video and a multi-level video tag matched with the target video, determining a corresponding text generation framework, wherein the text generation framework is composed of a conditional probability combination and a pseudo terminator set;
and generating text description information matched with the target video through the text generation framework.
2. The method of claim 1, further comprising:
acquiring an original video and determining time sequence information of the original video;
determining a playing time parameter and a storage position parameter corresponding to the original video according to the time sequence information of the original video;
and extracting a target video from the original video based on the playing time length parameter and the storage position parameter corresponding to the original video.
3. The method of claim 1, wherein processing the target video to form a multi-level video tag matching the target video comprises:
based on a feature matching network in a video processing model, carrying out optical character recognition on text information in a video frame of the target video to obtain an optical character recognition text corresponding to the video frame;
determining an event label matched with the target video according to the optical character recognition text;
based on an object recognition network in the video processing model, carrying out object recognition processing on image information in a video frame of the target video, and determining a target object label matched with the target video;
classifying image information in a video frame of the target video based on an image classification network in the video processing model, and determining a picture category label matched with the target video;
and determining a multi-level video label matched with the target video based on the event label, the target object label and the picture category label.
4. The method of claim 3, further comprising:
determining a video time parameter matched with an event label in the original video based on the event label matched with the target video;
based on the video time parameter, re-extracting the corresponding video frame from the original video;
and forming a target video matched with the event label based on the extracted video frames.
5. The method according to claim 3, wherein the determining a multi-level video tag matching the target video through the association relationship among the event tag, the target object tag and the picture category tag comprises:
determining that an event label matched with the target video is a first type label;
determining that a target object label and a picture category label matched with the target video are second type labels;
and carrying out hierarchical combination processing on the first type label and the second type label to form a multi-level video label matched with the target video.
6. The method of claim 1, wherein determining a text generation framework corresponding to the video processing model in response to the target video and a multi-level video tag matching the target video comprises:
configuring a corresponding starting identifier for the text generation framework;
determining a conditional probability parameter corresponding to the video label based on the video label matched with the target video;
configuring a limited set of corresponding non-terminators for the text generation framework;
determining a limited set of pseudo terminators in the text generation framework based on the conditional probability parameters;
determining a limited set of derived non-terminator subsets in the text generation framework based on the conditional probability parameter.
7. The method of claim 6, wherein generating, by the text generation framework, text description information that matches the target video comprises:
determining input video tags matched with the text generation framework;
performing first conditional probability matching processing on the input video label based on a limited set of derivation non-terminator subsets in the text generation framework to form first candidate text information;
performing second probability matching processing on the input video label based on the limited set of pseudo terminators in the text generation framework to form second candidate text information;
and generating text description information matched with the target video by combining the first candidate text information and the second candidate text information.
8. The method of claim 7, wherein performing a first conditional probability matching process on the input video tag based on a limited set of derived non-terminator subsets in a text generation framework to form a first candidate text message comprises:
determining a corresponding text selection probability based on the input video tag;
determining a first conditional Boolean value based on the label of the target video;
and performing first conditional probability matching processing on a first type label in the input video labels based on the text selection probability and the first conditional Boolean value to form first candidate text information.
9. The method of claim 7, wherein performing a second probability matching process on the input video tag based on the limited set of pseudo terminators in the text generation framework to form a second candidate text information comprises:
determining a second conditional boolean value based on the tag of the target video and the first candidate text information;
and performing second conditional probability matching processing on a second type label in the input video labels according to the second conditional Boolean value, through the limited set of pseudo terminators in the text generation framework, to form second candidate text information.
10. The method of claim 1, further comprising:
determining portrait information of a target object when the target video is a game type video, wherein the portrait information of the target object includes at least one of skills, weapons, actions, gender, personality, lines, and skin of the target object in the game type video;
and determining the input video label matched with the text generation framework based on the portrait information of the target object.
11. A video information processing apparatus, characterized in that the apparatus comprises:
the information transmission module is used for acquiring a target video;
the information processing module is used for processing the target video to form a multi-level video label matched with the target video;
the information processing module is used for responding to the target video and a multi-level video label matched with the target video and determining a corresponding text generation framework, wherein the text generation framework is composed of a conditional probability combination and a pseudo terminator set;
and the information processing module is used for generating text description information matched with the target video through the text generation framework.
12. The apparatus of claim 11,
the information processing module is used for acquiring an original video and determining the time sequence information of the original video;
the information processing module is used for determining a playing time parameter and a storage position parameter corresponding to the original video according to the time sequence information of the original video;
and the information processing module is used for extracting a target video from the original video based on the playing time parameter and the storage position parameter corresponding to the original video.
13. The apparatus of claim 11,
the information processing module is used for carrying out optical character recognition on text information in a video frame of the target video based on a feature matching network in a video processing model to obtain an optical character recognition text corresponding to the video frame;
the information processing module is used for determining an event label matched with the target video according to the optical character recognition text;
the information processing module is used for carrying out object recognition processing on image information in a video frame of the target video based on an object recognition network in the video processing model and determining a target object label matched with the target video;
the information processing module is used for classifying the image information in the video frame of the target video based on the image classification network in the video processing model and determining the picture category label matched with the target video;
and the information processing module is used for determining a multi-level video label matched with the target video according to the incidence relation among the event label, the target object label and the picture category label.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the video information processing method of any one of claims 1 to 10 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the video information processing method of any one of claims 1 to 10.
CN202011141537.1A 2020-10-22 2020-10-22 Video information processing method and device, electronic equipment and storage medium Active CN112163560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141537.1A CN112163560B (en) 2020-10-22 2020-10-22 Video information processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112163560A (en) 2021-01-01
CN112163560B (en) 2024-03-05







Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant