CN113392690A - Video semantic annotation method, device, equipment and storage medium

Video semantic annotation method, device, equipment and storage medium

Info

Publication number
CN113392690A
Authority
CN
China
Prior art keywords
video
information
virtual
game
video frame
Prior art date
Legal status
Pending
Application number
CN202110002075.3A
Other languages
Chinese (zh)
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110002075.3A
Publication of CN113392690A

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/52 Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/837 Shooting of targets
    • A63F13/847 Cooperative playing, e.g. requiring coordinated actions from several players to achieve a common goal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a video semantic annotation method, apparatus, device, and storage medium, belonging to the field of video semantic understanding. The method includes: acquiring a video frame sequence from a video; extracting multi-modal information of at least two dimensions from the video frames in the video frame sequence; determining, based on the combination of the multi-modal information of the at least two dimensions, a target scenario event matched from at least two scenario events as the scenario event of the video frame; and performing semantic annotation on the video frame sequence according to the scenario events of the video frames to obtain the semantic information of the video. No dedicated semantic understanding model needs to be trained, which improves the efficiency and accuracy of semantic annotation, and the semantically annotated video can be pushed to users so that they watch higher-quality video content, improving the video browsing experience.

Description

Video semantic annotation method, device, equipment and storage medium
Technical Field
The present application relates to the field of video semantic understanding, and in particular, to a video semantic annotation method, apparatus, device, and storage medium.
Background
When a user watches a video, information related to the video, such as the video name, video tags (type), and highlight segments, is displayed in the video playing interface.
A video producer uploads a video to a video platform, the platform annotates the uploaded video with information and stores it, and when a user searches for videos using the annotated information, the platform sends the videos corresponding to that information to the user's client. Taking a game video as an example, after a game anchor (video producer) uploads the game video to the video platform, the highlight segments that may exist in the game video are semantically annotated by a machine learning model and the semantic information of the game video is output, for example, the clip between 1 minute 30 seconds and 2 minutes on the timeline is annotated as a team fight.
In this technical scheme, the machine learning model requires a large number of semantically annotated sample videos during training, and these sample videos must be labeled manually, which makes it difficult to obtain them in large quantities. When semantically annotated sample videos are scarce, a machine learning model trained on fewer of them has poor semantic annotation accuracy.
Disclosure of Invention
The embodiments of the present application provide a video semantic annotation method, apparatus, device, and storage medium. By determining the scenario event matched by the combination of the virtual element information and game-play information extracted from a video frame, semantic annotation can be performed on the video according to the scenario event, and the semantic information of the video can be obtained without a machine learning model trained on sample videos, which improves the efficiency and accuracy of semantic annotation. The technical scheme is as follows:
according to one aspect of the application, a video semantic annotation method is provided, and the method comprises the following steps:
acquiring a video frame sequence in a video;
extracting multi-modal information in at least two dimensions from video frames in the sequence of video frames;
determining, based on the combination of the multi-modal information of the at least two dimensions, a target scenario event matched from at least two scenario events as the scenario event of the video frame;
and performing semantic annotation on the video frame sequence according to the scenario events of the video frames to obtain the semantic information of the video.
According to another aspect of the present application, there is provided a video semantic annotation device, including:
the acquisition module is used for acquiring a video frame sequence in a video;
an extraction module for extracting multi-modal information of at least two dimensions from video frames in the sequence of video frames;
the processing module is used for determining, based on the combination of the multi-modal information of the at least two dimensions, a target scenario event matched from at least two scenario events as the scenario event of the video frame;
and the annotation module is used for performing semantic annotation on the video frame sequence according to the scenario events of the video frames to obtain the semantic information of the video.
According to another aspect of the present application, there is provided a computer device comprising: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video semantic annotation method as described above.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which is loaded and executed by a processor to implement the video semantic annotation method as described above.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions to cause the computer device to execute the video semantic annotation method as described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the target scenario event matched with the multi-modal information combination extracted from the video frame is determined as the scenario event of the video frame, so that the scenario event corresponding to the video frame is utilized to perform semantic annotation on the video frame, a machine learning model obtained through sample video training is not needed to perform semantic information annotation on the video frame, the efficiency and the accuracy of semantic annotation on the video frame are improved, and a user can quickly capture a highlight segment in the video frame according to the video frame with the semantic information annotation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a block diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a system framework diagram of a video platform provided by an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a video semantic annotation method provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of annotation of semantic information of a video provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart of a video semantic annotation method provided by another exemplary embodiment of the present application;
FIG. 6 is a schematic illustration of a preset game event provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a video playback interface with semantic annotation information according to another exemplary embodiment of the present application;
fig. 8 is a block diagram of a method for identifying a character identifier of a first virtual character according to an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a manner of extracting virtual element information and game-play information according to an exemplary embodiment of the present application;
FIG. 10 is a schematic illustration of video frame types provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a video frame containing a result of a match-up provided by an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a video frame containing game-play state information provided by an exemplary embodiment of the present application;
FIG. 13 is a flowchart framework of a video semantic annotation method provided by an exemplary embodiment of the present application;
FIG. 14 is an interface diagram of a live user interface provided by an exemplary embodiment of the present application;
FIG. 15 is a diagram of a virtual environment screen provided by an exemplary embodiment of the present application;
FIG. 16 is a system framework diagram of a video platform provided by another exemplary embodiment of the present application;
FIG. 17 is a block diagram of a video semantic annotation apparatus provided in an exemplary embodiment of the present application;
FIG. 18 is a block diagram of a computer device according to an exemplary embodiment of the present application;
fig. 19 is a schematic device structure diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are described:
Message source (Feeds, Web Feed, News Feed, Syndicated Feed): also called a feed, news feed, or web feed, a message source is a data format through which an information distribution platform delivers the latest information to users, usually arranged in a Timeline, which is the most primitive, intuitive, and basic display form of a message source. A prerequisite for a user to be able to subscribe to a website is that the website provides a message source. Combining multiple Feeds is called aggregation, and software that performs aggregation is called an aggregator. Aggregators are applications dedicated to subscribing to websites for end users, and typically include Really Simple Syndication (RSS) readers, Feed readers, news readers, and the like.
Virtual environment: the virtual environment displayed (or provided) when an application runs on a terminal. The virtual environment may be a simulation of the real world, a semi-simulated and semi-fictional environment, or a purely fictional environment. The virtual environment may be any one of a two-dimensional virtual environment, a 2.5-dimensional virtual environment, and a three-dimensional virtual environment, which is not limited in this application. The following embodiments are illustrated with a three-dimensional virtual environment. In some embodiments, the virtual environment is used to provide a combat environment for at least two master virtual characters. The virtual environment includes symmetrical lower-left and upper-right areas; the master virtual characters belonging to two hostile camps each occupy one of the areas, and each side takes destroying the target building, base point, or crystal deep in the opposing area as its winning goal.
Virtual character: a movable object in the virtual environment. The movable object may be a virtual person, a virtual animal, an anime character, or the like, such as a person or animal displayed in a three-dimensional virtual environment. Optionally, the virtual character is a three-dimensional model created based on skeletal animation technology. Each virtual character has its own shape and volume in the three-dimensional virtual environment and occupies part of the space in it. In the embodiments of the present application, the first virtual character is described as a virtual character controlled by a user, and the first virtual character generally refers to one or more first virtual characters in the virtual environment.
Multiplayer Online Battle Arena (MOBA) games: games in which different virtual teams belonging to at least two hostile camps each occupy their own map area in the virtual environment and compete with a certain winning condition as the goal. Such winning conditions include, but are not limited to: occupying strongholds or destroying the strongholds of the hostile camp, killing the virtual characters of the hostile camp, ensuring one's own survival in a specified scene and time period, seizing certain resources, or outscoring the opponent within a specified time. Tactical competition can be carried out in rounds, and the map of each round may be the same or different. Each virtual team includes one or more virtual characters, such as 1, 2, 3, or 5. The duration of a round of the MOBA game runs from the moment the game starts to the moment the winning condition is achieved.
Artificial Intelligence (AI): a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Cloud Computing: in the narrow sense, the delivery and usage model of Internet Technology (IT) infrastructure, i.e., obtaining the needed resources through a network in an on-demand, easily scalable manner; in the broad sense, the delivery and usage model of services, i.e., obtaining the needed services through a network in an on-demand, easily scalable manner. Such services may be IT and software, internet-related, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as Grid Computing, Distributed Computing, Parallel Computing, Utility Computing, Network Storage, Virtualization, and Load Balancing.
With the diversified development of the internet, real-time data streams, and connected devices, and the growing demand for search services, social networks, mobile commerce, open collaboration, and the like, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing conceptually drives revolutionary changes in the whole internet model and the enterprise management model.
The embodiments of the present application provide a video semantic annotation method that can be combined with cloud computing technology to store annotated videos and accurately push an annotated video to the client of a user according to the user's needs.
The video semantic annotation method based on the virtual environment can be applied to computer equipment with strong data processing capacity. In a possible implementation manner, the video semantic annotation method based on the virtual environment provided by the embodiment of the application can be applied to a personal computer, a workstation or a server, that is, semantic information in a video can be identified through the personal computer, the workstation or the server, so that the video content can be conveniently understood. Illustratively, the video semantic annotation method based on the virtual environment is applied to a background server of an application program, so that a terminal installed with the application program can receive video content containing semantic information by means of the background server.
FIG. 1 illustrates a schematic diagram of a computer system provided by an exemplary embodiment of the present application. The computer system 100 includes a first terminal 110, a second terminal 111, and a server 120, wherein the first terminal 110 and the second terminal 111 respectively communicate data with the server 120 via a communication network. Illustratively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The first terminal 110 has an application installed and running therein, and the application is an application having a video playing function. The application may be a video application (including a short video application), a live application, a music application, a social application, a Virtual Reality application (VR), an Augmented Reality Application (AR), a gaming application, a shopping application, a payment application, a group purchase application, and the like. Illustratively, the first terminal 110 has a video application installed thereon, the first terminal 110 is a terminal used by a first user (video producer) who records a game video and distributes the game video in the video application.
The second terminal 111 has installed and runs therein an application that is the same application as the application in the first terminal 110 or a different application in the same type. The application may be a video application (including a short video application), a live application, a music application, a social application, a Virtual Reality application (VR), an Augmented Reality Application (AR), a gaming application, a shopping application, a payment application, a group purchase application, and the like. Illustratively, the second terminal 111 has a video application installed thereon, and the second terminal is a terminal used by a second user, and the second user (video consumer) watches a game video in the video application, and the game video may be a video distributed by the first user.
In some embodiments, the first terminal 110 and the second terminal 111 may be mobile terminals such as a smart phone, a smart watch, a tablet computer, a laptop portable notebook computer, a smart robot, and may also be terminals such as a desktop computer and a projection computer, and the type of the terminals is not limited in the embodiments of the present application. It is understood that the first terminal 110 and the second terminal 111 can be the same terminal, such as a video producer who uses the first terminal 110 to publish videos and uses the first terminal 110 to watch videos, and the video producer who is also a video consumer.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. In one possible implementation, the server 120 is a backend server for applications installed in the terminal.
As shown in fig. 1, in this embodiment, the first terminal 110 transmits a video to the server 120. Taking a video containing MOBA game pictures as an example, the server 120 acquires the video 11 and performs multi-modal information extraction 12 on it. Illustratively, the multi-modal information includes virtual element information and game-play information; the virtual element information includes element information corresponding to the active elements and static elements in the virtual environment, and the game-play information includes information corresponding to the first virtual character shown in the controls. Based on the combination of the multi-modal information of the at least two dimensions, the target scenario event matched from at least two scenario events is determined as the scenario event 13 of the video frame, and the video frame is annotated according to the scenario event 13 to obtain the semantic annotation information 14 corresponding to the video frame.
The server 120 sends the annotated video to the second terminal 111, and the video application installed in the second terminal 111 displays the video playing interface of the game video. The video playing interface includes the video content of the game video, the game name, the game tag, the highlight segments in the video, and videos on related topics, where the highlight segments and the related-topic videos are obtained from the semantic annotation information 14 corresponding to the video. Annotating the semantic information of the game video through the server 120 helps users watching the video quickly capture the highlight moments of the game video, improving the viewing experience.
It can be understood that, in the embodiment, only the server corresponding to the video application program in the terminal is taken as an example, and in practical application, the video semantic annotation method may also be applied to the server corresponding to the live broadcast application program, which is not limited in the embodiment of the present application.
For convenience of description, the following embodiments are described as examples in which the video semantic annotation method is executed by a server corresponding to an application program.
Fig. 2 illustrates a system framework diagram of a video platform provided by an exemplary embodiment of the present application. The description will be given by taking the case where the video producer records the game video. The video producer uploads the recorded game video to a background server corresponding to the video application (i.e. the video content production end 21), and illustratively, the background server of the video application includes an uplink content interface server 22, a video content storage server 23, a content database server 24, a scheduling center server 25, a statistical interface and analysis server 28, and a downlink content interface server 38.
The video producer uploads the game video to the uplink content interface server 22, and the uplink content interface server 22 stores the game video in the video content storage server 23; the uplink content interface server 22 stores the meta information of the game video in the content database server 24, where the meta information includes at least one of the video file size, cover image, video link, bit rate, file format, video title, release time, and author; the uplink content interface server 22 sends the game video to the scheduling center server 25.
The scheduling center server 25 calls the video deduplication service 26 to deduplicate the game video, where deduplication includes at least one of title deduplication, cover image deduplication, caption deduplication, video fingerprint deduplication, and audio fingerprint deduplication of the game video. After deduplicating the game video, the video deduplication service 26 feeds the deduplication result back to the scheduling center server 25.
The scheduling center server 25 calls the manual auditing system 27 to audit the deduplicated game video, and the manual auditing system 27 feeds the auditing result back to the scheduling center server 25; alternatively, the video deduplication service 26 directly sends the deduplicated game video to the manual auditing system 27, which audits it and feeds the auditing result back to the scheduling center server 25. The manual audit is a preliminary check of whether the video content involves pornography, gambling, or politically sensitive characteristics; the auditor then needs to label the classification of the video or confirm whether the labels of the video content are correct.
The manual auditing system 27 sends auditing information during auditing the game video to the statistics interface and analysis server 28, and the statistics interface and analysis server 28 is further configured to receive scheduling information of the scheduling center server 25 during task scheduling. In some embodiments, the statistics interface and analysis server 28 is configured to perform preliminary statistics and analysis on the marked sample video, store the results of the statistics and analysis in the video content sample database 33, and obtain the sample video from the video content sample database 33.
The scheduling center server 25 calls the video content understanding main service 29 to semantically annotate the video content of the game video, and the video content understanding main service 29 includes a video preprocessing service 30, a video information extraction service 31, and a video understanding service 32. The three services are built on the video content processing models 34; the video frames are processed by calling the video content processing model 34 corresponding to each service, and multi-modal information of at least two dimensions is extracted from the video frames, so that various scenario events can be constructed based on the multi-modal information. The video preprocessing service 30 is used to preprocess the game video, for example classifying the game video frames and enhancing their resolution; the video information extraction service 31 is used to extract the multi-modal information from the game video frames; the video understanding service 32 is used to obtain the scenario event occurring in the virtual environment from the extracted multi-modal information, match it against the target scenario events, and, if it matches a target scenario event, annotate the game video with semantic information according to that target scenario event.
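As a rough, non-authoritative sketch of how these three sub-services could be chained (all class, parameter, and function names below are illustrative assumptions, not part of the original disclosure), the main service can be thought of as a simple pipeline:

    from typing import Callable, Dict, List

    class VideoContentUnderstandingService:
        """Hypothetical sketch of the main service 29 composed of the three sub-services."""

        def __init__(self,
                     preprocess: Callable,   # video preprocessing service 30
                     extract: Callable,      # video information extraction service 31
                     understand: Callable):  # video understanding service 32
            self.preprocess = preprocess
            self.extract = extract
            self.understand = understand

        def annotate(self, frames: List[object]) -> List[Dict]:
            # 1. Preprocess: classify the game video frames and enhance their resolution.
            frames = [self.preprocess(frame) for frame in frames]
            # 2. Extract multi-modal information of at least two dimensions per frame.
            infos = [self.extract(frame) for frame in frames]
            # 3. Understand: match the combined information against the target scenario
            #    events and return the semantic annotations for the video.
            return self.understand(frames, infos)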
The video content processing model 34 is trained from sample videos obtained from the video content sample database 33.
The video content understanding main service 29 feeds back the game video labeled with the semantic information to the scheduling center server 25, and the scheduling center server 25 sends the video address corresponding to the game video labeled with the semantic information to the downlink content interface server 38.
When a user watches a game video (i.e., at the video content consuming end 39), the video address is obtained from the downlink content interface server 38, and the client corresponding to the user then obtains the game video from the video content storage server 23 according to that address. It can be understood that the video addresses in the downlink content interface server 38 all correspond to game videos tagged with semantic information, which enables the user to identify the highlight segments in a game video and the videos associated with it.
Through the system framework, a user can quickly acquire wonderful segments in the video, and the user can selectively watch the video to help the user quickly understand ideas expressed by the video content.
Fig. 3 shows a flowchart of a video semantic annotation method provided by an exemplary embodiment of the present application, which is applied to the server 120 shown in fig. 1. The method comprises the following steps:
step 301, a sequence of video frames in a video is obtained.
Illustratively, the server receives a video sent by a terminal. An application is installed in the terminal, and the user captures the video through the application, or the user captures the video with a camera application in the terminal system (the application calls the camera application to capture the video) and sends the recorded video to the server through the application. The applications include video applications (including short-video applications), live-streaming applications, and other applications that support video capture.
Illustratively, the server stores video in advance, or the server acquires video from an open video data set.
A video comprises a plurality of video frames forming a video frame sequence. The embodiments of the present application are described by taking the case where the video is a game video as an example, and the video acquired by the server is a recording of the screen while the game application is running.
The running game application presents the following two kinds of pictures: the virtual environment picture during a match (the first picture) and the picture outside a match (the second picture).
The first picture includes, but is not limited to, the following pictures: a picture of the first virtual character playing a match in the virtual environment as a participant, a picture of the first virtual character spectating other virtual characters' matches as an observer, a picture of cards being played according to a game strategy, a picture of the user controlling a virtual vehicle (such as a virtual car or a virtual ship) driving in the virtual environment, a picture of resisting attacks from non-player characters (NPCs), and a picture of eliminating a preset number of identical elements.
The second picture includes, but is not limited to, the following pictures: a character selection picture in which the user selects the first virtual character participating in the match, a loading picture before entering the virtual environment, a skill selection picture for choosing the skills (or virtual props) carried by the first virtual character, a message receiving picture for receiving notification messages, a prop purchase picture for buying virtual props, a character training picture for simulating the first virtual character's participation in a match, and the like.
The picture is a picture contained in a video frame in a video, that is, a frame or frames corresponding to the picture in a sequence of video frames.
Step 302, multi-modal information in at least two dimensions is extracted from video frames in a sequence of video frames.
Multimodal information refers to information having multiple sources or forms. Multimodal information includes, but is not limited to, the following types of information:
when the video frame sequence is a video frame sequence of a MOBA game, the multi-modal information includes information corresponding to the virtual characters and game-play information;
when the video frame sequence is a video frame sequence of a First-Person Shooter (FPS) game, the multi-modal information includes information corresponding to the virtual characters and game-play information;
when the video frame sequence is a video frame sequence of a racing game, the multi-modal information includes information corresponding to the virtual vehicles, time information, and ranking information;
when the video frame sequence is a video frame sequence of a simulation (nurturing) game, the multi-modal information includes information corresponding to the nurtured object and nurturing (or business) result information;
when the video frame sequence is a video frame sequence of a card game, the multi-modal information includes card information and betting result information;
when the video frame sequence is a video frame sequence of a tower defense game, the multi-modal information includes information corresponding to the protected object and information corresponding to the NPCs;
when the video frame sequence is a video frame sequence of a music game, the multi-modal information includes information corresponding to virtual music elements (e.g., block elements representing musical notes), score information, and combo information (information generated by consecutively hitting multiple virtual music elements).
Schematically, the description will be given by taking a video frame sequence as an example of a video frame sequence of the MOBA game.
The server extracts information from each video frame in the video frame sequence. As shown in FIG. 4, information 312 is extracted from a video frame 311 in the video frame sequence, where the information 312 includes virtual element information and game-play information. The virtual element information refers to information corresponding to virtual elements in the virtual environment, including information corresponding to the active elements and static elements displayed in the virtual environment picture; the game-play information refers to information related to the match involving the first virtual character, for example, the broadcast information control displaying that first virtual character a has killed second virtual character b.
The game-play information includes game-play state information and game-play result information. The game-play state information is obtained from the controls displayed on the user interface and includes the skill information or prop information used when the first virtual character attacks another virtual character. For example, a broadcast information control displayed on the user interface broadcasts that first virtual character a has killed second virtual character b. The game-play result information is the information corresponding to the match result produced after a match ends.
Illustratively, the server invokes an information extraction model, which is a machine learning model with information extraction capability, to extract the multi-modal information of at least two dimensions from the video frame sequence. The information extraction model can be trained in an unsupervised manner, so that it can be obtained with relatively few training samples.
Illustratively, the server stores feature templates of the multi-modal information and matches them against the content of the video frame; if the matching result is consistent, the extracted multi-modal information is determined to be the multi-modal information corresponding to the video frame.
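A minimal sketch of this template-matching style of extraction is given below; the region name, template pattern, and input format are illustrative assumptions, and it presumes an upstream step has already recognized the text in each control and read the life value from the health bar:

    import re
    from typing import Dict

    # Hypothetical feature template: the text pattern expected in the broadcast information control.
    KILL_TEMPLATE = re.compile(r"(?P<killer>\w+) killed (?P<victim>\w+)")

    def extract_multimodal_info(control_text: Dict[str, str],
                                life_value: float) -> Dict[str, object]:
        """Combine virtual element information and game-play information for one frame.

        control_text maps a UI region name (e.g. "broadcast_control") to the text
        recognized in that region; life_value is the first virtual character's life
        value read from the health bar.
        """
        info: Dict[str, object] = {"life_value": life_value, "kill": None}
        match = KILL_TEMPLATE.search(control_text.get("broadcast_control", ""))
        if match:  # the matching result is consistent with the stored template
            info["kill"] = match.groupdict()
        return info

    # Example: a frame whose broadcast control reads "heroA killed heroB".
    print(extract_multimodal_info({"broadcast_control": "heroA killed heroB"}, 0.18))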
Step 303, a target scenario event matched from at least two scenario events is determined, based on the combination of the multi-modal information of the at least two dimensions, as the scenario event of the video frame.
Wherein, the target scenario event includes but is not limited to the following events:
the first type of game-play event is used to characterize an event in which the first virtual character plays a match in the virtual environment;
the second type of game-play event is used to characterize an event in which a virtual vehicle drives in the virtual environment;
the third type of game-play event is used to characterize an event in which cards are played according to a game strategy;
the defense event is used to characterize an event in which damage to a protected target from a non-player character (NPC) is defended against (using virtual props or skills);
and the element elimination event is used to characterize an event in which a preset number of identical elements are dragged to the same position through a drag operation and the identical elements are eliminated.
The server combines the multiple pieces of modal information extracted from the video frame sequence into scenario events, compares these scenario events with the target scenario events, and, if they are consistent, determines the scenario event of the video frames.
Illustratively, the video frame sequence is a video frame sequence of a MOBA game. The server extracts, from the video frame sequence, the life value information corresponding to the first virtual character and the broadcast information corresponding to the broadcast information control. The life value information indicates that the life value of the first virtual character stays below a preset life value for a period of time, and the broadcast information indicates that the first virtual character killed enemy virtual characters a preset number of times within that period; this scenario event is named "silk-blood multi-kill" (completing multiple kills while at very low health).
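A hedged sketch of such a rule is shown below; the sampling format, thresholds, and window length are assumptions chosen for illustration rather than values from the original disclosure:

    from typing import List, Tuple

    def is_silk_blood_multi_kill(samples: List[Tuple[float, float, int]],
                                 life_threshold: float = 0.2,
                                 kill_threshold: int = 2,
                                 window: float = 10.0) -> bool:
        """Check whether, within some time window, the first virtual character's life
        value stays below life_threshold while it completes at least kill_threshold kills.

        samples is a list of (timestamp, life_value, cumulative_kills) tuples read
        from consecutive video frames; the default thresholds are illustrative.
        """
        for t0, _, kills0 in samples:
            in_window = [s for s in samples if t0 <= s[0] <= t0 + window]
            low_life = all(life <= life_threshold for _, life, _ in in_window)
            kills_in_window = in_window[-1][2] - kills0
            if low_life and kills_in_window >= kill_threshold:
                return True
        return False

    # Example: life value stays at 15% or lower while two kills are announced within 6 seconds.
    frames = [(0.0, 0.15, 0), (3.0, 0.12, 1), (6.0, 0.10, 2)]
    print(is_silk_blood_multi_kill(frames))  # True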
Step 304, semantic annotation is performed on the video frame sequence according to the scenario events of the video frames to obtain the semantic information of the video.
The semantic annotation refers to the annotation of the category to which each pixel point in the video frame belongs. By performing semantic annotation on the video, information contained in the video content can be extracted, so that a video viewer can understand the video content conveniently. The semantic information is used for representing descriptive information corresponding to the video content.
As shown in fig. 4, the video frame sequence is semantically annotated according to the target scenario event matched by the combination of the virtual element information and the game-play information, obtaining the video semantic information 313. In some embodiments, the semantic information further includes at least one of the type of the video content, the user account that published the video, and the first virtual character. For example, user A (user account a) publishes a game video of the MOBA type; the scenario event generated when the first virtual character plays a match in the video is the event corresponding to that match, and the first virtual character used by the user is character K.
In summary, in the method provided by this embodiment, the target scenario event matched by the combination of multi-modal information extracted from a video frame is determined as the scenario event of that video frame, so the video frame is semantically annotated using its corresponding scenario event. The semantic information of the video frames can be annotated without a machine learning model trained on sample videos, which improves the efficiency and accuracy of semantic annotation and allows a user to quickly capture the highlight segments in the video according to the semantically annotated video frames.
Fig. 5 shows a flowchart of a video semantic annotation method according to another exemplary embodiment of the present application. The method is applied in a server 120 as shown in fig. 1. The method comprises the following steps:
step 501, acquiring a video frame sequence in a video.
Illustratively, the video is a game video recorded by a game anchor during a live stream, and the game video is a video of a MOBA game. The server acquires a video frame sequence from the game video for semantic annotation. The pictures included in the video frame sequence are game pictures, including a picture of the game anchor controlling the first virtual character (the first virtual character being a participant) in a match, a picture of the game anchor controlling the first virtual character to spectate other virtual characters' matches, a picture of the game anchor selecting the first virtual character to participate in a match, a picture of the game anchor selecting the virtual props (or skills) carried by the first virtual character, a loading picture before entering the virtual environment, a picture of purchasing virtual props, a picture of simulating the first virtual character's participation in a match, and the like.
The pictures of the first virtual characters during a match further include pictures of the first virtual characters attacking one another, pictures of broadcasts or prompts about the match situation, and pictures of messages sent between first virtual characters belonging to the same team. Illustratively, at least one control for broadcasting or prompting the match situation is displayed in the picture, and the control may be a broadcast information control or a score control (for counting the match score).
Illustratively, the game video is a video of a racing game, in which the user controls a virtual vehicle, such as a virtual car or a virtual ship, in a speed contest; the contest has a fixed route, and the first virtual vehicle to cross the finish line is the winner. The game video includes at least one of the following pictures: a picture of the virtual vehicle driving in the virtual environment, a picture of the virtual vehicle crossing the finish line, a picture of the user selecting the virtual vehicle, a picture corresponding to the ranking after the race ends, a picture corresponding to the ranking changing dynamically during the race, and a picture of the speed of the virtual vehicle.
Illustratively, the game video is a video of a tower defense game, in which the user uses virtual props or virtual elements to prevent NPCs from harming the protected object, or attacks the NPCs with the virtual props or virtual elements.
Step 502, multi-modal information in at least two dimensions is extracted from video frames in a sequence of video frames.
Illustratively, the server extracts the multi-modal information from the video frame sequence by calling information extraction models; this embodiment is described by taking the case where the multi-modal information includes virtual element information and game-play information as an example. A video frame in the video frame sequence includes a virtual environment picture, which shows the virtual environment in which the first virtual character is active and at least one control.
A first information extraction model is called to extract the virtual element information from the virtual environment picture, where the virtual element information includes information corresponding to at least one of the active elements and static elements in the virtual environment picture; a second information extraction model is called to extract the game-play information from the control, where the game-play information includes the information corresponding to the first virtual character shown in the control. The first information extraction model is a machine learning model for extracting virtual element information from the virtual environment picture and may be a classification model, such as one built with a convolutional neural network; the second information extraction model is a machine learning model for extracting game-play information from the control and may also be a classification model.
Virtual elements are divided according to their motion state into active elements and static elements. Active elements include the first virtual character, neutral virtual characters, and non-player characters (NPCs). Static elements include building elements (e.g., defense towers) and environment-related elements in the virtual environment (e.g., plant elements, weather elements). A neutral virtual character is a virtual character that belongs to neither side of the match; it has a life value and can be attacked by the first virtual characters in the match, and under their attacks its life value keeps decreasing until it reaches zero. In some embodiments, when the life value of the neutral virtual character drops to zero, it is converted into a reward for the first virtual character to collect. The reward may be currency for purchasing virtual props, skill points for upgrading the first virtual character, a virtual prop, or the like. An NPC is a virtual character that belongs to the same team as the first virtual character but is not controlled by a user; it has a life value and can perform attack actions according to preset game logic.
In one example, the server calls the first information extraction model to extract the character identifier of the first virtual character from the virtual environment picture, and calls the second information extraction model to extract the game-play information between the first virtual characters from the control, such as the number of other virtual characters killed by first virtual character a.
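The two-model extraction step can be sketched as follows; the class names, predicted fields, and hard-coded outputs are placeholders assumed purely for illustration:

    from typing import Dict

    class FirstInfoExtractionModel:
        """Placeholder for the first information extraction model, which reads
        virtual element information from the virtual environment picture."""

        def predict(self, environment_picture) -> Dict[str, object]:
            # e.g. character identifier, bush/minimap state, nearby building elements
            return {"character_id": "hero_a", "in_bush": True, "near_defense_tower": False}

    class SecondInfoExtractionModel:
        """Placeholder for the second information extraction model, which reads
        game-play information from the controls on the user interface."""

        def predict(self, controls) -> Dict[str, object]:
            # e.g. kill counts announced by the broadcast information control
            return {"kills_by_hero_a": 3}

    def extract_frame_info(environment_picture, controls) -> Dict[str, Dict[str, object]]:
        """Merge the two dimensions of multi-modal information for one video frame."""
        return {
            "virtual_element_info": FirstInfoExtractionModel().predict(environment_picture),
            "game_play_info": SecondInfoExtractionModel().predict(controls),
        }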
Step 503, a target scenario event matched from at least two scenario events is determined, based on the combination of the multi-modal information of the at least two dimensions, as the scenario event of the video frame.
Illustratively, target scenario events are stored in the server in advance, and the event formed by combining the multi-modal information is matched against the target scenario events to determine the scenario event corresponding to the video frame.
Step 503 may be replaced by the following steps:
step 5031, a corresponding relation is obtained, and the corresponding relation is used for representing the corresponding relation between the event after the multi-modal information combination and the target storyline event.
Illustratively, the server stores the corresponding relationship between the event combined by the multi-modal information and the target scenario event. The corresponding relationship may be at least one of a functional relationship and a table lookup relationship.
Step 5032, acquiring a target storyline event according to the event after the combination of the corresponding relation and the multi-modal information, and determining the target storyline event as a storyline event of the video frame.
Illustratively, in a game video of the MOBA game type, the target scenario events (first type of game play events) include, but are not limited to, the following events:
the first office event is used for representing that the first virtual role reduces the life value of the second virtual role to zero within a preset time period, the life value of the first virtual role is lower than a preset life value, and the number of the second virtual roles is a preset number;
the second office-to-office event is used for representing that the duration of the first virtual role in the hidden state in the virtual environment exceeds the preset duration, and the first virtual corner reduces the life value of the second virtual role to zero;
and a third office event pair, wherein the third office event pair is used for representing that the first virtual character reduces the life value of the building element in the virtual environment, the surrounding range of the building element does not comprise the second virtual character, and the building element is a building element corresponding to the team where the second virtual character is located.
As shown in fig. 6, virtual element information 314 and game-play information 315 are extracted from the video frame. The virtual element information 314 includes monster information (neutral virtual character information), defense tower information (building element information), life value and hero information (information about the first virtual character), minimap information (a map preview of the virtual environment), and bush information (scene information). Monsters are neutral virtual characters in the virtual environment that belong to neither side of the match, and both sides can obtain rewards by killing them; defense towers are building elements owned by the two sides of the match, and each team owns at least one defense tower. The bush is grass in the virtual environment used to hide the first virtual character, who can crouch in the bush to ambush enemy virtual characters. The game-play information 315 includes hit state information, match result information, and broadcast information.
According to the correspondence between the events formed by combining the virtual element information 314 and the game-play information 315 and the scenario information streams 316 (preset game-play events), the scenario information stream 316 that the virtual element information 314 and the game-play information 315 conform to is determined, and that scenario information stream 316 is determined as the game-play event corresponding to the video frame.
Illustratively, the event formed by combining the virtual element information 314 and the game-play information 315 corresponds to a first game-play event 317, where the first game-play event 317 is: the life value (health bar) of hero (virtual character) a stays below the preset life value for a period of time, and hero a completes a preset number of kills within that period; the first game-play event 317 is named "silk-blood multi-kill".
Illustratively, the event formed by combining the virtual element information 314 and the game-play information 315 corresponds to a second game-play event 318, where the second game-play event 318 is: within a period of time, hero a stays in the bush longer than a preset duration and then completes a kill; the second game-play event 318 is named "bush squat" (ambushing from the grass).
Illustratively, the event formed by combining the virtual element information 314 and the game-play information 315 corresponds to a third game-play event 319, where the third game-play event 319 is: within a period of time, hero a of camp A attacks the crystal (building element) of camp B alone while no hero of camp B is within the surrounding range of that crystal; the third game-play event 319 is named "backdoor". Each of the two sides of the match has a crystal; after the defense towers on one of the enemy's lanes have been destroyed, the enemy's crystal can be destroyed to win the match.
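The correspondence described in step 5031 and step 5032 can be sketched as a lookup table of rules keyed by target game-play event; the field names and thresholds below are illustrative assumptions, not values from the original disclosure:

    from typing import Callable, Dict, Optional

    # Each entry maps a target game-play event name to a rule over the combined
    # multi-modal information extracted for a span of video frames.
    CORRESPONDENCE: Dict[str, Callable[[dict], bool]] = {
        "silk-blood multi-kill": lambda info: info["life_value"] < 0.2 and info["kills"] >= 2,
        "bush squat": lambda info: info["bush_time"] > 5.0 and info["kills"] >= 1,
        "backdoor": lambda info: info["attacking_crystal"] and info["enemies_near_crystal"] == 0,
    }

    def match_target_event(info: dict) -> Optional[str]:
        """Return the first target game-play event whose rule the combined
        multi-modal information satisfies, or None if no rule matches."""
        for event_name, rule in CORRESPONDENCE.items():
            if rule(info):
                return event_name
        return None

    # Example: hero a attacks the enemy crystal alone -> matched as "backdoor".
    print(match_target_event({"life_value": 0.8, "kills": 0, "bush_time": 0.0,
                              "attacking_crystal": True, "enemies_near_crystal": 0}))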
In step 504, a start timestamp and an end timestamp of a video segment in the sequence of video frames to which a plurality of consecutive video frames having the same storyline event belong are determined.
Information is extracted from each video frame in the video frame sequence. When a plurality of consecutive video frames represent the same scenario event, the consecutive video frames form a video segment; the timestamp of the video frame corresponding to the start time of the scenario event is taken as the start timestamp, and the timestamp of the video frame corresponding to the end time of the scenario event is taken as the end timestamp.
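As an illustration of step 504, the following minimal sketch (with an assumed per-frame representation) groups consecutive video frames that share the same scenario event and reads off the start and end timestamps of each resulting segment.

# A minimal sketch of segmenting a frame sequence by scenario event.
from itertools import groupby

def segment_by_event(frames):
    """frames: list of (timestamp_seconds, event_name or None), in playback order."""
    segments = []
    for event, group in groupby(frames, key=lambda f: f[1]):
        group = list(group)
        if event is None:                 # frames with no recognized event are skipped
            continue
        segments.append({
            "event": event,
            "start_ts": group[0][0],      # timestamp of the first frame of the event
            "end_ts": group[-1][0],       # timestamp of the last frame of the event
        })
    return segments

frames = [(140, None), (142, "grass ambush"), (143, "grass ambush"), (183, "grass ambush"), (184, None)]
print(segment_by_event(frames))
# [{'event': 'grass ambush', 'start_ts': 142, 'end_ts': 183}]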
And 505, performing semantic annotation on the video segments according to the scenario event, the start timestamp and the end timestamp to obtain semantic information corresponding to the video segments in the video.
By combining the start timestamp and the end timestamp, semantic annotation can be performed on the video segments in which scenario events occur in the video frame sequence; for example, a game video is annotated with the semantic information that the office event occurring from 2:22 to 3:03 is the second office event. In some embodiments, the server further calls a video classification model to identify the video and obtain the video type, such as a game video, a food video, a science popularization video, a comedy video, or a parody video. The video type and the annotated office event are combined to obtain the semantic information of the video.
As shown in fig. 7, the annotated video semantic information 400 can be displayed in a video playing interface. The video semantic information 400 includes game names, game tags, game themes, and game play events. The game tag can be obtained through information input by a video producer during video uploading, or can be obtained through detecting a video frame; the game theme may also be obtained from information entered by the video producer when uploading the video, or from detection of video frames. Illustratively, a scenario event and a timeline of the scenario event are also displayed on the video playing interface, and a user can directly jump the playing progress of the played video to the scenario event by clicking the scenario event.
In summary, in the method provided in this embodiment, the target scenario event matched with the multi-modal information combination extracted from the video frame is determined as the scenario event of the video frame, so that the scenario event corresponding to the video frame is utilized to perform semantic annotation on the video frame, and the semantic information of the video frame can be annotated without a machine learning model obtained through sample video training, thereby improving the efficiency and accuracy of performing semantic annotation on the video frame, and enabling a user to quickly capture a highlight segment in the video frame according to the video frame annotated with the semantic information.
The method provided by the embodiment also associates the scenario events and the time in the video with one another by combining the start time stamps and the end time stamps corresponding to the scenario events, so that a user can conveniently and quickly position the scenario events in the video frames according to the time stamps.
In the method provided by this embodiment, the multi-modal information is further refined and the target scenario event is associated with the event obtained by combining the multi-modal information, so that for different types of videos the corresponding multi-modal information is extracted from the video picture in a corresponding manner. The extracted information better fits the video type, a target scenario event that better fits the video is matched, and the efficiency and accuracy of semantic annotation of the video are improved.
Taking the game video as the MOBA game video as an example, how to extract the virtual element information and the game-play information in the video frame is explained.
Firstly, extracting virtual element information.
1. The virtual element information includes a role identifier corresponding to the first virtual role.
Step 511, obtain the first role identifier of the first virtual role participating in the game.
As shown in fig. 8, the server obtains a first character identification 71 of a first virtual character participating in the game play through the game loading page. When two parties of the game play the game, the server acquires a first role identifier (namely, hero name) of a first virtual role selected by each user, for example, the user 1 selects hero a, the user 2 selects hero b, the user 3 selects hero c, the hero a and the hero b have a teammate relationship, and the hero a and the hero c are in a hostile relationship. Illustratively, the server calls a first information extraction model to extract the first character identifier of each first virtual character from the game loading page; illustratively, after the user selects the first virtual character, the client sends the character identifier of the selected first virtual character to the server, and the server obtains the first character identifier 71 of the first virtual character. In some embodiments, the client sends the game loading page to the server, and the server determines the first character identifier of the first virtual character participating in the game play by performing character recognition on characters in the game loading page.
Step 512, calling the element classification model to identify a life value region corresponding to the first virtual role in the virtual environment picture, and obtaining a second role identifier of the first virtual role corresponding to the life value region.
Illustratively, during the office, the character identifier of the first virtual character and the life value area of the first virtual character are displayed in the same region; the life value area is usually located above the head of the first virtual character, and the character identifier of the first virtual character (the second character identifier 73) is displayed above the life value area.
The server calls an element classification model to identify a second role identification of the first virtual role from the picture frame of the game. The element classification model is obtained by training based on a sample video marked with role identification in advance. Illustratively, the element classification model first locates a life value region in the video frame, and identifies a second role identifier of the first virtual role from the life value region.
Step 513, in response to the first role identifier matching with the second role identifier, obtaining a first confidence corresponding to the second role identifier.
Feature matching is performed between the second character identifier 73 obtained in step 512 and the first character identifiers 71, and a first confidence 75 between the second character identifier 73 and each first character identifier is calculated. Confidence refers to the degree of match between the first character identifier and the second character identifier; if the first confidence 75 between the first character identifier 71 and the second character identifier 73 is 0.9, there is a matching probability of 0.9 between them, that is, the first character identifier 71 is close to the second character identifier 73.
Step 514, a positioning and tracking model is called to identify the first virtual character in the virtual environment picture according to the first character identifier, so as to obtain a third character identifier of the first virtual character displayed in the virtual environment picture, where the third character identifier corresponds to a second confidence, and the positioning and tracking models correspond one-to-one to the first virtual characters participating in the office.
The server initializes a positioning and tracking model for each first virtual character in the video frame according to the acquired first character identifier 71, the positioning and tracking model performs positioning and tracking on the first virtual character at all times, selects the first virtual character in the video frame, extracts the selected first virtual character, extracts a third character identifier 72 according to the selected first virtual character, and calculates a second confidence 74 between the third character identifier 72 and the first character identifier 71.
Illustratively, the localization tracking model is a DiMP (Discriminative Model Prediction) tracking model. The DiMP model has the capability of discriminatively predicting the tracking target, so that the tracking target in the video frame is locked and loss of the tracking target is avoided. This improves the positioning accuracy and positioning efficiency for the tracking target and facilitates the subsequent identification of the character identifier of the first virtual character.
The positioning and tracking model initializes a tracker for each first virtual character in the initial virtual environment picture, that is, one tracker is configured for each first virtual character, and the tracker tracks the first virtual character according to the character identifier. When a first virtual character leaves the virtual environment picture (for example, the character exits the current office or is killed), whether a new target is tracked is judged. If a new target is tracked, it is determined whether the new target corresponds to an existing tracker: if not, a tracker is initialized for the new target; if so, the new target is re-matched with its corresponding tracker and continues to be tracked.
And step 515, obtaining the role identification of the first virtual role according to the first confidence degree and the second confidence degree.
The first confidence 75 obtained in step 513 is compared with the second confidence 74 obtained in step 514, and the character identifier with the higher confidence is selected as the character identifier extracted from the video frame by the server.
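The following minimal sketch illustrates step 515 under assumed, simplified output shapes for the two models: the identifier produced by the element classification model (with the first confidence) and the identifier produced by the localization tracking model (with the second confidence) are fused by keeping the higher-confidence result.

# A minimal sketch of fusing the two recognition paths for the role identifier.

def fuse_role_identifier(classification_result, tracking_result):
    """Each result is an assumed dict like {'role_id': 'hero_a', 'confidence': 0.9}."""
    candidates = [classification_result, tracking_result]
    best = max(candidates, key=lambda r: r["confidence"])
    return best["role_id"], best["confidence"]

role_id, conf = fuse_role_identifier(
    {"role_id": "hero_a", "confidence": 0.90},   # from the life value region (first confidence)
    {"role_id": "hero_a", "confidence": 0.74},   # from the DiMP tracker (second confidence)
)
print(role_id, conf)  # hero_a 0.9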
It should be noted that step 513 and step 514 may be executed simultaneously, or step 514 may be executed before step 513.
In summary, in the method provided in this embodiment, the role identifier of the first virtual role in the video frame is identified by a dual extraction manner of the localization tracking model and the classification model, so that the server can accurately extract the role identifier of the first virtual role from the video frame, and a result of performing semantic annotation on the video frame by using the role identifier of the first virtual role subsequently is more accurate.
2. The virtual element information includes location information corresponding to building elements in the virtual environment.
Step 521, extracting building element features corresponding to the building elements from the virtual environment picture.
Illustratively, the server invokes a feature extraction model to extract building element features from the video frames. Taking the building element as the defense tower as an example, the extracted building element characteristics can be the appearance shape of the defense tower, the position of the defense tower and the life value of the defense tower. The feature extraction model is obtained after training according to a sample video, and the sample video comprises corresponding building element features.
And 522, matching the building element characteristics with preset building element characteristics to obtain first matching characteristics.
Illustratively, the server stores preset building element features in advance, or the server includes a feature matching library including the preset building element features. And performing feature matching on the building element features extracted in the step 521 and preset building element features to determine first matching features, wherein the first matching features are features corresponding to the position information of the building elements.
Step 523, a first convolutional neural network is called to classify the first matching features, so as to obtain the location information corresponding to the building elements.
And calling a first Convolutional Neural Network (CNN) by the server to classify the first matching features to obtain position information corresponding to the building elements. Illustratively, the first convolutional neural network is obtained by training according to a sample video containing preset building element features, the preset building element features include features corresponding to position information of the building elements, and the trained first convolutional neural network can extract the position information corresponding to the building elements based on the first matching features.
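A minimal sketch of step 523 is given below; the network layout, the input patch size, and the number of position classes are assumptions for illustration rather than the actual first convolutional neural network described above.

# A minimal sketch of classifying a matched building element patch into a
# position label (e.g. which lane/side the defense tower belongs to).
import torch
import torch.nn as nn

class BuildingPositionCNN(nn.Module):
    def __init__(self, num_positions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_positions)

    def forward(self, x):                    # x: cropped patch around the matched feature
        h = self.features(x).flatten(1)
        return self.classifier(h)            # logits over position classes

model = BuildingPositionCNN()
patch = torch.randn(1, 3, 64, 64)            # assumed 64x64 crop of the defense tower region
position_logits = model(patch)
print(position_logits.argmax(dim=1))         # predicted position class index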
As shown in fig. 9, taking the building element as a defense tower as an example, the server extracts defense tower information 41 from the video frame, and the defense tower information 41 includes the defense tower position and the defense tower life value. The defense tower position determines where the defense tower is located in the virtual environment; when a first virtual character is near the defense tower, the position of the first virtual character can be obtained indirectly. Whether the first virtual character interacts with the defense tower can be determined from the defense tower life value, so that an office event, such as the first virtual character attacking the defense tower, can be determined.
In summary, in the method provided in this embodiment, the position information corresponding to the building element in the virtual environment is extracted, and the position of the building element in the virtual environment is determined according to the position information corresponding to the building element, so that when the building element interacts with the first virtual role, more game events can be generated, so that the semantic annotation has richer game events (video scenarios), and the efficiency and accuracy of performing semantic annotation on the video frame are improved.
3. The virtual element information includes location information corresponding to a neutral virtual character in the virtual environment.
And 531, extracting character element characteristics corresponding to the neutral virtual character from the virtual environment picture.
The server extracts virtual element information from the video frame, including neutral virtual character information; a neutral virtual character is an NPC character that does not belong to either party of the office. Illustratively, the server calls the feature extraction model to extract the character element features corresponding to the neutral virtual character from the video frame, where the feature extraction model is obtained by training on sample videos containing character element features.
And 532, matching the role element characteristics with preset role element characteristics to obtain second matching characteristics.
Illustratively, the server stores preset role element features in advance, or the server includes a feature matching library including the preset role element features. And matching the character element characteristics in the step 531 with the preset character element characteristics to determine second matching characteristics, wherein the second matching characteristics are characteristics corresponding to the position information of the neutral virtual character.
And step 533, calling a second convolutional neural network to classify the second matching features, so as to obtain position information corresponding to the neutral virtual character.
Illustratively, the second convolutional neural network is obtained by training on sample videos containing preset character element features, the preset character element features include features corresponding to the position information of the neutral virtual character, and the trained second convolutional neural network can extract the position information corresponding to the neutral virtual character based on the second matching features.
As shown in fig. 9, taking the neutral virtual character as a wild monster as an example, the server extracts wild monster information 42 from the video frame. The wild monster information 42 includes the wild monster position and the wild monster life value. The wild monster position determines where the wild monster is located in the virtual environment; when a first virtual character is near the neutral virtual character, the position of the first virtual character can be obtained indirectly. Whether the first virtual character interacts with the wild monster can be determined from the wild monster life value, so that an office event can be determined, for example the first virtual character kills the wild monster and obtains a corresponding reward.
In summary, in the method provided in this embodiment, the position information corresponding to the neutral virtual character in the virtual environment is extracted, and the position of the first virtual character in the virtual environment is indirectly determined according to the position information corresponding to the neutral virtual character, so that when the first virtual character interacts with the neutral virtual character, more game events can be generated, so that the semantic annotation has richer game events (video scenarios), and the efficiency and accuracy of performing semantic annotation on video frames are improved.
And secondly, extracting the game information.
1. The office information includes office ending information.
Step 541, calling a video frame classification model to classify the video frame sequence to obtain a video frame corresponding to the office ending state, where the video frame corresponding to the office ending state includes a control.
Taking the game type as an MOBA game as an example, the type of the video frame includes at least one of the following types:
a. An office picture frame, which is a video frame corresponding to the picture displayed while the first virtual character performs the office.
As shown in fig. 10 (a), the first virtual character 44 and the other virtual character 45 play the office in the virtual environment; the first virtual character 44 and the other virtual character 45 may be in a teammate relationship or in a hostile relationship, and the video frame corresponding to the office process is an office picture frame.
b. A role selection picture frame, which is a video frame corresponding to the picture for selecting the first virtual character participating in the office.
As shown in fig. 10 (b), the rectangle 47 on the left indicates a user account belonging to camp A, the rectangle 48 on the right indicates a user account belonging to camp B, and the rectangle 46 at the bottom indicates the first virtual characters selectable by the user. The user can select, in this picture, the first virtual character participating in the office, and the video frame corresponding to this picture is a role selection picture frame.
c. An office ending picture frame, which is a video frame displayed after the office is ended.
As shown in fig. 10 (c), after one office is completed, the office result 49 is displayed on the screen; for example, the office result 49 is a victory, and the video frame corresponding to this picture is an office ending picture frame.
d. A loading picture frame, which is a video frame corresponding to the picture prompting the first virtual characters participating in the office.
As shown in fig. 10 (d), before the office starts, that is, before the first virtual characters enter the virtual environment, a game loading picture is displayed, in which the camps participating in the office and the character poster 50 of the first virtual character in each camp are displayed. In some embodiments, the character poster 50 shows the skin (clothing) worn by the first virtual character. The video frame corresponding to the game loading picture is a loading picture frame.
e. A non-target video frame, which is a video frame corresponding to a picture area other than the office picture.
As shown in fig. 10 (e), the game video is captured while the game anchor is live streaming; when the anchor is not playing a game or enters a non-game interface, a non-target picture is displayed in the live picture. For example, when the anchor pauses the game and an advertisement 51 is displayed in the live picture, the video frame corresponding to that live picture is a non-target video frame. It is understood that the loading picture frame and the role selection picture frame may also be treated as non-target video frames.
The video frame classification model is a machine learning model obtained by training on different types of sample videos. The video frame corresponding to the office ending state is the picture displayed after one office is finished; a control in this video frame indicates the end of the current office, for example a control showing the office result or a control showing the office ranking list.
And 542, performing character recognition on the control to obtain a character recognition result.
As shown in the left diagram of fig. 11, the picture 52 displayed after the end of the office contains an office ending control that includes characters; the characters in the picture 52 are recognized to obtain the character recognition result 53, which is a win. Illustratively, the characters in the image are classified and recognized through a trained classification model to obtain the character recognition result; the classification model is obtained by training on sample video frames, including sample video frames labeled with character recognition results, such as sample video frames labeled with the characters for victory, failure, and end.
And 543, in response to that the character recognition result comprises a match result, extracting match ending information from the video frame corresponding to the match ending state according to the match result.
In some embodiments, a "tap the screen to continue" prompt is also displayed in the video frame corresponding to the office ending state; when this prompt is recognized, it indicates that one office has ended, the video frame is used as the ending marker of that office, and the subsequent video frames are analyzed and understood as the next office.
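The end-of-office extraction flow of steps 541 to 543 can be sketched as follows; the frame labels, the keyword mapping, and the use of pytesseract as a stand-in character recognizer are assumptions for illustration, not the actual recognition models of this embodiment.

# A minimal sketch: inspect only frames classified as end-of-office, run
# character recognition on them, and map the recognized text to an office result.
import pytesseract
from PIL import Image

RESULT_KEYWORDS = {"victory": "win", "defeat": "loss"}     # assumed keyword mapping

def extract_match_result(frame_path, frame_label):
    if frame_label != "match_end":            # only end-of-office frames are inspected
        return None
    text = pytesseract.image_to_string(Image.open(frame_path)).lower()
    for keyword, result in RESULT_KEYWORDS.items():
        if keyword in text:
            return {"result": result, "end_frame": frame_path}
    return None

# Example usage (paths are illustrative):
# print(extract_match_result("end_frame.png", "match_end"))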
In summary, in the method of this embodiment, the session ending information is extracted from the video frame corresponding to the session ending status, and the session ending information is used as the ending information corresponding to the current session and also can be used as the starting information corresponding to the next session, so as to split between two adjacent sessions, and enable the server to accurately determine the starting time and the ending time of the one session.
2. The office information includes office state information between the first virtual roles.
Step 551, a video frame classification model is called to classify the video frame sequence to obtain an office picture frame, where the office picture frame includes a control.
Similarly, the server calls the video frame classification model to identify the video frame sequence and determines the office picture frames from the video frame sequence. The video frame classification model is a machine learning model obtained by training on different types of sample videos. An office picture frame is a video frame corresponding to the picture displayed while the first virtual character performs the office.
Illustratively, the office picture frame contains an information broadcast control of the office state, such as a broadcast that a first virtual character has killed two virtual characters, the office score, or a gained score.
And step 552, identifying the hitting state in the control to obtain a hitting state result, wherein the hitting state result is an event result corresponding to the hitting event generated between the first virtual characters.
Illustratively, the server calls the classification model to identify the region in the office picture frame where a hit state is generated, that is, to frame the first virtual character. The classification model is trained on sample videos containing hit events. As shown in fig. 12, when the first virtual character 54 and the neutral virtual character 55 are framed, the first virtual character 54 is hitting the neutral virtual character 55.
When a hitting event is generated between the first virtual roles, a broadcast information control is usually displayed in a user interface, and the opposite-office state information is obtained by identifying the broadcast information control.
The striking is that the first virtual character strikes other virtual characters or the first virtual character strikes a neutral virtual character, and the striking can reduce the life value of an attacked object. The classification model is obtained after training according to a sample video containing striking state labels and is used for framing the virtual character generating the striking event in the video frame.
Step 553, the office state information is extracted from the office picture frame according to the hit state result.
As shown in fig. 9, the office state information can be obtained through the broadcast information control 43; for example, if the classification model identifies that the first virtual character A of camp A kills the first virtual character B of camp B, the server obtains the office state information that the first virtual character A has killed the first virtual character B.
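A minimal sketch of turning recognized broadcast text into structured office state information is given below; the broadcast phrasing ("X killed Y") and the regular expression are assumptions about the broadcast format, not the actual control contents.

# A minimal sketch of parsing a recognized broadcast string into a hit-state record.
import re

KILL_PATTERN = re.compile(r"(?P<attacker>\w+)\s+killed\s+(?P<victim>\w+)", re.IGNORECASE)

def parse_broadcast(text):
    match = KILL_PATTERN.search(text)
    if match is None:
        return None
    return {
        "event": "kill",
        "attacker": match.group("attacker"),   # first virtual character of camp A
        "victim": match.group("victim"),       # first virtual character of camp B
    }

print(parse_broadcast("HeroA killed HeroB"))
# {'event': 'kill', 'attacker': 'HeroA', 'victim': 'HeroB'}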
In summary, according to the method provided by this embodiment, the hit state result is obtained by identifying the event result of the hit event generated between the first virtual characters, so that the hit information between the first virtual characters is extracted from the hit picture frame according to the hit state result, and the server can perform semantic annotation on the video frame by combining more comprehensive information through the hit information between the first virtual characters, thereby improving the efficiency and accuracy of performing semantic annotation on the video frame.
Fig. 13 is a flowchart of a video semantic annotation method according to an exemplary embodiment of the present application. The method may be applied in a server 120 as shown in fig. 1. The process comprises the following three parts: video pre-processing 56, video information extraction 57, and video semantic annotation 58.
Firstly, video preprocessing.
The video preprocessing 56 refers to subjecting a video distributed by a user (video producer) to a frame cropping process 561, a game frame classification process 562, and a resolution enhancement process 563.
1. Frame cutting processing 561:
illustratively, take a game video captured during the live streaming of a game anchor as an example, as shown in fig. 14. Some game anchors record live videos in a picture-in-picture mode during the live streaming process, that is, frames are added around the virtual environment picture (game interface). In this case, directly performing the subsequent recognition task would cause large interference to the recognition result; therefore, the game video region needs to be identified first, so as to improve the labeling accuracy and recognition efficiency when performing semantic annotation on the game video.
The video frame in the video frame sequence comprises a virtual environment picture area and a live broadcast picture area, wherein the live broadcast picture area is used for representing a picture area shot in the live broadcast process of the anchor broadcast, and the live broadcast picture area does not comprise the virtual environment picture area. Firstly, determining a boundary between a virtual environment picture area and a live broadcast picture area; and then, cutting the live-action picture area according to the boundary to obtain a video frame containing the virtual environment picture area.
Determining a boundary between the virtual environment picture and the live picture area by:
and S1, acquiring the video frame sequence after binarization processing.
The binarization processing is to set the gray value of a pixel point in a video frame to be 0 or 255, that is, the whole video frame is converted into black and white. And carrying out binarization processing on the game video to obtain a video frame sequence after binarization processing.
S2, pixel points contained in the video frames of the binarized video frame sequence are transformed into the Hough space according to the Hough transform.
Each pixel point contained in the binarized video frame is transformed into the Hough space according to the Hough transform; each pixel point corresponds to a curve in the Hough space, and when curves intersect, the pixel points corresponding to those curves lie on the same straight line in the video frame. Therefore, the straight lines formed by the pixel points in the binarized video frame can be determined from the intersecting curves in the Hough space.
S3, in response to the number of curves intersecting at the same point in the Hough space being greater than or equal to a number threshold, it is determined that a straight line corresponding to that point exists in the video frame.
Theoretically, one pixel point corresponds to curves in arbitrarily many directions; in practical applications, the number of directions is limited for computation, so when the number of curves passing through a certain intersection in the Hough space exceeds the number threshold, that intersection corresponds to one straight line in the video frame.
And S4, determining the boundary between the virtual environment picture area and the live picture area according to the straight line corresponding to the pixel point.
As shown in fig. 14, the boundary 61 is a straight line determined by the Hough transform; based on this straight line, the frame is divided into the virtual environment picture 62, the first live picture area 63, and the second live picture area 64, thereby determining the virtual environment picture area in the video frame. The first live picture area 63 is the picture area corresponding to the anchor's avatar, and the second live picture area 64 is the picture area in which the anchor is captured in real time.
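The boundary detection of S1 to S4 can be sketched with OpenCV as follows; the binarization threshold, the edge detection step, and the vote threshold are assumptions for illustration, and only near-vertical lines are kept as candidate boundaries between the game picture and the live picture area.

# A minimal sketch: binarize the frame, detect straight lines with the Hough
# transform, and keep near-vertical lines as candidate boundaries.
import cv2
import numpy as np

def find_vertical_boundaries(frame_bgr, vote_threshold=200):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 50, 150)                     # assumed edge step before voting
    lines = cv2.HoughLines(edges, 1, np.pi / 180, vote_threshold)
    boundaries = []
    if lines is not None:
        for rho, theta in lines[:, 0]:
            if abs(theta) < np.pi / 36:                    # keep lines within ~5 degrees of vertical
                boundaries.append(int(abs(rho)))           # x position of the candidate boundary
    return sorted(boundaries)

# Example usage (path is illustrative):
# frame = cv2.imread("live_frame.png")
# print(find_vertical_boundaries(frame))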
2. Game frame Classification processing 562:
since video producers distribute videos from a variety of sources and with varying quality, game videos often contain a large number of non-target frames. As shown in fig. 10, while the game is running, the game video includes virtual environment pictures, anchor chat pictures, and non-office game pictures (e.g., video frames corresponding to a shop interface, a lobby interface, or a trading interface). Different pictures contain different information: the role selection picture frame and the loading picture frame contain the character identifiers of the first virtual characters participating in the office, the office ending picture frame contains office result information, and the office picture frame contains hit state information during the office. By filtering out non-target video frames in advance and adopting a corresponding recognition mode for each specific type of game picture, the processing efficiency of the video is improved and computing resources are saved.
Illustratively, a classification model constructed with a convolutional neural network is pre-trained on sample images from the ImageNet database and then trained with labeled video types, so as to identify the target video frames in the game video. The ImageNet database is a large visual database used for visual object recognition software research. Uniform Resource Locators (URLs) of over 14 million images have been manually annotated in the ImageNet database to indicate the objects in the pictures, and some images also include bounding boxes.
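A minimal sketch of such a classifier is shown below, assuming an ImageNet-pretrained ResNet-18 backbone from torchvision whose final layer is replaced with the frame types listed above; the class names and the choice of backbone are illustrative, not the actual model of this embodiment.

# A minimal sketch of building the game-frame classifier from a pretrained backbone.
import torch.nn as nn
from torchvision import models

FRAME_TYPES = ["office_play", "role_selection", "office_end", "loading", "non_target"]

def build_frame_classifier():
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Linear(backbone.fc.in_features, len(FRAME_TYPES))   # new classification head
    return backbone

classifier = build_frame_classifier()    # then fine-tune on labeled game video frames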
3. Resolution enhancement processing 563:
firstly, calling a Backbone network Backbone to perform feature extraction on a video frame in the video frame sequence to obtain a video feature corresponding to the video frame; and then, calling an enhanced super-resolution generation countermeasure network ESRGAN to process the video characteristics to obtain the video frame with enhanced resolution.
Due to the encoding and decoding operations in the video transmission process, the definition of game videos published by video producers is not very high (the resolution is generally around 720p), which causes large errors when extracting small-sized control information such as the minimap. In the embodiment of the application, video features are extracted from the video frames through the Backbone network, and the video features are processed through an Enhanced Super-Resolution Generative Adversarial Network (ESRGAN), which can enhance the resolution of the video to obtain resolution-enhanced video frames. In the subsequent extraction of multi-modal information, the extraction can then be performed on clearer video frames, making the information extraction more accurate. Enhancing picture resolution through an ESRGAN network and a Backbone network is a well-established technique in the art and will not be described here.
Super-Resolution Imaging (SR or SRI) is a technique for improving image resolution and is used in image processing and super-resolution microscopy.
A Generative Adversarial Network (GAN) is an unsupervised learning method in which two neural networks learn by playing a game against each other. A generative adversarial network consists of a generator network and a discriminator network. The generator network takes random samples from the latent space as input, and its output needs to imitate the real samples in the training set as closely as possible. The input of the discriminator network is either a real sample or the output of the generator network, and its purpose is to distinguish the generator's output from real samples as well as possible, while the generator should "deceive" the discriminator as much as possible. The two networks compete against each other and continuously adjust their parameters; the final goal is that the discriminator cannot judge whether the output of the generator is real.
Fig. 15 (a) is a minimap picture in a video uploaded by a video publisher, and fig. 15 (b) shows the effect of enhancing the resolution of the minimap in the manner provided by the embodiment of the application; the definition of the minimap in fig. 15 (b) is better than that in fig. 15 (a).
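The resolution enhancement step can be sketched as follows; the generator weights path and the typical 4x upscaling factor are assumptions, and any ESRGAN-style generator mapping a low-resolution tensor to an upscaled tensor could be plugged in here.

# A minimal sketch of applying an ESRGAN-style generator to one video frame.
import torch

def enhance_frame(frame_tensor, generator):
    """frame_tensor: (1, 3, H, W) in [0, 1]; generator: pretrained ESRGAN-style network."""
    generator.eval()
    with torch.no_grad():
        upscaled = generator(frame_tensor)      # typically 4x spatial upscaling
    return upscaled.clamp(0.0, 1.0)

# Example usage (weights path is an assumption):
# generator = torch.load("esrgan_generator.pth")
# hi_res = enhance_frame(low_res_frame, generator)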
And II, extracting video information.
The video information extraction 57 is to extract virtual element information and game information from the video. The embodiment of the present application extracts the virtual element information and the office information in the video by the video information extraction service 31 as shown in fig. 2. The information extraction method has been described in detail in the above embodiments, and is not described herein again.
The extracted information is written into a distributed file storage database (a MongoDB database). The MongoDB database supports multi-language queries and geographic location queries and has a flexible data structure, which enables efficient subsequent use of the extracted information. Taking a game video as an example, the information extracted by the video information extraction service 31 includes life value information (life values corresponding to the first virtual characters, neutral virtual characters, and defense towers), broadcast information corresponding to the broadcast control, defense tower information, wild monster information, grass information, minimap information, the character identifiers of the first virtual characters, office result information, office state information, and the like.
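A minimal sketch of writing one frame's extracted information into a MongoDB collection with pymongo is shown below; the connection string, database, collection, and field names are assumptions for illustration.

# A minimal sketch of persisting per-frame extraction results.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")          # assumed deployment
collection = client["video_annotation"]["frame_info"]

frame_record = {
    "video_id": "demo_video",
    "frame_ts": 142.0,
    "virtual_elements": {"hero": "hero_a", "hero_hp": 0.15, "tower_hp": 0.6},
    "office_info": {"broadcast": "HeroA killed HeroB", "result": None},
}
collection.insert_one(frame_record)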
For extracting different information, different information extraction services are correspondingly provided. The different information extraction services are obtained by standardizing the video content processing models, where standardization means building the video content processing model into the service framework, that is, adding the weight parameters corresponding to the video content processing model into the service framework. When an information extraction service is used, the actual input parameters (such as the video, or element features extracted from the video) need to be provided to the server, so as to obtain the virtual element information and office information of the video.
As shown in fig. 16, the video content processing model 34 is obtained through a video frame extraction service 35 and the sample videos in the video content sample library 33. The background server of the video application further includes a download file system 36, which is used to download the original video content from the video content storage server 23 and to control the download speed and progress; it is usually a group of parallel servers composed of related task scheduling and distribution clusters. For a downloaded video file, the frame extraction service 35 is called to obtain the necessary video information from the video source file, that is, to extract the virtual element information and the office information as the information for subsequently constructing the video content processing model 34.
And thirdly, video semantic annotation.
The video semantic annotation 58 refers to performing semantic annotation on a video to obtain semantic annotation information, and the semantic annotation information is used for describing video content.
The video semantic annotation 58 performs semantic annotation on the video by calling the video understanding service 32 shown in fig. 2. The video understanding service integrates the multi-modal video information (including virtual element information and office information) extracted by the video information extraction service 31, analyzes the scenarios appearing in the game video through pre-edited scenario logic (preset office events), and recalls the corresponding office event when a preset scenario condition is met, so as to mark out the highlight segment.
Illustratively, if the information extracted by the video information extraction service 31 is used directly, special effects, occlusion, and similar problems in the game video make missed detections and false detections unavoidable, which leads to low generalization and low precision of the script logic. Therefore, the information extracted by the video information extraction service 31 is first post-processed. The post-processing includes smoothing and closure processing; the information extracted from two adjacent frames is cross-verified, which further reduces the influence of false detections and missed detections caused by occlusion, special effects, and other problems in the detection algorithm.
The smoothing process is to insert a lost frame into the decoded video to form a smooth video and improve the quality of the video.
A closure is a function that can read the internal variables of another function; that is, a variable defined inside a function can be accessed within the function's own scope but not outside the function. The closure-based extraction process combines multiple video frames to extract video information.
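The cross-frame verification described above can be sketched as a simple neighborhood vote over per-frame detection flags; the window size and the unanimity rule are assumptions for illustration.

# A minimal sketch: a detection is kept or dropped based on agreement with its
# neighboring frames, which filters isolated false detections and fills isolated misses.

def smooth_detections(flags, window=1):
    """flags: list of booleans, one per frame, for a single detected element."""
    smoothed = list(flags)
    for i in range(window, len(flags) - window):
        neighbors = flags[i - window:i] + flags[i + 1:i + 1 + window]
        if all(neighbors):
            smoothed[i] = True          # neighbors agree: fill an isolated missed detection
        elif not any(neighbors):
            smoothed[i] = False         # neighbors agree: drop an isolated false detection
    return smoothed

print(smooth_detections([True, False, True, True, False, True, False]))
# [True, True, True, True, True, False, False]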
The information extracted from the video is stored in the MongoDB database and can be used as basic elements; a large number of complex video scripts (preset office events) can be obtained by continuously editing new script forms and configuring the processing flow.
In conclusion, by extracting virtual element information and office information from the video, a large amount of semantic annotation is completed without manual labeling. Through video frame cropping, game frame classification, and resolution enhancement, the video is preprocessed to improve the efficiency and precision of information extraction. Basic video information is extracted with a computer-vision-based method, and errors are reduced through smoothing and closure processing. Most of the algorithms adopt unsupervised learning methods that require no labeled data, and for scenes that are difficult for some unsupervised methods to handle, synthesized sample data is used, so that semantic annotation can be performed on videos without a machine learning model.
The following are embodiments of the apparatus of the present application, and for details that are not described in detail in the embodiments of the apparatus, reference may be made to corresponding descriptions in the above method embodiments, and details are not described herein again.
Fig. 17 is a block diagram illustrating a video semantic annotation device according to an exemplary embodiment of the present application. The apparatus can be implemented as all or a part of a terminal by software, hardware or a combination of both, and includes:
an obtaining module 1710, configured to obtain a sequence of video frames in a video;
an extraction module 1720 for extracting multi-modal information in at least two dimensions from a video frame of a sequence of video frames;
a processing module 1730, configured to determine a target storyline event matched out of at least two storyline events based on a combination of the multi-modal information of at least two dimensions as a storyline event of a video frame;
and the labeling module 1740 is configured to perform semantic labeling on the video frame sequence according to the scenario event of the video frame to obtain semantic information of the video.
In an alternative embodiment, the labeling module 1740 is configured to determine a start timestamp and an end timestamp of a video segment in the sequence of video frames, to which a plurality of consecutive video frames with the same storyline event belong; and performing semantic annotation on the video segments according to the scenario event, the starting timestamp and the ending timestamp to obtain semantic information corresponding to the video segments in the video.
In an optional embodiment, the obtaining module 1710 is configured to obtain a corresponding relationship, where the corresponding relationship is used to represent a corresponding relationship between an event after the multi-modal information combination and a target scenario event; and acquiring a target plot event according to the event combined by the corresponding relation and the multi-mode information, and determining the target plot event as the plot event of the video frame.
In an alternative embodiment, a video frame of the sequence of video frames includes a virtual environment picture including a picture of a virtual environment in which the first avatar is active and at least one control;
the extracting module 1720 is configured to invoke the first information extraction model to extract virtual element information from the virtual environment picture, where the virtual element information includes information corresponding to at least one of an active element and a static element in the virtual environment picture; and calling a second information extraction model to extract the game-play information from the control, wherein the game-play information comprises information corresponding to the first virtual role in the control.
In an optional embodiment, the virtual element information includes a role identifier corresponding to a first virtual role, and the first information extraction model includes an element classification model and a positioning and tracking model;
the obtaining module 1710, configured to obtain a first role identifier of a first virtual role participating in the opposite office;
the extracting module 1720 is configured to invoke the element classification model to identify a life value region corresponding to a first virtual role in the virtual environment picture, so as to obtain a second role identifier of the first virtual role corresponding to the life value region;
the processing module 1730 is configured to, in response to matching of the first role identifier and the second role identifier, obtain a first confidence corresponding to the second role identifier;
the extracting module 1720 is configured to invoke the positioning and tracking model to identify the first virtual character in the virtual environment picture according to the first character identifier, to obtain a third character identifier of the first virtual character displayed in the virtual environment picture, where the third character identifier corresponds to a second confidence level, and the positioning and tracking model corresponds to the first virtual character participating in the matching; and obtaining the role identification of the first virtual role according to the first confidence coefficient and the second confidence coefficient.
In an optional embodiment, the virtual element information includes position information corresponding to building elements in the virtual environment, and the first information extraction model includes a first convolutional neural network;
the extracting module 1720 is configured to extract building element features corresponding to building elements from a virtual environment picture; the processing module 1730 is configured to match the building element characteristics with preset building element characteristics to obtain first matching characteristics; the extracting module 1720 is configured to invoke the first convolutional neural network to classify the first matching feature, so as to obtain location information corresponding to the building element.
In an optional embodiment, the virtual element information includes position information corresponding to a neutral virtual character in the virtual environment, and the first information extraction model includes a second convolutional neural network;
the extraction module 1720 is configured to extract a character element feature corresponding to the neutral virtual character from the virtual environment picture; the processing module 1730 is configured to match the role element characteristics with preset role element characteristics to obtain second matching characteristics; the extracting module 1720 is configured to invoke a second convolutional neural network to classify the second matching feature, so as to obtain position information corresponding to the neutral virtual character.
In an optional embodiment, the match information comprises match ending information, and the second information extraction model comprises a video frame classification model;
the extraction module 1720 is configured to invoke a video frame classification model to classify a video frame sequence to obtain a video frame corresponding to the office ending state, where the video frame corresponding to the office ending state includes a control; performing character recognition on the control to obtain a character recognition result; and responding to the character recognition result comprising a game matching result, and acquiring game matching end information according to the game matching result.
In an optional embodiment, the match information comprises match state information between the first virtual roles, and the second information extraction model comprises a video frame classification model;
the extraction module 1720 is configured to invoke a video classification model to classify a video frame sequence to obtain a local image frame, where the local image frame includes a control; identifying the beating state in the control to obtain a beating state result, wherein the beating state result is an event result corresponding to a beating event generated between the first virtual characters; and acquiring game state information according to the hitting state result.
In an optional embodiment, a video frame in the video frame sequence includes a virtual environment picture area and a live broadcast picture area, the live broadcast picture area is used for representing a picture area for shooting a live broadcast process of a main broadcast, and the live broadcast picture area does not include the virtual environment picture area;
the processing module 1730 is configured to determine a boundary between a virtual environment picture area and a live view picture area; and cutting the live-broadcasting picture area according to the boundary to obtain a video frame corresponding to the picture area containing the virtual environment.
In an optional embodiment, the obtaining module 1710 is configured to obtain the video frame sequence after binarization processing; the processing module 1730 is configured to transform pixel points contained in the video frames of the binarized video frame sequence into the Hough space according to the Hough transform; in response to the number of curves intersecting at the same point in the Hough space being greater than or equal to a number threshold, determine that a straight line corresponding to that point exists in the video frame; and determine the boundary between the virtual environment picture area and the live picture area according to the straight line.
In an optional embodiment, the extraction module 1720 is configured to invoke the Backbone network to perform feature extraction on the video frames in the video frame sequence to obtain the video features corresponding to the video frames; the processing module 1730 is configured to invoke the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) to process the video features to obtain resolution-enhanced video frames.
In summary, in the apparatus provided in this embodiment, the target scenario event matched with the multi-modal information combination extracted from the video frame is determined as the scenario event of the video frame, so that the scenario event corresponding to the video frame is utilized to perform semantic annotation on the video frame, and the semantic information of the video frame can be annotated without a machine learning model obtained through sample video training, thereby improving the efficiency and accuracy of performing semantic annotation on the video frame, and enabling a user to quickly capture a highlight segment in the video frame according to the video frame annotated with the semantic information.
The device provided by the embodiment also corresponds the plot events and the time in the video to one another by combining the start time stamps and the end time stamps corresponding to the plot events, so that a user can conveniently and quickly position the plot events in the video frames according to the time stamps.
The device provided by the embodiment further extracts the corresponding multi-modal information in the video picture in a corresponding manner for different types of videos by refining the multi-modal information and associating the event after the target scenario event is combined with the multi-modal information, so that the extracted information is more in line with the video type, the target scenario event which is more in line with the video is matched, and the efficiency and the accuracy of performing semantic annotation on the video are improved.
The device provided by this embodiment further identifies the role identifier of the first virtual role in the video frame through a dual extraction mode of the positioning tracking model and the classification model, so that the server can accurately extract the role identifier of the first virtual role from the video frame, and a result of performing semantic annotation on the video frame by subsequently using the role identifier of the first virtual role is more accurate.
The device provided by the embodiment also extracts the position information corresponding to the building element in the virtual environment, determines the position of the building element in the virtual environment according to the position information corresponding to the building element, and can generate more game events when the building element interacts with the first virtual role, so that the semantic annotation has richer game events (video play books), and the efficiency and accuracy of performing the semantic annotation on the video frame are improved.
The device provided by this embodiment further extracts the position information corresponding to the neutral virtual character in the virtual environment, and indirectly determines the position of the first virtual character in the virtual environment according to the position information corresponding to the neutral virtual character, so that when the first virtual character interacts with the neutral virtual character, more game events can be generated, so that the semantic annotation has richer game events (video script), and the efficiency and accuracy of performing semantic annotation on video frames are improved.
The apparatus provided in this embodiment further extracts the session ending information from the video frame corresponding to the session ending status, and uses the session ending information as the ending information corresponding to the current session, or as the starting information corresponding to the next session, so as to split between two adjacent sessions, so that the server can accurately determine the starting time and the ending time of a session.
The device provided by the embodiment also identifies the event result of the striking event generated between the first virtual roles to obtain the striking state result, so that the office alignment information between the first virtual roles is extracted from the office alignment picture frame according to the striking state result, and the server can perform semantic annotation on the video frame by combining more comprehensive information through the office alignment information between the first virtual roles, thereby improving the efficiency and accuracy of performing semantic annotation on the video frame.
The device provided by this embodiment also determines the boundary between the virtual environment picture area and the live picture area in the video frame through the Hough transform, so that the server can crop the live picture area in the video frame, retain the virtual environment picture area, and concentrate on identifying the content in the virtual environment picture, thereby improving the speed of information extraction from the video frames and the efficiency of annotating semantic information.
The device provided by this embodiment also improves the resolution of the video through the ESRGAN network and the Backbone network, so that information extraction is performed on clearer video and the extracted multi-modal information is more accurate.
It should be noted that: the video semantic annotation device provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the video semantic annotation device provided by the above embodiment and the video semantic annotation method embodiment belong to the same concept, and the specific implementation process thereof is detailed in the method embodiment and will not be described herein again.
Fig. 18 shows a block diagram of a computer device 1800 provided in an exemplary embodiment of the present application. The computer device 1800 may be the terminal installed with the live-streaming client in the above embodiments, and may be a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), or an MP4 player (Moving Picture Experts Group Audio Layer IV). The computer device 1800 may also be referred to by other names such as user equipment, portable terminal, smart watch, smart robot, smart speaker, and so on.
Generally, computer device 1800 includes: a processor 1801 and a memory 1802.
The processor 1801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 1801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing content required to be displayed on the display screen. In some embodiments, the processor 1801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1802 may include one or more computer-readable storage media, which may be tangible and non-transitory. Memory 1802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1802 is used to store at least one instruction for execution by processor 1801 to implement the video semantic annotation methods provided herein.
In some embodiments, computer device 1800 may also optionally include: a peripheral interface 1803 and at least one peripheral. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1804, touch screen display 1805, camera assembly 1806, audio circuitry 1807, positioning component 1808, and power supply 1809.
The peripheral interface 1803 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1801 and the memory 1802. In some embodiments, the processor 1801, memory 1802, and peripheral interface 1803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1801, the memory 1802, and the peripheral device interface 1803 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 1804 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuitry 1804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1804 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The touch display 1805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The touch display 1805 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1801 as a control signal for processing. The touch display 1805 is used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one touch display 1805, disposed on the front panel of the computer device 1800; in other embodiments, there may be at least two touch displays 1805, respectively disposed on different surfaces of the computer device 1800 or in a folded design; in still other embodiments, the touch display 1805 may be a flexible display disposed on a curved or folded surface of the computer device 1800. The touch display 1805 may even be arranged in a non-rectangular, irregular pattern, that is, an irregularly shaped screen. The touch display 1805 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1806 is used to capture images or video. Optionally, the camera assembly 1806 includes a front camera and a rear camera. Generally, a front camera is used for realizing video call or self-shooting, and a rear camera is used for realizing shooting of pictures or videos. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera and a wide-angle camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting function and a VR (Virtual Reality) shooting function. In some embodiments, camera assembly 1806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1807 is used to provide an audio interface between a user and the computer device 1800. The audio circuitry 1807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1801 for processing or inputting the electric signals to the radio frequency circuit 1804 to achieve voice communication. The microphones may be multiple and placed at different locations on the computer device 1800 for stereo sound capture or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1801 or the radio frequency circuitry 1804 to sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 1807 may also include a headphone jack.
The positioning component 1808 is used to locate the current geographic location of the computer device 1800 to implement navigation or LBS (Location Based Service). The positioning component 1808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 1809 is used to power various components within the computer device 1800. The power supply 1809 may be ac, dc, disposable or rechargeable. When the power supply 1809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, computer device 1800 also includes one or more sensors 1810. The one or more sensors 1810 include, but are not limited to: acceleration sensor 1811, gyro sensor 1812, pressure sensor 1813, fingerprint sensor 1814, optical sensor 1815, and proximity sensor 1816.
The acceleration sensor 1811 detects the magnitude of acceleration in three coordinate axes of a coordinate system established with the computer apparatus 1800. For example, the acceleration sensor 1811 is used to detect the components of the gravitational acceleration in three coordinate axes. The processor 1801 may control the touch display 1805 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal of the acceleration sensor 1811 set. The acceleration sensor 1811 may be used for collection of motion data of a game or a user.
The gyro sensor 1812 may detect a body direction and a rotation angle of the computer device 1800, and the gyro sensor 1812 may collect a 3D motion of the user on the computer device 1800 together with the acceleration sensor 1811. The processor 1801 may implement the following functions according to the data collected by the gyro sensor 1812: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 1813 may be disposed on the side bezel of computer device 1800 and/or on the lower layer of touch display 1805. When the pressure sensor 1813 is disposed on a side frame of the computer apparatus 1800, a user's grip signal on the computer apparatus 1800 can be detected, and left-right hand recognition or shortcut operation can be performed based on the grip signal. When the pressure sensor 1813 is disposed at the lower layer of the touch display screen 1805, the operability control on the UI interface can be controlled according to the pressure operation of the user on the touch display screen 1805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1814 is used to collect a fingerprint of a user to identify the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1801 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 1814 may be disposed on the front, back, or side of the computer device 1800. When a physical key or vendor Logo is provided on the computer device 1800, the fingerprint sensor 1814 may be integrated with the physical key or vendor Logo.
The optical sensor 1815 is used to collect the ambient light intensity. In one embodiment, the processor 1801 may control the display brightness of the touch display 1805 based on the ambient light intensity collected by the optical sensor 1815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1805 is increased; when the ambient light intensity is low, the display brightness of the touch display 1805 is turned down. In another embodiment, the processor 1801 may also dynamically adjust the shooting parameters of the camera assembly 1806 according to the intensity of the ambient light collected by the optical sensor 1815.
A proximity sensor 1816, also known as a distance sensor, is typically disposed on the front face of the computer device 1800. The proximity sensor 1816 is used to measure the distance between the user and the front of the computer device 1800. In one embodiment, when the proximity sensor 1816 detects that the distance between the user and the front of the computer device 1800 gradually decreases, the processor 1801 controls the touch display 1805 to switch from the screen-on state to the screen-off state; when the proximity sensor 1816 detects that the distance between the user and the front of the computer device 1800 gradually increases, the processor 1801 controls the touch display 1805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 18 is not intended to be limiting with respect to the computer device 1800 and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components may be employed.
Fig. 19 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. The server may be the server 120 in the computer system 100 shown in fig. 1.
The server 1900 includes a Central Processing Unit (CPU) 1901, a system memory 1904 including a Random Access Memory (RAM) 1902 and a Read Only Memory (ROM) 1903, and a system bus 1905 connecting the system memory 1904 and the Central Processing Unit 1901. The server 1900 also includes a basic Input/Output System (I/O System) 1906 for facilitating information transfer between devices within the computer, and a mass storage device 1907 for storing an operating system 1913, application programs 1914, and other program modules 1915.
The basic input/output system 1906 includes a display 1908 for displaying information and an input device 1909, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1908 and the input device 1909 are both connected to the central processing unit 1901 through an input-output controller 1910 coupled to the system bus 1905. The basic input/output system 1906 may also include an input/output controller 1910 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1910 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1907 is connected to the central processing unit 1901 through a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and its associated computer-readable media provide non-volatile storage for the server 1900. That is, the mass storage device 1907 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Computer-readable media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, Solid State Drives (SSD), magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. The Random Access Memory may include a Resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1904 and the mass storage device 1907 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1900 may also run through a remote computer connected to a network, such as the Internet. That is, the server 1900 may be connected to the network 1912 through the network interface unit 1911 connected to the system bus 1905, or may be connected to another type of network or remote computer system (not shown) using the network interface unit 1911.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
In an alternative embodiment, a computer device is provided that includes a processor and a memory having at least one instruction, at least one program, set of codes, or set of instructions stored therein, the at least one instruction, at least one program, set of codes, or set of instructions being loaded and executed by the processor to implement the video semantic annotation method as described above.
In an alternative embodiment, a computer-readable storage medium is provided that has at least one instruction, at least one program, set of codes, or set of instructions stored therein, which is loaded and executed by a processor to implement the video semantic annotation method as described above.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are for description only and do not represent the merits of the embodiments.
The embodiments of the present application further provide a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the video semantic annotation method provided in the foregoing method embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video semantic annotation method provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions to cause the computer device to execute the video semantic annotation method as described above.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A video semantic annotation method, characterized in that the method comprises:
acquiring a video frame sequence in a video;
extracting multi-modal information in at least two dimensions from video frames in the sequence of video frames;
determining a target plot event matched in at least two plot events based on the combination of the multi-modal information of at least two dimensions as the plot event of the video frame;
and performing semantic annotation on the video frame sequence according to the plot events of the video frames to obtain semantic information of the video.
2. The method according to claim 1, wherein the performing semantic annotation on the video frame sequence according to the plot event of the video frame to obtain the semantic information of the video comprises:
determining a start timestamp and an end timestamp of a video segment to which a plurality of consecutive video frames having the same plot event belong in the video frame sequence;
and performing semantic annotation on the video segment according to the plot event, the start timestamp, and the end timestamp to obtain semantic information corresponding to the video segment in the video.
3. The method according to claim 2, wherein the determining, based on the combination of the multi-modal information in the at least two dimensions, a target plot event matched among at least two plot events as the plot event of the video frame comprises:
acquiring a correspondence, wherein the correspondence is used for representing the correspondence between an event obtained by combining the multi-modal information and the target plot event;
and acquiring the target plot event according to the correspondence and the event obtained by combining the multi-modal information, and determining the target plot event as the plot event of the video frame.
4. The method according to any one of claims 1 to 3, wherein the video frames in the video frame sequence comprise a virtual environment picture and at least one control, the virtual environment picture comprising a picture of the virtual environment in which a first virtual character is active;
the extracting multi-modal information in at least two dimensions from video frames in the video frame sequence comprises:
calling a first information extraction model to extract virtual element information from the virtual environment picture, wherein the virtual element information comprises information corresponding to at least one of an active element and a static element in the virtual environment picture;
and calling a second information extraction model to extract match information from the control, wherein the match information comprises information corresponding to the first virtual character in the control.
5. The method according to claim 4, wherein the virtual element information comprises a character identifier corresponding to the first virtual character, and the first information extraction model comprises an element classification model and a location tracking model;
the calling of the first information extraction model to extract the virtual element information from the virtual environment picture comprises the following steps:
acquiring a first character identifier of the first virtual character participating in the match;
calling the element classification model to identify a life value area corresponding to the first virtual character in the virtual environment picture, to obtain a second character identifier of the first virtual character corresponding to the life value area;
in response to the first character identifier matching the second character identifier, obtaining a first confidence corresponding to the second character identifier;
calling the location tracking model to identify the first virtual character in the virtual environment picture according to the first character identifier, to obtain a third character identifier of the first virtual character displayed in the virtual environment picture, wherein the third character identifier corresponds to a second confidence, and the location tracking models correspond one-to-one to the first virtual characters participating in the match;
and obtaining the character identifier of the first virtual character according to the first confidence and the second confidence.
6. The method of claim 4, wherein the virtual element information comprises location information corresponding to building elements in the virtual environment, and the first information extraction model comprises a first convolutional neural network;
the calling of the first information extraction model to extract the virtual element information from the virtual environment picture comprises the following steps:
extracting building element features corresponding to the building elements from the virtual environment picture;
matching the building element features with preset building element features to obtain first matching features;
and calling the first convolutional neural network to classify the first matching features to obtain the position information corresponding to the building elements.
7. The method of claim 4, wherein the virtual element information comprises location information corresponding to a neutral virtual character in the virtual environment, and the first information extraction model comprises a second convolutional neural network;
the calling of the first information extraction model to extract the virtual element information from the virtual environment picture comprises the following steps:
extracting character element features corresponding to the neutral virtual character from the virtual environment picture;
matching the character element features with preset character element features to obtain second matching features;
and calling the second convolutional neural network to classify the second matching features to obtain the position information corresponding to the neutral virtual character.
8. The method according to claim 4, wherein the match information comprises match end information, and the second information extraction model comprises a video frame classification model;
the calling of the second information extraction model to extract the match information from the control comprises the following steps:
calling the video frame classification model to classify the video frame sequence to obtain a video frame corresponding to a match end state, wherein the video frame corresponding to the match end state comprises the control;
performing character recognition on the control to obtain a character recognition result;
and in response to the character recognition result comprising a match result, acquiring the match end information according to the match result.
9. The method according to claim 4, wherein the match information comprises match status information between the first virtual characters, and the second information extraction model comprises a video frame classification model;
the calling of the second information extraction model to extract the match information from the control comprises the following steps:
calling the video frame classification model to classify the video frame sequence to obtain an in-match picture frame, wherein the in-match picture frame comprises the control;
identifying a hit state in the control to obtain a hit state result, wherein the hit state result is an event result corresponding to a hit event occurring between the first virtual characters;
and obtaining the match status information according to the hit state result.
10. The method according to any one of claims 1 to 3, wherein the video frames in the video frame sequence comprise a virtual environment picture area and a live-broadcast picture area, the live-broadcast picture area being used to represent a picture area in which the live video of the anchor is captured, and the live-broadcast picture area not comprising the virtual environment picture area;
before the extracting multi-modal information in at least two dimensions from video frames in the video frame sequence, the method comprises the following steps:
determining a boundary between the virtual environment picture area and the live-broadcast picture area;
and cropping out the live-broadcast picture area according to the boundary to obtain a video frame containing the virtual environment picture area.
11. The method according to claim 10, wherein the determining a boundary between the virtual environment picture area and the live-broadcast picture area comprises:
acquiring the binarized video frame sequence;
converting, according to the Hough transform, pixel points contained in the video frames of the binarized video frame sequence into a Hough space;
in response to the number of curves intersecting at a same pixel point in the Hough space being greater than or equal to a number threshold, determining that a straight line corresponding to the pixel point exists in the video frame;
and determining the boundary between the virtual environment picture area and the live-broadcast picture area according to the straight line corresponding to the pixel point.
12. The method according to any of claims 1 to 3, wherein before extracting the multi-modal information in at least two dimensions from the video frames in the sequence of video frames, the method further comprises:
calling a Backbone network Backbone to perform feature extraction on the video frames in the video frame sequence to obtain video features corresponding to the video frames;
and calling an Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) to process the video features to obtain a video frame with enhanced resolution.
13. An apparatus for semantic annotation of video, the apparatus comprising:
the acquisition module is used for acquiring a video frame sequence in a video;
an extraction module for extracting multi-modal information of at least two dimensions from video frames in the sequence of video frames;
the processing module is used for determining a target plot event matched from at least two plot events based on the combination of the multi-modal information of at least two dimensions as the plot event of the video frame;
and the marking module is used for carrying out semantic marking on the video frame sequence according to the plot events of the video frames to obtain semantic information of the video.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the instruction, the program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video semantic annotation method according to any one of claims 1 to 12.
15. A computer-readable storage medium, in which a computer program is stored, the computer program being loaded and executed by a processor to implement the video semantic annotation method according to any one of claims 1 to 12.
CN202110002075.3A 2021-01-04 2021-01-04 Video semantic annotation method, device, equipment and storage medium Pending CN113392690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110002075.3A CN113392690A (en) 2021-01-04 2021-01-04 Video semantic annotation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110002075.3A CN113392690A (en) 2021-01-04 2021-01-04 Video semantic annotation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113392690A true CN113392690A (en) 2021-09-14

Family

ID=77616688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110002075.3A Pending CN113392690A (en) 2021-01-04 2021-01-04 Video semantic annotation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113392690A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642536A (en) * 2021-10-13 2021-11-12 腾讯科技(深圳)有限公司 Data processing method, computer device and readable storage medium
CN113821690A (en) * 2021-11-23 2021-12-21 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device, electronic equipment and storage medium
CN116189065A (en) * 2023-04-27 2023-05-30 苏州浪潮智能科技有限公司 DAVIS-oriented data calibration method and device, electronic equipment and medium
CN116189065B (en) * 2023-04-27 2023-07-14 苏州浪潮智能科技有限公司 DAVIS-oriented data calibration method and device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
US11482192B2 (en) Automated object selection and placement for augmented reality
CN112351302B (en) Live broadcast interaction method and device based on cloud game and storage medium
RU2605840C2 (en) Automatic design of proposed mini-games for cloud games based on recorded game process
CN113392690A (en) Video semantic annotation method, device, equipment and storage medium
JP2020515316A (en) Creation, broadcasting, and viewing of 3D content
CN108245891B (en) Head-mounted equipment, game interaction platform and table game realization system and method
CN109529356B (en) Battle result determining method, device and storage medium
KR20180022866A (en) Integration of the specification and game systems
CN110523079A (en) The trivial games for cloud game suggestion are automatically generated based on the game play recorded
CN114125483B (en) Event popup display method, device, equipment and medium
CN113485617B (en) Animation display method and device, electronic equipment and storage medium
CN113490004B (en) Live broadcast interaction method and related device
CN111672099A (en) Information display method, device, equipment and storage medium in virtual scene
CN111729307A (en) Virtual scene display method, device, equipment and storage medium
US20220174361A1 (en) Spectator filter video compositing
CN113058264A (en) Virtual scene display method, virtual scene processing method, device and equipment
CN112261481A (en) Interactive video creating method, device and equipment and readable storage medium
CN112287848A (en) Live broadcast-based image processing method and device, electronic equipment and storage medium
CN114288654A (en) Live broadcast interaction method, device, equipment, storage medium and computer program product
CN114095742A (en) Video recommendation method and device, computer equipment and storage medium
CN108399653A (en) augmented reality method, terminal device and computer readable storage medium
CN116963809A (en) In-game dynamic camera angle adjustment
CN116980723A (en) Video highlight generation method, device, equipment and medium for electronic competition
US11845012B2 (en) Selection of video widgets based on computer simulation metadata
CN114466213B (en) Information synchronization method, device, computer equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051773

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination