CN114339362A - Video bullet screen matching method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN114339362A
Application number: CN202111494410.2A
Authority: CN (China)
Prior art keywords: video, target, feature, video frame, matched
Other languages: Chinese (zh)
Other versions: CN114339362B
Inventor: 张皓
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Publication of CN114339362A; application granted and published as CN114339362B
Legal status: Granted, Active

Abstract

The application relates to a video bullet screen matching method and device, a computer device, a storage medium and a computer program product. The method includes: extracting an initial text feature corresponding to a real-time bullet screen of a target video; determining, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, and acquiring a fused video feature corresponding to each video segment to be matched, the fused video feature being obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched; calculating, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched; determining a target video segment from the video segments to be matched based on the matching degrees; and establishing an association relationship between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played. With this method, the accuracy of matching bullet screens to video segments can be improved.

Description

Video bullet screen matching method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video bullet screen matching method and apparatus, a computer device, and a storage medium.
Background
With the development of network media, bullet screen (danmaku) technology has emerged. A bullet screen is a comment subtitle that pops up and scrolls over a video while the video is watched online. Bullet screens allow users to see other viewers' comments in real time while watching a video, and are a novel form of information interaction.
In the conventional technology, a bullet screen is usually matched to a video segment according to the time at which the bullet screen is posted, and the bullet screen posted by a user is displayed in the video segment corresponding to that posting time. However, by the time a user finishes posting a bullet screen, the video plot it comments on may already have been played, so matching the bullet screen to a video segment based only on its posting time may mismatch the bullet screen and the video segment.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video bullet screen matching method and device, a computer-readable storage medium and a computer program product that can improve the accuracy of matching bullet screens to video segments.
In one aspect, the present application provides a video bullet screen matching method, including:
acquiring a real-time bullet screen corresponding to a target video, and extracting an initial text feature corresponding to the real-time bullet screen;
determining, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, and acquiring a fused video feature corresponding to each video segment to be matched; the fused video feature is obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence includes a target video frame feature corresponding to each video frame in the same video segment to be matched;
calculating, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched;
determining a target video segment from the video segments to be matched based on the matching degrees;
and establishing an association relationship between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played.
In another aspect, the present application further provides a video bullet screen matching device, the device including:
a bullet screen processing module, configured to acquire a real-time bullet screen corresponding to a target video and extract an initial text feature corresponding to the real-time bullet screen;
a video feature acquisition module, configured to determine, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, and acquire a fused video feature corresponding to each video segment to be matched; the fused video feature is obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence includes a target video frame feature corresponding to each video frame in the same video segment to be matched;
a matching degree calculation module, configured to calculate, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched;
a target video segment determination module, configured to determine a target video segment from the video segments to be matched based on the matching degrees;
and an association relationship establishing module, configured to establish an association relationship between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played.
The application also provides a video bullet screen matching method, which comprises the following steps:
acquiring a real-time bullet screen corresponding to a target video;
sending the real-time bullet screen to a server, so that the server extracts an initial text feature corresponding to the real-time bullet screen, determines, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, acquires a fused video feature corresponding to each video segment to be matched, calculates, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched, determines a target video segment from the video segments to be matched based on the matching degrees, and establishes an association relationship between the real-time bullet screen and the target video segment; the fused video feature is obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence includes a target video frame feature corresponding to each video frame in the same video segment to be matched;
and acquiring the association relationship returned by the server, and, based on the association relationship, playing the real-time bullet screen synchronously when the target video segment is played.
The application further provides a video bullet screen matching device, the device including:
a bullet screen acquisition module, configured to acquire a real-time bullet screen corresponding to a target video;
a data matching module, configured to send the real-time bullet screen to a server, so that the server extracts an initial text feature corresponding to the real-time bullet screen, determines, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, acquires a fused video feature corresponding to each video segment to be matched, calculates, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched, determines a target video segment from the video segments to be matched based on the matching degrees, and establishes an association relationship between the real-time bullet screen and the target video segment; the fused video feature is obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence includes a target video frame feature corresponding to each video frame in the same video segment to be matched;
and a bullet screen playing module, configured to acquire the association relationship returned by the server and, based on the association relationship, play the real-time bullet screen synchronously when the target video segment is played.
A computer device includes a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above video bullet screen matching method when executing the computer program.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above video bullet screen matching method.
A computer program product includes a computer program which, when executed by a processor, implements the steps of the above video bullet screen matching method.
According to the video bullet screen matching method and device, the computer device and the storage medium, a real-time bullet screen corresponding to a target video is acquired and an initial text feature corresponding to the real-time bullet screen is extracted; a plurality of video segments to be matched that correspond to the real-time bullet screen are determined from the target video, and a fused video feature corresponding to each video segment to be matched is acquired, the fused video feature being obtained by performing feature fusion on the target video frame feature sequence extracted from, and containing a target video frame feature for, each video frame of the same video segment to be matched; a matching degree between the real-time bullet screen and each video segment to be matched is calculated based on the initial text feature and the fused video features; a target video segment is determined from the video segments to be matched based on the matching degrees; and an association relationship is established between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played. In this way, whenever the latest real-time bullet screen posted by a user watching the target video is acquired, the target video segment whose content matches the real-time bullet screen is accurately determined through the matching degree calculated from the text feature of the real-time bullet screen and the video features of the video segments, and the real-time bullet screen is then played synchronously when the target video segment is played. This improves the accuracy of matching bullet screens to video segments, so that the bullet screens and the video segments of the target video are always played in accurate correspondence.
Drawings
FIG. 1 is a diagram of an application environment of a video bullet screen matching method in one embodiment;
FIG. 2 is a flowchart illustrating a video bullet screen matching method in one embodiment;
FIG. 3 is a diagram of slicing a video in one embodiment;
FIG. 4 is a schematic flow diagram illustrating the generation of fused video features in one embodiment;
FIG. 5A is a schematic illustration of feature shifting in one embodiment;
FIG. 5B is a schematic illustration of feature shifting in another embodiment;
FIG. 6 is a schematic flow chart of feature fusion in one embodiment;
FIG. 7 is a diagram illustrating a text processing model training process in one embodiment;
FIG. 8 is a flowchart illustrating a video bullet screen matching method in another embodiment;
FIG. 9A is a schematic diagram of a bullet screen interface in one embodiment;
FIG. 9B is a schematic diagram of a bullet screen interface in another embodiment;
FIG. 10 is a system diagram of a video bullet screen matching method in one embodiment;
FIG. 11 is a block diagram showing the structure of a video bullet screen matching device in one embodiment;
FIG. 12 is a block diagram showing the structure of a video bullet screen matching device in one embodiment;
FIG. 13 is a diagram showing the internal structure of a computer device in one embodiment;
FIG. 14 is a diagram showing the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets and to perform further image processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning.
The solutions provided in the embodiments of the present application involve artificial intelligence technologies such as computer vision and machine learning, and are described in detail in the following embodiments:
the video bullet screen matching method provided by the application can be applied to the application environment shown in fig. 1. The cast terminal 102 communicates with the server 104 through a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server. The playing terminal 102 may be, but is not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart televisions, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The playing terminal may be provided with an application program, where the application program may refer to a client installed in the terminal, and the client (which may also be referred to as an application client and an APP client) refers to a program installed and running in the terminal; the application program can also be an installation-free application program, namely, the application program can be used without downloading and installing, and the application program is also commonly called an applet and usually runs in a client as a subprogram; an application may also refer to a web application that is opened through a browser; and so on. The various applications described above are divided according to the application functions they provide, and the types of applications may include, but are not limited to: instant messaging applications, audio and video applications, and the like. The server 104 may be implemented as a stand-alone server or a server cluster consisting of a plurality of servers or a cloud server.
The play terminal 102 and the server 104 can be used separately to execute the video bullet screen matching method provided in the embodiment of the present application.
For example, the server obtains a real-time bullet screen corresponding to the target video, extracts initial text features corresponding to the real-time bullet screen, determines a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, obtains fusion video features corresponding to the video segments to be matched, and calculates the matching degree between the real-time bullet screen and each video segment to be matched based on the initial text features and the fusion video features. The fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video segment to be matched. And then, the server determines a target video segment from the video segments to be matched based on the matching degree, and establishes an association relation between the real-time barrage and the target video segment, wherein the association relation is used for synchronously playing the real-time barrage when the target video segment is played.
The play terminal 102 and the server 104 may also be cooperatively used to execute the video bullet screen matching method provided in the embodiment of the present application. For example, the playing terminal obtains a real-time bullet screen corresponding to the target video and sends the real-time bullet screen to the server. And the server determines a target video segment from a plurality of video segments to be matched corresponding to the real-time bullet screen through data processing, and establishes an incidence relation between the real-time bullet screen and the target video segment. The server can send the association relation to the playing terminal, so that the playing terminal can synchronously play the real-time barrage when playing the target video clip.
In one embodiment, as shown in FIG. 2, a video bullet screen matching method is provided. The method is described below as being executed by a computer device; it can be understood that the computer device may be the playing terminal 102 shown in FIG. 1, or may be the server 104. In this embodiment, the video bullet screen matching method includes the following steps:
step S202, a real-time bullet screen corresponding to the target video is obtained, and initial text features corresponding to the real-time bullet screen are extracted.
The target video refers to the video currently played by the playing terminal. The real-time bullet screen refers to the latest bullet screen acquired by the playing terminal in real time, that is, a bullet screen posted in real time by a user watching the video. A user can send a comment at any time while watching the video, and a comment sent by any user can be displayed in real time as scrolling subtitles on all playing terminals that are playing the video, which increases interactivity among viewers. The real-time bullet screen may be input by typing, by voice, and the like.
The initial text features are obtained by extracting the features of the real-time bullet screen and can reflect the text content of the real-time bullet screen.
Specifically, the user can watch the target video on the playing terminal and release the bullet screen at any time. The computer equipment can acquire the real-time barrage published by the user when watching the target video, and extract the characteristics of the real-time barrage to obtain the initial text characteristics corresponding to the real-time barrage.
The computer device may extract the initial text features corresponding to the real-time bullet screen through a machine learning algorithm, for example, the computer device may extract the initial text features corresponding to the real-time bullet screen through a machine learning model. The computer device may input the real-time bullet screen into the machine learning model, with the output or intermediate data of the machine learning model as the initial text features. For example, if the machine learning model is a text feature extraction model, the output of the text feature extraction model may be used as the initial text feature, and if the machine learning model is a text classification model, the output of the feature extraction layer in the text classification model may be used as the initial text feature, that is, the intermediate data of the text classification model is used as the initial text feature.
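For concreteness, the following is a minimal sketch of one way to extract an initial text feature from a real-time bullet screen, assuming a pretrained BERT-style Chinese encoder from the Hugging Face transformers library and mean pooling of token embeddings; the description does not prescribe a particular model, so the model name and pooling choice here are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative encoder choice; the description does not name a specific model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def extract_initial_text_feature(bullet_screen_text: str) -> torch.Tensor:
    """Encode a real-time bullet screen into a fixed-length initial text feature."""
    inputs = tokenizer(bullet_screen_text, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the token embeddings; intermediate data of a text model can
    # serve as the initial text feature, as the description notes.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

feature = extract_initial_text_feature("主角终于出场了")
print(feature.shape)  # torch.Size([768]) for this encoder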
Step S204, determining, from the target video, a plurality of video segments to be matched that correspond to the real-time bullet screen, and acquiring a fused video feature corresponding to each video segment to be matched; the fused video feature is obtained by performing feature fusion on a target video frame feature sequence corresponding to the video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence includes a target video frame feature corresponding to each video frame in the same video segment to be matched.
The target video can be divided into a plurality of video segments, and a video segment to be matched is a video segment that needs to be matched against the real-time bullet screen. Each video segment of the target video may be taken as one of the plurality of video segments to be matched corresponding to the real-time bullet screen. Alternatively, among the video segments of the target video, the video segments containing the adjacent video frames corresponding to the real-time bullet screen may be taken as the video segments to be matched. The adjacent video frames corresponding to the real-time bullet screen are the video frames whose playing time is within a preset time distance of the posting time of the real-time bullet screen, that is, several video frames played consecutively before and after the posting time of the real-time bullet screen. For example, the ten video frames played last before the posting time of the real-time bullet screen and the ten video frames played first after the posting time are obtained as the adjacent video frames, the adjacent video frames are sorted by their video frame timestamps, and every four consecutive adjacent video frames in the sorted result are taken as one video segment to be matched, yielding five video segments to be matched. A sketch of this grouping is given below.
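The grouping in the example above can be sketched as follows; the helper name, the 10-frame window and the 4-frame group size simply mirror the example and are not fixed by the description.

def candidate_segments(frame_timestamps, post_time, window=10, group=4):
    """Collect frames played shortly before and after the bullet screen
    posting time and group them into video segments to be matched."""
    before = [t for t in frame_timestamps if t <= post_time][-window:]
    after = [t for t in frame_timestamps if t > post_time][:window]
    frames = sorted(before + after)
    return [frames[i:i + group] for i in range(0, len(frames), group)]

# 21 frames at 1-second spacing, bullet screen posted at 10.5 s:
# yields five segments of four adjacent frames each.
print(candidate_segments([float(i) for i in range(21)], post_time=10.5))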
It can be understood that each video segment of the target video may include at least one video frame, and the video segments may be randomly divided, or obtained by performing video segmentation on the target video based on a particular video frame in the target video as a segmented video frame. The special video frame may specifically include at least one of a black frame, a scene change frame, and the like in the target video. The computer device may identify a particular video frame in the target video based on a custom algorithm or formula.
The target video frame feature sequence includes the target video frame feature corresponding to each video frame in the same video segment to be matched. The target video frame features in the sequence may be ordered or unordered. Feature extraction is performed on each video frame in the video segment to be matched, and the target video frame feature sequence is obtained based on the extracted video frame features. The video frame features obtained by feature extraction may be used directly as the target video frame features and combined into the target video frame feature sequence; for example, three-dimensional convolution is applied to the video frames in the video segment to be matched to obtain the target video frame feature of each video frame, and these features are combined into the target video frame feature sequence. Alternatively, the video frame features obtained by feature extraction may be used as initial video frame features and further processed to obtain the target video frame features; for example, feature extraction is performed on each video frame in the video segment to be matched to obtain initial video frame features, feature shifting is applied to the initial video frame features to obtain intermediate video frame features, two-dimensional convolution is applied to the intermediate video frame features to obtain the target video frame features, and the target video frame features are combined into the target video frame feature sequence. It can be understood that video frame features represent data at the video frame level, represent local features of a video segment, and can express the semantic information of a video frame.
The fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video segment to be matched. The fused video features represent data at a video level, represent global features of video clips, and can represent semantic information of the whole video clip. Feature fusion is used to compress data, convert video frame-level data to video-level data, and fuse local features into global features. It can be understood that a target video frame feature sequence corresponding to a video segment to be matched is composed of a plurality of target video frame features, and feature fusion is performed on the target video frame feature sequence, so that data composed of a plurality of feature vectors can be fused into a feature vector to represent the data.
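As a minimal illustration of compressing frame-level data into video-level data, the sketch below averages a sequence of target video frame features into a single fused vector; average pooling is only one possible fusion operator, and the feature dimensions are made up for the example.

import numpy as np

def fuse_frame_features(frame_feature_sequence: np.ndarray) -> np.ndarray:
    """Fuse a (num_frames, feature_dim) target video frame feature sequence
    into one video-level feature vector by average pooling."""
    return frame_feature_sequence.mean(axis=0)

frame_feature_sequence = np.random.rand(16, 512)  # 16 frames, 512-d features
fused_video_feature = fuse_frame_features(frame_feature_sequence)
print(fused_video_feature.shape)  # (512,)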
Specifically, after the real-time bullet screen corresponding to the target video is obtained, the computer device may first determine a plurality of video segments to be matched corresponding to the real-time bullet screen in the target video, then obtain the fusion video features corresponding to the respective video segments to be matched, subsequently match the bullet screen with the video features through the text features of the bullet screen and the video features of the video segments, and determine the bullet screen and the video segments which are matched with each other.
It can be understood that the fusion video features may be pre-calculated before the real-time bullet screen is obtained, for example, in order to improve matching efficiency, the computer device may pre-segment the target video to obtain a plurality of video segments, pre-extract a target video frame feature sequence corresponding to each video segment, respectively perform feature fusion on the target video frame feature sequences corresponding to each video segment to obtain fusion video features corresponding to each video segment, and store each fusion video feature. Then, after obtaining the real-time bullet screen corresponding to the target video, the computer device determines a plurality of video segments to be matched corresponding to the real-time bullet screen, for example, each video segment of the target video is used as a video segment to be matched corresponding to the real-time bullet screen, and then, the fusion video features corresponding to each video segment to be matched can be directly obtained from the pre-stored data.
The fusion video features may also be obtained by real-time calculation after the real-time barrage is obtained, for example, after the real-time barrage corresponding to the target video is obtained, the computer device determines a plurality of video segments to be matched corresponding to the real-time barrage, extracts a target video frame feature sequence corresponding to each video segment to be matched, and performs feature compression on the target video frame feature sequence corresponding to each video segment to obtain the fusion video features corresponding to each video segment.
The computer device may perform associated storage on the related data of the target video and the video identifier of the target video, for example, perform associated storage on each video segment of the target video and the video identifier of the target video, so that each video segment of the target video, and even each feature corresponding to each video segment, may be found based on the video identifier. The real-time barrage received by the computer equipment can carry the video identification corresponding to the target video, and then the computer equipment can determine the target video corresponding to the real-time barrage based on the video identification, so that the related data corresponding to the target video can be obtained. The real-time bullet screen that computer equipment received can also carry bullet screen time of posting.
Step S206, calculating, based on the initial text feature and the fused video features, a matching degree between the real-time bullet screen and each video segment to be matched.
The matching degree refers to the degree to which the real-time bullet screen matches a video segment to be matched, and can be expressed as a matching score. It can be understood that the larger the matching degree between a bullet screen and a video segment, the more similar and better matched their contents are.
Specifically, after obtaining the initial text feature corresponding to the real-time bullet screen and the fused video feature corresponding to each video segment to be matched, the computer device may calculate the matching degree between the real-time bullet screen and the video segment to be matched based on the initial text feature and the fused video feature; for example, the feature similarity between the initial text feature and the fused video feature is calculated and used as the matching degree. The computer device may calculate the matching degree using a custom algorithm or formula, or through a machine learning algorithm; for example, the initial text feature corresponding to the real-time bullet screen and the fused video feature corresponding to a video segment to be matched are input into a trained video-text matching model, and the output of the model is used as the matching degree between the real-time bullet screen and that video segment. Since there are multiple video segments to be matched corresponding to the real-time bullet screen, a matching degree is calculated between the real-time bullet screen and each of them, so that multiple matching degrees are finally obtained.
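A minimal sketch of the feature-similarity option is shown below, using cosine similarity as the matching degree and assuming the text feature and the fused video feature have already been projected into a common space of the same dimension; a trained video-text matching model, as mentioned above, could replace this similarity.

import numpy as np

def matching_degree(text_feature: np.ndarray, video_feature: np.ndarray) -> float:
    """Cosine similarity between the initial text feature and a fused video
    feature, used here as the matching degree."""
    denom = np.linalg.norm(text_feature) * np.linalg.norm(video_feature)
    return float(np.dot(text_feature, video_feature) / denom) if denom else 0.0

text_feature = np.random.rand(512)
fused_video_features = [np.random.rand(512) for _ in range(5)]  # 5 candidate segments
degrees = [matching_degree(text_feature, f) for f in fused_video_features]
print(degrees)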
Step S208, determining a target video segment from the video segments to be matched based on the matching degrees.
Specifically, the computer device may determine the target video segment from the video segments to be matched based on the matching degrees. The target video segment is the video segment that matches the real-time bullet screen best or well; the plot of the target video segment is consistent with the content of the real-time bullet screen, and the real-time bullet screen can be considered to belong to the target video segment.
In one embodiment, determining a target video segment from the video segments to be matched based on the matching degree comprises: and acquiring the video clip to be matched corresponding to the maximum matching degree from all the matching degrees as a target video clip.
Specifically, when the target video segment is determined, the computer device may select the video segment to be matched corresponding to the maximum matching degree from the matching degrees as the target video segment, so that the most matched video segment to be matched is used as the target video segment corresponding to the real-time barrage.
It can be understood that the computer device may also select at least one to-be-matched video segment with a matching degree greater than a preset matching degree as the target video segment. One bullet screen can be matched with at least one video segment, and the same bullet screen can be played synchronously with at least one video segment. Of course, the matching degree may also be a label used for indicating whether to match, and the to-be-matched video segment corresponding to the label indicating the matching is obtained as the target video segment.
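Both selection variants described above (taking the video segment with the maximum matching degree, or taking every segment whose matching degree exceeds a preset value) can be sketched as follows; the numbers are illustrative only.

def select_target_segments(degrees, threshold=None):
    """Return the index of the best-matching segment, or, when a threshold is
    given, the indices of all segments whose matching degree exceeds it."""
    if threshold is None:
        return [max(range(len(degrees)), key=degrees.__getitem__)]
    return [i for i, d in enumerate(degrees) if d > threshold]

degrees = [0.12, 0.55, 0.80, 0.33, 0.41]
print(select_target_segments(degrees))                 # [2]  (maximum matching degree)
print(select_target_segments(degrees, threshold=0.4))  # [1, 2, 4]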
Step S210, establishing an association relationship between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played.
Specifically, after determining the target video segment corresponding to the real-time bullet screen, the computer device may establish an association relationship between the real-time bullet screen and the target video segment, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played. Thus, whenever any playing terminal subsequently plays the target video and reaches the target video segment, the corresponding real-time bullet screen is played synchronously, finally achieving calibration of the bullet screen time. It can be understood that the real-time bullet screen can be played synchronously with any video frame of the target video segment; for example, it can be played synchronously with the starting video frame of the target video segment, that is, played at the beginning of the target video segment.
In one embodiment, there may be a plurality of playing terminals corresponding to the target video, and if the computer device is a server, each playing terminal may send a respective real-time barrage to the server, so that the server performs matching between the barrage and the video clip. The server can send the association relation to all the playing terminals every time the association relation between one real-time barrage and the corresponding target video segment is determined, so that all the playing terminals can synchronously play the corresponding real-time barrages when playing the target video segment. If the computer device is a playing terminal, any playing terminal can match the bullet screen with the video clip locally after acquiring the real-time bullet screen sent by the user. The playing terminal can send the association relation to the server every time the association relation between one real-time barrage and the corresponding target video segment is determined, so that the server can send the association relation to other playing terminals, and finally, all the playing terminals synchronously play the corresponding real-time barrages when playing the target video segment.
For a given real-time bullet screen, after the playing terminal that posted it determines the corresponding association relationship, if that playing terminal has already played the target video segment corresponding to the real-time bullet screen, the playing terminal can remind the user through interactive information that the video segment corresponding to the content of the posted bullet screen has already been played, for example by popping up a small window on the playing interface of the target video and reminding the user with text in that window. The interactive information can further carry the position information of the target video segment within the target video; by triggering the position information, the user can jump back to the target video segment and replay it to view the posted bullet screen content. For the other playing terminals, if the target video segment corresponding to the real-time bullet screen has already been played, their users do not need to be reminded; if it has not yet been played, the real-time bullet screen is played synchronously with the corresponding target video segment.
The number of the playing terminals corresponding to the target video can be multiple, and all users can issue the barrage anytime and anywhere when watching the target video, so that the number of the barrages corresponding to the target video can be multiple. All the bullet screens corresponding to the target video can determine the corresponding target video segments through the method, and all the bullet screens corresponding to the target video can be played synchronously with the corresponding target video segments.
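For illustration only, an association relationship like the one established in step S210 could be stored as a small record such as the one below; the field names and the time representation are assumptions, not taken from the description.

from dataclasses import dataclass

@dataclass
class BulletScreenAssociation:
    """Hypothetical record linking a real-time bullet screen to its target video segment."""
    bullet_screen_id: str
    video_id: str
    segment_start: float  # start of the target video segment, in seconds
    segment_end: float    # end of the target video segment, in seconds
    display_at: float     # playback position at which to show the bullet screen

# e.g. show the bullet screen at the starting video frame of the target segment
association = BulletScreenAssociation("bs_001", "video_42", 63.0, 71.5, display_at=63.0)
print(association)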
In the above video bullet screen matching method, the real-time bullet screen corresponding to the target video is acquired and the initial text feature corresponding to the real-time bullet screen is extracted; a plurality of video segments to be matched that correspond to the real-time bullet screen are determined from the target video, and the fused video feature corresponding to each video segment to be matched is acquired, the fused video feature being obtained by performing feature fusion on the target video frame feature sequence extracted from the video frames of the same video segment to be matched; the matching degree between the real-time bullet screen and each video segment to be matched is calculated based on the initial text feature and the fused video features; the target video segment is determined from the video segments to be matched based on the matching degrees; and the association relationship between the real-time bullet screen and the target video segment is established, the association relationship being used to play the real-time bullet screen synchronously when the target video segment is played. In this way, whenever the latest real-time bullet screen posted by a user watching the target video is acquired, the target video segment whose content matches the real-time bullet screen is accurately determined through the matching degree calculated from the text feature of the bullet screen and the video features of the video segments, and the real-time bullet screen is played synchronously when the target video segment is played, which improves the accuracy of matching bullet screens to video segments and keeps the bullet screens and video segments of the target video accurately matched during playback.
It can be understood that the bullet screen matching method of the application can also be applied to calibrating bullet screen time of historical bullet screens.
In one embodiment, determining a plurality of video segments to be matched corresponding to real-time bullet screens from a target video includes:
determining a segmentation video frame from each target video frame based on pixel information corresponding to each target video frame of the target video; performing video segmentation on a target video based on a segmented video frame to obtain a plurality of initial video segments; and determining a plurality of video clips to be matched corresponding to the real-time bullet screens from each initial video clip.
The target video frame refers to any one video frame in the target video. The pixel information corresponding to the target video frame is obtained based on the pixel values of all the pixel points in the target video frame, and the pixel values of all the pixel points comprise the pixel values of all the pixel points in at least one color space.
Specifically, the computer device may segment the target video, segment the target video into a plurality of initial video segments, and determine a video segment to be matched corresponding to the real-time barrage from each of the initial video segments. When video segmentation is performed, the computer device may determine segmented video frames from each target video frame based on pixel information corresponding to each target video frame of the target video, where the pixel information of the segmented video frames meets a preset condition, and has certain characteristics and features. For example, a black frame in the target video may be determined from each target video frame based on the pixel information, the black frame being typically used for a transition in the video, the black frame being treated as a sliced video frame. Or determining a scene switching frame in the target video from each target video frame based on the pixel information, and taking the scene switching frame as a segmentation video frame. The pixel characteristics of special video frames such as black frames and scene switching frames can be extracted as the preset conditions for determining the segmentation of the video frames.
After determining the segmented video frame, the computer device may perform video segmentation on the target video based on the segmented video frame, and segment the target video into a plurality of initial video segments with the segmented video frame as a segmentation point. The initial video segment obtained by segmentation may or may not contain the segmented video frame, for example, the initial video segment may not contain a black frame. Furthermore, the computer device may determine a plurality of to-be-matched video segments corresponding to the real-time barrage from each initial video segment, for example, each initial video segment may be respectively used as the to-be-matched video segment corresponding to the real-time barrage, or a plurality of video segments may be selected from the initial video segments and respectively used as the to-be-matched video segments corresponding to the real-time barrage.
It can be understood that the video segmentation can be performed on the target video in advance before the real-time barrage is acquired, or the video segmentation can be performed on the target video after the real-time barrage is acquired.
In the above embodiment, the segmentation video frames are determined from each target video frame based on the pixel information corresponding to each target video frame of the target video, the video segmentation is performed on the target video based on the segmentation video frames, the target video frames with strong relevance can be divided into the same initial video segment, the target video frames with weak relevance are divided into different initial video segments, and the determination of the video segment to be matched corresponding to the real-time barrage from such initial video segments is helpful for improving the matching accuracy of the barrage and the video segments.
In one embodiment, determining a split video frame from each target video frame based on pixel information corresponding to each target video frame of a target video includes:
acquiring a first pixel value of each pixel point in each target video frame in a first color space; counting first pixel values corresponding to the same target video frame to obtain pixel information corresponding to each target video frame; and taking the target video frame with the pixel information smaller than the first threshold value as a segmentation video frame.
The first color space is an RGB color space. The first threshold for determining the sliced video frame may be set according to actual needs.
Specifically, when performing video slicing, the computer device may treat the black frame in the target video as a sliced video frame. The computer device can obtain first pixel values of all pixel points in all target video frames in the RGB color space, count all the first pixel values corresponding to the same target video frame, and calculate the average value of the RGB pixels of each target video frame to obtain pixel information corresponding to each target video frame. The computer device may treat a target video frame having pixel information less than a first threshold as a black frame, which is typically used for transitions in video, and thus may treat the black frame as a sliced video frame.
For example, the first pixel value of a pixel point may be represented by (R, G, B), and R, G and B represent the values of the pixel's color on the three color channels of red, green, and blue, respectively. The counting of the respective first pixel values may be an average value counted respectively over three color channels. The first threshold may include sub-thresholds corresponding to the three color channels, and the target video frames whose average values on the three color channels are smaller than the corresponding sub-thresholds may be regarded as black frames.
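A minimal sketch of the black frame test described above is given below; the per-channel threshold value of 20 is an arbitrary placeholder for the first threshold.

import numpy as np

def is_black_frame(frame_rgb: np.ndarray, threshold: int = 20) -> bool:
    """Treat a frame as a black frame when the mean of each RGB channel is
    below the corresponding (sub-)threshold."""
    channel_means = frame_rgb.reshape(-1, 3).mean(axis=0)  # mean per R, G, B channel
    return bool((channel_means < threshold).all())

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # an all-black frame
print(is_black_frame(frame))  # True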
In the above embodiment, each first pixel value corresponding to the same target video frame is counted to obtain pixel information corresponding to each target video frame, the target video frame with the pixel information smaller than the first threshold is taken as a segmentation video frame, a black frame for transition in a video can be taken as a segmentation video frame, video segmentation is performed based on the black frame, and video segments corresponding to different scenes can be segmented.
In one embodiment, determining a split video frame from each target video frame based on pixel information corresponding to each target video frame of a target video includes:
acquiring a second pixel value of each pixel point in the same target video frame in a second color space to obtain pixel information corresponding to each target video frame; calculating pixel change information between the adjacent target video frames based on second pixel values corresponding to the matched pixel points in the adjacent target video frames; and determining the split video frame from the adjacent target video frames with the pixel change information larger than the second threshold value.
Wherein, the second color space is HSV color space. The second threshold value can be set according to actual needs. The adjacent target video frames refer to two adjacent target video frames, and the matched pixel points in the adjacent target video frames refer to pixel points at the same position in the two target video frames, for example, the central pixel points of the two video frames can be regarded as matched pixel points.
Specifically, when performing video slicing, the computer device may take a scene change frame in the target video as a sliced video frame. Whether scene switching occurs can be judged through the variable quantity between adjacent frames in the HSV color space, and then the scene switching frame is determined. Compared with an RGB color space, the HSV color space can separate color change from intensity change, in scene switching detection, the color change usually implies that a scene is changed, and the intensity change usually is influenced by factors such as illumination and does not usually represent that the scene is changed. Therefore, the scene change frame in the video can be accurately found based on the amount of change between adjacent frames in the HSV color space.
The computer device may obtain the second pixel value of each pixel point in the same target video frame in the second color space as the pixel information corresponding to that target video frame, thereby obtaining the pixel information corresponding to each target video frame. Further, the computer device may calculate second pixel value differences based on the second pixel values of the matched pixel points in two adjacent target video frames, obtaining a plurality of second pixel value differences, and compute the pixel change information between the adjacent frames from these differences; for example, the average of the second pixel value differences, or their sum, may be used as the pixel change information. The pixel change information can be calculated for all pairs of adjacent target video frames in the target video, and the scene change frames are then determined from the target video frames based on the pixel change information and used as split video frames. Specifically, a split video frame may be determined from a pair of adjacent target video frames whose pixel change information is greater than the second threshold. For example, if video frame A and video frame B are adjacent target video frames and the pixel change information between them is greater than the second threshold, either of them may be used as the split video frame, and of the two video segments obtained by splitting at that frame, one contains video frame A and the other contains video frame B.
For example, suppose video frame A and video frame B each contain M*N pixel points and are two adjacent video frames. The HSV pixel difference is calculated between the first pixel point of the first row of video frame A and the first pixel point of the first row of video frame B, then between the second pixel points of the first rows, then between the third pixel points, and so on, until the HSV pixel differences between all matched pixel points of video frame A and video frame B have been calculated, yielding M*N second pixel value differences; the sum of all second pixel value differences is taken as the pixel change information. If the pixel change information is greater than the second threshold, it is determined that a scene change occurs, and either of video frame A and video frame B is used as a split video frame.
It can be understood that the second pixel value difference between all matched pixel points in the adjacent target video frame may be calculated, and the pixel change information may be obtained based on each second pixel value difference. The second pixel value difference between partially matched pixels in adjacent target video frames may also be calculated, and pixel change information is obtained based on each second pixel value difference, for example, the target video frame is divided into a plurality of image regions, and at least one pixel is selected from the pixels covered by each image region to calculate the second pixel value difference.
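The full-frame variant of this scene change test (summing the HSV differences over all matched pixel points, as in the video frame A / video frame B example) might look like the sketch below; OpenCV is assumed to be available for the color space conversion, and the threshold value is an application-specific assumption.

import numpy as np
import cv2  # OpenCV, assumed available for the RGB-to-HSV conversion

def scene_change(frame_a_rgb, frame_b_rgb, threshold: float) -> bool:
    """Sum the absolute HSV differences over all matched pixel points of two
    adjacent frames; a sum above the second threshold counts as a scene change."""
    hsv_a = cv2.cvtColor(frame_a_rgb, cv2.COLOR_RGB2HSV).astype(np.int32)
    hsv_b = cv2.cvtColor(frame_b_rgb, cv2.COLOR_RGB2HSV).astype(np.int32)
    pixel_change = np.abs(hsv_a - hsv_b).sum()
    return pixel_change > threshold

frame_a = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
frame_b = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
print(scene_change(frame_a, frame_b, threshold=5_000_000))  # threshold is illustrative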
In the above embodiment, the second pixel value of each pixel point in the same target video frame in the second color space is obtained, yielding the pixel information corresponding to each target video frame, and the pixel change information between adjacent target video frames is calculated based on the second pixel values corresponding to matched pixel points in the adjacent target video frames. Adjacent target video frames whose pixel change information is greater than the second threshold can be regarded as belonging to video segments of different scenes, so the split video frame is determined from such adjacent target video frames, and the video can be split at the split video frame into segments corresponding to different scenes.
Referring to fig. 3, when performing video slicing, the computer device may refer to two kinds of information at the same time: black frame information, i.e., black frames in the video, and scene change information, i.e., scene switching frames in the video. The computer device can perform video black frame detection on the target video to obtain the black frames in the target video. For video black frame detection, each target video frame is extracted from the target video, the average value of the RGB pixels of each frame is calculated and compared with a first threshold, and if the average RGB pixel value is smaller than the first threshold, the frame is regarded as a black frame. Black frames are typically used for transitions in video, so the detected black frames can characterize the transition points of video segments. The computer device can also perform scene switching detection on the target video to obtain the scene switching frames in the target video. For scene switching detection, each target video frame is first extracted from the target video, each frame is converted from the RGB color space to the HSV color space, and the variation between adjacent frames is then calculated in the HSV color space; if the variation is greater than a second threshold, a scene switch is considered to have occurred and a scene switching frame is determined. By integrating the detection results of video black frame detection and scene switching detection, a plurality of split video frames can be determined from the complete video, and the complete video can be split into a plurality of video segments according to the split video frames.
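As a concrete illustration of this slicing procedure (not part of the claimed method), the Python sketch below combines black frame detection on the mean RGB value with HSV-based scene switching detection; OpenCV is assumed for the color conversion, and both threshold values are illustrative assumptions.

```python
# Illustrative sketch of the slicing logic in fig. 3; thresholds are assumptions.
import cv2
import numpy as np

FIRST_THRESHOLD = 10.0        # assumed mean-RGB value below which a frame is "black"
SECOND_THRESHOLD = 1_000_000  # assumed HSV change above which a scene switch occurs

def is_black_frame(frame_bgr: np.ndarray) -> bool:
    return float(frame_bgr.mean()) < FIRST_THRESHOLD

def is_scene_switch(frame_a_bgr: np.ndarray, frame_b_bgr: np.ndarray) -> bool:
    hsv_a = cv2.cvtColor(frame_a_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
    hsv_b = cv2.cvtColor(frame_b_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
    # sum of second pixel value differences over all matched pixel points
    return float(np.abs(hsv_a - hsv_b).sum()) > SECOND_THRESHOLD

def slice_video(frames: list) -> list:
    """Cut a list of decoded target video frames into video segments."""
    points = {i for i, f in enumerate(frames) if is_black_frame(f)}
    points |= {i + 1 for i in range(len(frames) - 1)
               if is_scene_switch(frames[i], frames[i + 1])}
    segments, start = [], 0
    for p in sorted(points):
        if p > start:
            segments.append(frames[start:p])
            start = p
    segments.append(frames[start:])
    return segments
```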
In one embodiment, as shown in fig. 4, acquiring a fusion video feature corresponding to each to-be-matched video segment includes:
step S402, respectively extracting the characteristics of each video clip to be matched to obtain an initial video frame characteristic sequence corresponding to each video clip to be matched; the initial video frame feature sequence is obtained by sequencing initial video frame features respectively corresponding to each video frame in the same video clip to be matched according to the video frame time stamp, wherein the initial video frame features comprise video frame sub-features respectively corresponding to a plurality of feature channels.
Specifically, when generating the fusion video features corresponding to the video segments to be matched, the computer device may first perform feature extraction on the video segments to be matched to obtain an initial video frame feature sequence corresponding to the video segments to be matched, then perform feature shift on the video frame sub-features corresponding to the target feature channels in the initial video frame feature sequence to obtain an intermediate video frame feature sequence, then perform two-dimensional convolution processing on the intermediate video frame feature sequence to obtain a target video frame feature sequence, and finally perform feature fusion on the target video frame feature sequence to obtain the fusion video features.
The initial video frame feature sequence includes the initial video frame features corresponding to all video frames in the same video segment to be matched, sorted according to the video frame timestamps. For example, the video segment to be matched includes four video frames, and sorting the four video frames by video frame timestamp gives: video frame A - video frame B - video frame C - video frame D. Feature extraction is performed on the video segment to be matched to obtain initial video frame feature a corresponding to video frame A, initial video frame feature b corresponding to video frame B, initial video frame feature c corresponding to video frame C, and initial video frame feature d corresponding to video frame D; sorting these initial video frame features according to the video frame timestamps gives the initial video frame feature sequence, specifically initial video frame feature a - initial video frame feature b - initial video frame feature c - initial video frame feature d.
The initial video frame feature corresponding to one video frame includes video frame sub-features corresponding to a plurality of feature channels, and the video frame sub-features corresponding to different feature channels represent features of different information extracted from the image. For example, the initial video frame feature may be represented as C × H × W, indicating that there are C feature channels, each feature channel corresponding to a feature map of size H × W, where H × W indicates the size of the video frame sub-feature. A video frame sub-feature may therefore also be regarded as a feature map.
The computer device may perform feature extraction based on a custom algorithm or formula. The computer device may also perform feature extraction on each video segment to be matched through a machine learning algorithm; for example, the video segment to be matched is input into a convolutional neural network, feature extraction is performed by the convolutional neural network, and the output data of the convolutional neural network is used as the initial video frame feature sequence. Through data processing by the computer device, the initial video frame feature sequences respectively corresponding to the video segments to be matched are obtained. The initial video frame features in the initial video frame feature sequence mainly represent the content information of each video frame.
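For illustration, the sketch below extracts an initial video frame feature sequence with an off-the-shelf 2D convolutional backbone; the choice of a ResNet-50 trunk and the input resolution are assumptions, since the embodiment only requires some convolutional neural network.

```python
# Illustrative sketch (assumed architecture): per-frame feature extraction.
# Requires a recent torchvision (weights=None instead of pretrained=False).
import torch
import torchvision.models as models

backbone = torch.nn.Sequential(
    *list(models.resnet50(weights=None).children())[:-2]  # drop pooling + classifier
)
backbone.eval()

@torch.no_grad()
def initial_frame_feature_sequence(segment: torch.Tensor) -> torch.Tensor:
    """segment: (T, 3, 224, 224) frames sorted by timestamp -> (T, C, H, W) features."""
    return backbone(segment)   # e.g. (T, 2048, 7, 7) for a ResNet-50 trunk
```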
Step S404, in the initial video frame feature sequence corresponding to the same video segment to be matched, feature shifting is carried out on the video frame sub-features corresponding to the target feature channel based on the sequencing information of the initial video frame features, and an intermediate video frame feature sequence corresponding to each video segment to be matched is obtained.
Feature shifting refers to moving the video frame sub-features corresponding to the target feature channel in the initial video frame features, thereby changing the video frame sub-features corresponding to the target feature channel in the initial video frame features. The target feature channel may include at least one feature channel and may be set according to actual needs; for example, if the video frame feature includes video frame sub-features corresponding to eight feature channels, two of the eight feature channels may be selected as the target feature channels.
The ordering information of the initial video frame features refers to the arrangement order of the initial video frame features in the initial video frame feature sequence. The ordering information of the initial video frame characteristics may also be considered as a temporal ordering of the video frame timestamps of the respective initial video frames.
Specifically, for any one initial video frame feature sequence, the computer device may perform feature shifting on the video frame sub-features corresponding to the target feature channel based on the ordering information of the initial video frame features to obtain intermediate video frame features corresponding to each video frame, and the intermediate video frame features corresponding to the video frames constitute the intermediate video frame feature sequence corresponding to the video segment to be matched. The intermediate video frame feature of one video frame includes the video frame sub-feature on the target feature channel after the feature shift and the original video frame sub-features on the other feature channels. The intermediate video frame features in the intermediate video frame feature sequence may be ordered or unordered. Through data processing by the computer device, the intermediate video frame feature sequences respectively corresponding to the video segments to be matched are obtained.
It can be understood that the shift direction of the feature shift may be forward along the time sequence or backward along the time sequence. If there are multiple target feature channels, the shift directions corresponding to the target feature channels may be the same or different. The shift distance of the feature shift may be at least one time unit. For example, the video frame sub-feature corresponding to the target feature channel in the initial video frame feature of the current video frame may be used as the video frame sub-feature corresponding to the target feature channel in the intermediate video frame feature of the next video frame; alternatively, the video frame sub-feature corresponding to the target feature channel in the initial video frame feature of the current video frame may be used as the video frame sub-feature corresponding to the target feature channel in the intermediate video frame feature of the previous video frame.
And step S406, respectively performing two-dimensional convolution processing on each intermediate video frame feature sequence to obtain a target video frame feature sequence corresponding to each video segment to be matched.
Two-dimensional convolution processing refers to convolution processing performed within the video frame features corresponding to the same video frame. The specific process of the two-dimensional convolution processing may follow any existing 2D convolution: for an intermediate video frame, a convolution kernel slides over the feature map corresponding to each feature channel, each value on the feature map is multiplied by the corresponding value of the convolution kernel, the products are summed to give the value of the output feature map at the position corresponding to the center of the convolution kernel, and after the kernel has slid over the entire feature map the target video frame feature is obtained.
Specifically, for any one intermediate video frame feature sequence, the computer device may perform two-dimensional convolution processing on each intermediate video frame feature in the sequence to obtain the target video frame feature corresponding to each video frame, and the target video frame features corresponding to the video frames constitute the target video frame feature sequence corresponding to the video segment to be matched. The target video frame feature sequence may be obtained by sorting the target video frame features according to the video frame timestamps. Through data processing by the computer device, the target video frame feature sequences respectively corresponding to the video segments to be matched can be obtained.
It can be understood that conventional two-dimensional convolution processing only uses information of the current frame. In the present application, however, the intermediate video frame feature sequence is obtained through feature shifting, so the intermediate video frame feature corresponding to one video frame includes not only information of the current frame but also information of other frames. Performing two-dimensional convolution processing on the intermediate video frame feature sequence therefore takes information of different frames into account; the resulting target video frame feature sequence fuses information in the time dimension and reflects the interrelation and content relevance among video frames, so it can better represent the content of the video segment.
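The following sketch illustrates this step under the assumption that the per-frame features have 2048 channels: an ordinary 2D convolution is applied to each intermediate video frame feature, and because some channels were shifted in time beforehand, the per-frame convolution already mixes information from neighbouring frames without resorting to 3D convolution.

```python
# Illustrative sketch (layer sizes are assumptions): 2D convolution applied
# frame by frame to the shifted (intermediate) feature sequence.
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=2048, out_channels=2048, kernel_size=3, padding=1)

def target_frame_feature_sequence(intermediate: torch.Tensor) -> torch.Tensor:
    """intermediate: (T, C, H, W) shifted features -> (T, C, H, W) target features."""
    return conv2d(intermediate)  # Conv2d treats the time dimension T as the batch dimension
```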
And step S408, respectively performing feature fusion on each target video frame feature sequence to obtain the fusion video feature corresponding to each video segment to be matched.
Specifically, for any one target video frame feature sequence, the computer device may perform feature fusion on the target video frame feature sequence to obtain the fusion video feature corresponding to the video segment to be matched. For example, the target video frame features in the target video frame feature sequence may be weighted and summed to obtain the fusion video feature. The weights corresponding to the target video frame features may be the same or different. Through data processing by the computer device, the fusion video features corresponding to all the video segments to be matched can be obtained.
In the above embodiment, feature extraction is performed on a video segment to be matched to obtain an initial video frame feature sequence; feature shifting is performed on the video frame sub-features corresponding to the target feature channel in the initial video frame feature sequence, so that information is exchanged between different video frames, yielding an intermediate video frame feature sequence; and two-dimensional convolution processing is performed on the intermediate video frame feature sequence to fuse information between different video frames, yielding a target video frame feature sequence. In this way, the effect of three-dimensional convolution processing can be achieved through feature shifting and two-dimensional convolution processing, with a smaller amount of calculation and lower computational complexity than three-dimensional convolution processing. Feature fusion is then performed on the target video frame feature sequence, so that a more accurate fusion video feature can be obtained.
In one embodiment, in an initial video frame feature sequence corresponding to the same video segment to be matched, performing feature shift on video frame sub-features corresponding to a target feature channel based on ordering information of the initial video frame features to obtain an intermediate video frame feature sequence corresponding to each video segment to be matched, includes:
in the initial video frame feature sequence corresponding to the current video clip to be matched, taking the video frame sub-feature corresponding to the target feature channel in each initial video frame feature as a target sub-feature; for the initial video frame features, updating the target sub-features corresponding to adjacent video frames based on the target sub-features corresponding to the current video frame to obtain the intermediate video frame features corresponding to each video frame of the current video clip to be matched; and sorting the intermediate video frame features according to the video frame timestamps to obtain the intermediate video frame feature sequence corresponding to the current video clip to be matched.
The current video segment to be matched refers to a currently used video segment to be matched, and each video segment to be matched corresponding to the real-time barrage can be sequentially used as the current video segment to be matched. The current video frame refers to a currently used video frame, and each video frame in the current video clip to be matched can be sequentially used as the current video frame. The adjacent video frame includes at least one of a forward video frame and a backward video frame of the current video frame.
Specifically, when feature shifting is performed, the computer device may generate the intermediate video frame feature by updating the target sub-feature in the initial video frame features, with the video frame sub-feature corresponding to the target feature channel in each initial video frame feature as the target sub-feature. On the basis of the initial video frame characteristics, the computer equipment can update the target sub-characteristics corresponding to the adjacent video frames based on the target sub-characteristics corresponding to the current video frames, replace the original target sub-characteristics of the adjacent video frames with the target sub-characteristics corresponding to the current video frames, and keep the video frame sub-characteristics corresponding to other characteristic channels unchanged, so that the intermediate video frame characteristics corresponding to the adjacent video frames are obtained. After each video frame in the current video clip to be matched is sequentially used as the current video frame for target sub-feature updating, the intermediate video frame feature corresponding to each video frame of the current video clip to be matched can be obtained. Furthermore, the computer device can sequence the characteristics of each intermediate video frame according to the video frame time stamp to obtain an intermediate video frame characteristic sequence corresponding to the current video clip to be matched.
For example, the video segment to be matched includes four video frames, in order video frame A, video frame B, video frame C, and video frame D. For the initial video frame features, the target sub-features corresponding to video frame B may be updated based on the target sub-features corresponding to video frame A, the target sub-features corresponding to video frame C may be updated based on the target sub-features corresponding to video frame B, the target sub-features corresponding to video frame D may be updated based on the target sub-features corresponding to video frame C, and the target sub-features corresponding to video frame A may be updated based on the target sub-features corresponding to video frame B, while the video frame sub-features corresponding to the other feature channels remain unchanged, so as to obtain the intermediate video frame features respectively corresponding to video frame A, video frame B, video frame C, and video frame D.
In the above embodiment, because the content relevance between adjacent video frames is strong and the video frames are coherent, updating the target sub-features corresponding to adjacent video frames based on the target sub-features corresponding to the current video frame during feature shifting allows context interaction and context fusion to take place during the subsequent two-dimensional convolution processing, thereby improving the modeling capability in the time dimension.
In one embodiment, updating the target sub-feature corresponding to the neighboring video frame based on the target sub-feature corresponding to the current video frame includes:
and updating the target sub-feature corresponding to the next video frame based on the target sub-feature corresponding to the current video frame, and configuring the target sub-feature corresponding to the starting video frame as a preset sub-feature.
The starting video frame refers to a video frame ranked first in a current video clip to be matched. The preset sub-feature is a preset video frame sub-feature, and is fixed data, for example, the preset sub-feature may be set to zero.
Specifically, when performing the target sub-feature update, the computer device may update the target sub-feature corresponding to the next video frame based on the target sub-feature corresponding to the current video frame, that is, in the initial video frame feature sequence, the target sub-feature is moved by one time unit along the direction of increasing video frame timestamp. Since the starting video frame in the video segment has no forward video frame, the computer device may configure the target sub-feature corresponding to the starting video frame as the preset sub-feature.
Referring to fig. 5A, a in fig. 5A denotes the initial video frame feature sequence, and b denotes the intermediate video frame feature sequence. In the initial video frame feature sequence, a row of small cubes represents the initial video frame feature corresponding to one video frame, and one small cube represents one video frame sub-feature. The initial video frame features are ordered from small to large according to the video frame timestamp, and an initial video frame feature can be represented as C × H × W. In fig. 5A, the initial video frame features include video frame sub-features corresponding to six feature channels, and two of the feature channels are used as target feature channels. In the initial video frame feature sequence, the video frame sub-features (i.e., the target sub-features) corresponding to the target feature channels are moved by one time unit in the direction of increasing video frame timestamp, that is, the target sub-features corresponding to the current video frame are used as the target sub-features corresponding to the next video frame. The vacant positions after the shift are filled with zeros, that is, the target sub-features corresponding to the starting video frame are configured as the preset sub-features, thereby obtaining the intermediate video frame feature sequence.
In the above embodiment, the target sub-feature corresponding to the next video frame is updated based on the target sub-feature corresponding to the current video frame, the past frame and the current frame may be blended, and the target sub-feature corresponding to the starting video frame is configured as the preset sub-feature, so that the intermediate video frame feature is quickly obtained.
In one embodiment, the target sub-features include a first sub-feature corresponding to a first feature channel in the target feature channel and a second sub-feature corresponding to other feature channels in the target feature channel.
Updating the target sub-features corresponding to the adjacent video frames based on the target sub-features corresponding to the current video frames, including:
updating the first sub-feature corresponding to the next video frame based on the first sub-feature corresponding to the current video frame, and configuring the first sub-feature corresponding to the starting video frame as a preset sub-feature; and updating the second sub-feature corresponding to the previous video frame based on the second sub-feature corresponding to the current video frame, and configuring the second sub-feature corresponding to the ending video frame as a preset sub-feature.
The first characteristic channel may be set as needed, for example, any one of the target characteristic channels is used as the first characteristic channel. If there are at least two target feature channels, the target sub-features may include a first sub-feature corresponding to a first feature channel in the target feature channels and a second sub-feature corresponding to another feature channel in the target feature channels. The ending video frame refers to the video frame sequenced at the last in the current video clip to be matched.
Specifically, when the target sub-feature is updated, the computer device may update the first sub-feature corresponding to the next video frame based on the first sub-feature corresponding to the current video frame, update the second sub-feature corresponding to the previous video frame based on the second sub-feature corresponding to the current video frame, configure the first sub-feature corresponding to the starting video frame as the preset sub-feature, configure the second sub-feature corresponding to the ending video frame as the preset sub-feature, and discard the portions shifted out of the sequence. That is, in the initial video frame feature sequence, the first sub-features are shifted by one time unit in the direction of increasing video frame timestamp and the second sub-features are shifted by one time unit in the direction of decreasing video frame timestamp. Since the starting video frame of the video segment has no forward video frame, the computer device may configure the first sub-feature corresponding to the starting video frame as the preset sub-feature, and since the ending video frame of the video segment has no backward video frame, the computer device may configure the second sub-feature corresponding to the ending video frame as the preset sub-feature.
Referring to fig. 5B, a in fig. 5B denotes the initial video frame feature sequence, and b denotes the intermediate video frame feature sequence. In the initial video frame feature sequence, a row of small cubes represents the initial video frame feature corresponding to one video frame, and one small cube represents one video frame sub-feature. The initial video frame features are ordered from small to large according to the video frame timestamp, and an initial video frame feature can be represented as C × H × W. In fig. 5B, the initial video frame features include video frame sub-features corresponding to six feature channels, among which the first and second feature channels are used as the target feature channels; the first of these serves as the first feature channel in the target feature channels, and the second serves as the other feature channel in the target feature channels. In the initial video frame feature sequence, the video frame sub-features corresponding to the first feature channel (i.e., the first sub-features) are moved by one time unit in the direction of increasing video frame timestamp, that is, the first sub-feature corresponding to the current video frame is used as the first sub-feature corresponding to the next video frame; the vacant positions after the shift are filled with zeros, that is, the first sub-feature corresponding to the starting video frame is configured as the preset sub-feature, and the portion shifted out of the sequence is discarded. The video frame sub-features corresponding to the other feature channel in the target feature channels (i.e., the second sub-features) are moved by one time unit in the direction of decreasing video frame timestamp, that is, the second sub-feature corresponding to the current video frame is used as the second sub-feature corresponding to the previous video frame; the vacant positions after the shift are filled with zeros, that is, the second sub-feature corresponding to the ending video frame is configured as the preset sub-feature, and the portion shifted out of the sequence is discarded.
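A minimal Python sketch of this bidirectional shift is given below; the number of shifted channels per direction is an assumption, and zero is used as the preset sub-feature as in the figure description.

```python
# Illustrative sketch of the bidirectional feature shift in fig. 5B.
import torch

def temporal_shift(features: torch.Tensor, shift_channels: int) -> torch.Tensor:
    """features: (T, C, H, W) initial sequence -> (T, C, H, W) intermediate sequence."""
    out = features.clone()
    c = shift_channels
    # first sub-features: each frame receives the sub-feature of the previous frame
    out[1:, :c] = features[:-1, :c]
    out[0, :c] = 0                       # starting video frame -> preset sub-feature
    # second sub-features: each frame receives the sub-feature of the next frame
    out[:-1, c:2 * c] = features[1:, c:2 * c]
    out[-1, c:2 * c] = 0                 # ending video frame -> preset sub-feature
    return out
```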
In the above embodiment, the first sub-feature corresponding to the next video frame is updated based on the first sub-feature corresponding to the current video frame, the second sub-feature corresponding to the previous video frame is updated based on the second sub-feature corresponding to the current video frame, the past frame and the future frame may be blended with the current frame, the first sub-feature corresponding to the starting video frame is configured as the preset sub-feature, and the second sub-feature corresponding to the ending video frame is configured as the preset sub-feature, so that the intermediate video frame feature is obtained quickly.
In one embodiment, referring to fig. 6, feature fusion includes the steps of:
step S602, obtaining a plurality of clustering center features; each cluster center feature corresponds to a different video frame topic.
The video frame theme refers to the central idea and main content expressed by the image information of one video frame. For example, the information represented by one video segment is an action, and each video frame in the video segment can respectively express the subject information such as each action detail, action trigger object, auxiliary prop and the like which form the action. For example, the information represented by one video clip is a "shooting" action, each video frame can respectively express detail information such as a "basket", "ball control", "jumping", "shooting", and the like, and each video frame has a corresponding video frame theme.
The cluster center feature is obtained by carrying out cluster analysis on the video frame features of a large number of video frames. Different cluster center features may correspond to different video frame topics. For example, the computer device may obtain video frame features of a large number of video frames, perform feature clustering on each video frame feature to obtain a plurality of clustering centers, where each clustering center corresponds to one clustering center feature, and each video frame feature belongs to a closest clustering center, and video frame features belonging to the same clustering center have great similarity and represent the same video frame theme, while video frame features belonging to different clustering centers have great difference and represent different video frame themes. The feature clustering can adopt various clustering algorithms and can also adopt a machine learning algorithm.
In one embodiment, the cluster center features may be learned end-to-end along with other parameters of the model as parameters that the machine learning model can learn. For example, a video classification model may be trained based on training samples, the training samples being video segments of known real classification results, the video classification model including a feature extraction layer, a feature fusion layer, and a feature classification layer. During model training, inputting a video clip into a video classification model, extracting video frame characteristics of each video frame in the video clip through a characteristic extraction layer, fusing each video frame characteristic into a video characteristic through a characteristic fusion layer, outputting a prediction classification result through a characteristic classification layer based on the video characteristics, generating training loss based on a real classification result and the prediction classification result, adjusting model parameters based on the training loss until a convergence condition is met, and indicating that training is finished. The convergence condition may be at least one of a training loss smaller than a preset loss, a number of iterations larger than a preset number, and the like. The feature fusion layer fuses the features of all the video frames into video features based on the clustering center features. The clustering center features are parameters needing to be learned in the model training process, and after model training is finished, the semantic information of the video segments can be accurately represented by the video features obtained by fusing the finally learned clustering center features with the video frame features, so that accurate video classification results can be obtained through the feature classification layer.
Specifically, when feature fusion is performed, video frame features representing local features of the video segments may be aggregated according to the cluster center features, and a fusion video feature representing a global feature of the video segment may be generated according to a distance from each video frame feature to the cluster center feature to which the video frame feature belongs. For example, the distances from the features of each video frame to the feature of the cluster center to which the feature belongs may be averaged and fused to obtain a fused video feature, that is, semantic contributions of each video frame to video segments are the same. The distance between each video frame feature and the cluster center feature to which the video frame feature belongs can be fused in a non-average manner to obtain a fused video feature, that is, semantic contributions of each video frame to video segments are different. For example, the larger the distance from each video frame feature to the cluster center feature to which it belongs, the smaller the weight in performing feature fusion.
Step S604, aiming at the target video frame feature sequence corresponding to the current video clip to be matched, determining target center features from all the clustering center features based on the target video frame features corresponding to the same video frame and the distance between all the clustering center features, and obtaining the target center features respectively corresponding to all the video frames of the current video clip to be matched.
Step S606, based on the distance between the target video frame feature and the target center feature corresponding to the same video frame, the target feature distance corresponding to each video frame of the current video clip to be matched is obtained.
Specifically, for any one to-be-matched video segment, the computer device may obtain each clustering center feature, calculate a distance between a target video frame feature corresponding to any one video frame in the to-be-matched video segment and each clustering center feature, and use the clustering center feature with the smallest distance as the target center feature corresponding to the video frame to obtain the target center feature corresponding to each video frame in the to-be-matched video segment. Namely, the clustering center to which each video frame in the video segment to be matched belongs is determined.
Furthermore, the computer device can calculate the distance between the target video frame feature corresponding to each video frame and the target central feature corresponding to each video frame, so as to obtain the target feature distance corresponding to each video frame in the video segment to be matched. The target feature distance refers to the distance from the video frame feature to the cluster center feature to which the video frame feature belongs.
Step S608, performing attention distribution on the features of each target video frame corresponding to the current video segment to be matched, to obtain attention weights corresponding to each video frame of the current video segment to be matched.
And step S610, fusing the distances of the target features based on the attention weight to obtain fused video features corresponding to the current video clip to be matched.
Wherein, attention allocation means allocating attention weights of different degrees to different target video frame features to distinguish important features from non-important features. The attention weight is used for representing the importance degree and semantic contribution degree of a certain video frame in the video clip to the whole video clip.
Specifically, in order to improve the accuracy of the fusion video feature, the computer device may perform non-average fusion on the target feature distances. The computer device can perform attention distribution on the target video frame features corresponding to the video segment to be matched, so as to distinguish important video frame features from non-important video frame features and obtain the attention weights corresponding to the video frames of the video segment to be matched. The computer device can then fuse the target feature distances based on the attention weights, specifically by performing a weighted summation of the target feature distances using the attention weights, to obtain the fusion video feature corresponding to the video segment to be matched.
In one embodiment, the attention distribution may also be learned end-to-end as parameters of the machine learning model, along with the other model parameters. For example, the feature fusion layer fuses the features of the video frames into the video feature based on the cluster center features and the attention weights. The cluster center features and the attention distribution parameters are parameters to be learned during model training; after model training is finished, the target feature distance corresponding to each video frame is determined based on the finally learned cluster center features, the attention weight corresponding to each video frame is determined based on the finally learned attention distribution parameters, and the fusion video feature is then obtained by fusing the target feature distances based on the attention weights. In the feature fusion layer, attention distribution may specifically be performed by a fully connected layer followed by a Softmax layer.
In the above embodiment, feature fusion is performed on each target video frame feature in the target video frame feature sequence based on the clustering center feature and the attention weight, so that a fused video feature with higher accuracy can be obtained.
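As an illustration of this fusion step, the sketch below assumes a NetVLAD-style aggregation over spatially pooled target video frame features of dimension D: learnable cluster center features, residuals to the nearest center as the target feature distances, and attention weights produced by a fully connected layer followed by Softmax. All dimensions are assumptions, and the hard nearest-center assignment is a simplification for clarity.

```python
# Illustrative sketch (assumed architecture): attention-weighted fusion of
# target feature distances relative to learnable cluster center features.
import torch
import torch.nn as nn

class AttentiveClusterFusion(nn.Module):
    def __init__(self, feature_dim: int = 2048, num_clusters: int = 64):
        super().__init__()
        # cluster center features, learned end-to-end with the rest of the model
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.attention = nn.Linear(feature_dim, 1)   # attention distribution layer

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        """frame_features: (T, D) pooled target video frame features -> (D,) fused feature."""
        # assign each frame to its nearest cluster center (target center feature)
        dists = torch.cdist(frame_features, self.centers)               # (T, K)
        target_centers = self.centers[dists.argmin(dim=1)]              # (T, D)
        # target feature distance: residual from the frame feature to its center
        residuals = frame_features - target_centers                     # (T, D)
        # attention weights over the frames (fully connected layer + Softmax)
        weights = torch.softmax(self.attention(frame_features), dim=0)  # (T, 1)
        # weighted fusion of the target feature distances
        return (weights * residuals).sum(dim=0)                         # (D,)
```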
In one embodiment, the generation process of the fused video feature comprises the following steps:
inputting a current video clip to be matched into a video feature extraction model; the video feature extraction model comprises a first feature extraction layer, a second feature extraction layer and a feature fusion layer; performing feature extraction on the current video clip to be matched through a first feature extraction layer to obtain an initial video frame feature sequence corresponding to the current video clip to be matched; performing characteristic shift on video frame sub-characteristics corresponding to a target characteristic channel in an initial video frame characteristic sequence through a second characteristic extraction layer based on the sequencing information of the initial video frame characteristics to obtain an intermediate video frame characteristic sequence corresponding to the current video clip to be matched; performing two-dimensional convolution processing on the intermediate video frame feature sequence through a second feature extraction layer to obtain a target video frame feature sequence corresponding to the current video clip to be matched; and performing feature fusion on the target video frame feature sequence through the feature fusion layer to obtain fusion video features corresponding to the current video segment to be matched.
Specifically, the computer device may obtain, by means of the machine learning model, a fused video feature corresponding to the video segment to be matched. The computer device may specifically extract video features of the video segment based on a video feature extraction model, the video feature extraction model including a first feature extraction layer, a second feature extraction layer, and a feature fusion layer. The first feature extraction layer is used for extracting features of each video frame in the video clips of the input model to obtain initial video frame features corresponding to each video frame and form an initial video frame feature sequence. The second feature extraction layer is used for obtaining an initial video frame feature sequence output by the first feature extraction layer, performing feature shift on video frame sub-features corresponding to a target feature channel in the initial video frame feature sequence based on the sequencing information of the initial video frame features to obtain an intermediate video frame feature sequence, and performing two-dimensional convolution processing on the intermediate video frame feature sequence to obtain a target video frame feature sequence. The feature fusion layer is used for acquiring a target video frame feature sequence output by the second feature extraction layer, and performing feature fusion on the target video frame feature sequence to obtain fusion video features.
It is understood that the specific processes of data processing such as feature shifting, feature fusion and the like can refer to the contents of the foregoing respective related embodiments.
In one embodiment, the video feature extraction model may be obtained in a supervised training manner. During model training, a feature classification layer can be added after an output layer of the video feature extraction model to be trained to obtain the video classification model to be trained. And carrying out supervised training on the video classification model based on the training sample carrying the training label to obtain the trained video classification model. And acquiring a network in front of the feature classification layer from the trained video classification model to serve as the trained video feature extraction model. The training samples are video clips, and the training labels represent real classification results of the video clips.
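A schematic of this supervised setup, with hypothetical layer sizes and class labels, might look as follows; feature_extractor stands for the first feature extraction layer, second feature extraction layer and feature fusion layer taken together.

```python
# Hypothetical sketch: training the video feature extraction model by appending
# a feature classification layer, then keeping only the network in front of it.
import torch.nn as nn

class VideoClassificationModel(nn.Module):
    def __init__(self, feature_extractor: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.feature_extractor = feature_extractor            # extraction + fusion layers
        self.classifier = nn.Linear(feature_dim, num_classes) # feature classification layer

    def forward(self, segment):
        fused = self.feature_extractor(segment)               # fused video feature
        return self.classifier(fused)                         # predicted classification result

# After supervised training, the trained video feature extraction model is
# recovered by discarding the classification layer:
#   video_feature_extractor = trained_model.feature_extractor
```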
In the above embodiment, the fused video features can be generated quickly by using the video feature extraction model.
In one embodiment, calculating the matching degree of the real-time barrage with each video clip to be matched respectively based on the initial text feature and the fusion video feature comprises:
inputting the real-time barrage into a text processing model to obtain initial text characteristics; inputting the initial text features into a text feature extraction network in a trained video text matching model to obtain target text features corresponding to the real-time bullet screen; inputting the fused video features into a video feature extraction network in a video text matching model to obtain target video features; and obtaining the matching degree of the real-time bullet screen and each video clip to be matched respectively based on the similarity between the target text characteristic and the target video characteristic.
Wherein the text processing model is a machine learning model for processing text. The video text matching model is a machine learning model for determining whether text and video segments match. The video text matching model comprises two branches, wherein one branch is a text feature extraction network and is used for inputting text features of the barrage, and the other branch is a video feature extraction network and is used for inputting video features of the video clips. In one embodiment, the text feature extraction network and the video feature extraction network may be comprised of multiple fully connected layers.
Specifically, the computer device can extract the text features of the bullet screen by means of the machine learning model, and specifically, the real-time bullet screen is input into the text processing model to obtain the initial text features corresponding to the real-time bullet screen. The computer equipment can calculate the matching degree by means of a machine learning model, specifically, the initial text features corresponding to the real-time bullet screen are input into a text feature extraction network in a trained video text matching model, the fusion video features corresponding to the video clip to be matched are input into a video feature extraction network in the video text matching model, and the target text features corresponding to the real-time bullet screen and the target video features corresponding to the video clip to be matched are obtained through data processing of the text feature extraction network and the video feature extraction network. Then, the computer device can calculate the matching degree of the real-time bullet screen and the video clip to be matched based on the similarity between the target text characteristic and the target video characteristic. The computer equipment can directly use the feature similarity between the target text feature and the target video feature as the matching degree, or can input the target text feature and the target video feature into a matching layer of a video text matching model, and the video text matching model outputs the matching degree of the real-time barrage and the video clip to be matched through data processing of the matching layer.
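The two-branch structure described here can be sketched as follows; the layer sizes, the 768-dimensional initial text feature (typical of a RoBERTa encoder) and the use of cosine similarity as the matching degree are assumptions.

```python
# Illustrative sketch (assumed dimensions): two-branch video text matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatchingModel(nn.Module):
    def __init__(self, text_dim: int = 768, video_dim: int = 2048, embed_dim: int = 256):
        super().__init__()
        self.text_net = nn.Sequential(            # text feature extraction network
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.video_net = nn.Sequential(           # video feature extraction network
            nn.Linear(video_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, initial_text_feature, fused_video_feature):
        target_text = self.text_net(initial_text_feature)     # target text feature
        target_video = self.video_net(fused_video_feature)    # target video feature
        # matching degree: similarity between the two target features
        return F.cosine_similarity(target_text, target_video, dim=-1)
```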
In one embodiment, the text processing model may employ a RoBERTa model. The RoBERTa model is a language model for processing various NLP (Natural Language Processing) tasks. RoBERTa adjusts some training strategies of the BERT model, e.g., using larger training batch sizes, longer training sequences, and dynamic masking. Dynamic masking means that mask processing is performed while the data is being loaded, whereas the BERT model masks the data in advance and directly loads the already-masked data. Referring to fig. 7, the training process of the RoBERTa model includes two phases: a pre-training phase and a fine-tuning phase. In the pre-training phase, the model can be trained through pre-training task one and pre-training task two. Pre-training task one randomly masks a part of the words in a sentence and then predicts the masked words using context information; for example, sentence A is masked by randomly covering some of its words, the masked sentence A is input into the BERT model, and the model is trained to understand the meaning of the masked words from the full text and predict them. Pre-training task two is a next-sentence prediction task, which mainly enables the model to better understand the relation between sentences, for example, predicting sentence B from sentence A. In the fine-tuning phase, the parameters of the BERT model are fine-tuned according to different learning tasks. For example, a question-answering task may serve as learning task one: a question and a text containing the answer are input into the BERT model, and the model is trained to find the position of the answer in that text by predicting the starting and ending positions of the answer. The learning tasks may also include a single-sentence classification task, a sentence-pair classification task, and the like. An accurate RoBERTa model is finally obtained through training in the pre-training and fine-tuning phases.
During model training, a word vector sequence (Token sequence) corresponding to a training sample is input into the BERT model; an initial feature vector (denoted E) is obtained through the input layer, a target feature vector (denoted T) is obtained through the subsequent fully connected layers, and the prediction result is determined based on the target feature vector. Taking pre-training task one as an example, the masked word vector sequences of sentence A and sentence B are input into the model: the masked word vector sequence of sentence A consists of Tok1 to TokN, and the masked word vector sequence of sentence B consists of Tok1 to TokM, where Tok is short for Token. A sentence start mark [CLS] is added before the word vector sequence of sentence A, and a sentence division mark [SEP] is added between the word vector sequences of sentence A and sentence B. The output vector corresponding to the sentence start mark [CLS] is used as the semantic representation of the whole training sample for text classification, and the sentence division mark [SEP] provides a feature vector for distinguishing the two sentences. After the word vector sequence is input into the model, the initial feature vectors are obtained through the input layer: E[CLS], E1 to EN, E[SEP], and E1' to EM', where E[CLS] is the initial feature vector corresponding to [CLS], E1 to EN are the initial feature vectors corresponding to Tok1 to TokN, E[SEP] is the initial feature vector corresponding to [SEP], and E1' to EM' are the initial feature vectors corresponding to Tok1 to TokM of sentence B. The initial feature vectors pass through the subsequent fully connected layers to obtain the target feature vectors C, T1 to TN, T[SEP], and T1' to TM', where C is the target feature vector corresponding to [CLS], T1 to TN are the target feature vectors corresponding to Tok1 to TokN, T[SEP] is the target feature vector corresponding to [SEP], and T1' to TM' are the target feature vectors corresponding to Tok1 to TokM of sentence B.
Of course, other language models and other text models can be used as the text processing model.
In the above embodiment, the matching degree between the bullet screen and the video clip is calculated by means of the text processing model and the trained video text matching model, so that the accuracy of the matching degree can be improved.
In one embodiment, the training process of the video text matching model comprises the following steps:
acquiring a training sample and a training label corresponding to the training sample; the training sample comprises a training video clip and a training bullet screen; inputting the fusion video features corresponding to the training video clips and the initial text features corresponding to the training barrage into a video text matching model to be trained to obtain a prediction label; the prediction label is obtained by extracting the similarity between the text feature output by the network and the video feature output by the network based on the text feature; and calculating training loss based on the training labels and the prediction labels, and adjusting model parameters of the video text matching model to be trained based on the training loss until a convergence condition is met to obtain the trained video text matching model.
Wherein, the training sample comprises a training video clip and a training barrage. The training labels can be two-class labels, if the training labels corresponding to the training samples are matching labels, the training video segments and the training barrage in the training samples are matched with each other, and if the training labels corresponding to the training samples are unmatched labels, the training video segments and the training barrage in the training samples are unmatched. It is understood that the training labels may also be matching probabilities, matching scores, etc. of the training video clips and the training barrage.
Specifically, the video text matching model can be trained in a supervised training mode. Before model training, the computer equipment can extract the fusion video features of the training video clips in the training samples and extract the initial text features corresponding to the training barrages in the training samples. Furthermore, during model training, the computer device may input the fusion video features corresponding to the training video segments and the initial text features corresponding to the training barrage into a video text matching model to be trained, specifically, input the fusion video features into a video feature extraction network, input the initial text features into the text feature extraction network, and output a prediction tag through data processing inside the model by the video text matching model. The computer equipment can calculate training loss based on the training label and the prediction label, carries out back propagation based on the training loss to adjust model parameters of the model to obtain an updated video text matching model, returns to the step of inputting the fused video features corresponding to the training video clip and the initial text features corresponding to the training barrage into the video text matching model for iterative execution, continues training until the convergence condition is met, and finishes training to obtain the trained video text matching model. The convergence condition may be at least one of the number of model iterations reaching a preset number, the training loss being less than a preset loss, and the like.
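A minimal training-step sketch under this setup is shown below, reusing the VideoTextMatchingModel sketched earlier; the binary matching label, the binary cross-entropy loss and the mapping of the cosine similarity to a score in [0, 1] are assumptions.

```python
# Illustrative training sketch (loss choice and optimizer settings are assumptions).
import torch
import torch.nn.functional as F

model = VideoTextMatchingModel()                       # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(fused_video_feature, initial_text_feature, label: torch.Tensor) -> float:
    """label: 1.0 where the training bullet screen matches the training video clip, else 0.0."""
    optimizer.zero_grad()
    matching_degree = model(initial_text_feature, fused_video_feature)
    # map the cosine similarity in [-1, 1] to a probability-like score in [0, 1]
    predicted = (matching_degree + 1.0) / 2.0
    loss = F.binary_cross_entropy(predicted, label)    # training loss
    loss.backward()                                    # back propagation
    optimizer.step()                                   # adjust model parameters
    return loss.item()
```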
In an embodiment, regarding to the application of the model, in order to improve the matching efficiency, the computer device may perform video segmentation on the target video in advance, extract the fused video features corresponding to each video segment obtained by the segmentation, input the fused video features corresponding to each video segment into the trained video text matching model respectively, obtain the target video features corresponding to each video segment, and store each target video feature. Subsequently, if the computer device receives the real-time bullet screen corresponding to the target video, the computer device only needs to use the video text matching model once. And extracting the initial text features corresponding to the real-time bullet screen by the computer equipment, and inputting the initial text features into the trained video text matching model to obtain the target text features corresponding to the real-time bullet screen. Then, the computer device may obtain, from the data storage system, a video segment corresponding to a target video feature most similar to the target text feature as a target video segment corresponding to the real-time bullet screen. For example, the target video feature most similar to the target text feature is found by calculating the distance between the features, the cosine similarity, and the like.
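The deployment shortcut described above can be sketched as follows, again reusing the VideoTextMatchingModel instance from the earlier sketches; the storage layout and function names are assumptions.

```python
# Illustrative sketch: precompute target video features offline, then at serving
# time run only the text branch and retrieve the most similar stored segment.
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_segment_features(fused_video_features: torch.Tensor) -> torch.Tensor:
    """fused_video_features: (S, video_dim) for S segments -> (S, embed_dim), stored offline."""
    return model.video_net(fused_video_features)

@torch.no_grad()
def match_bullet_screen(initial_text_feature: torch.Tensor,
                        stored_segment_features: torch.Tensor) -> int:
    target_text = model.text_net(initial_text_feature)              # (embed_dim,)
    sims = F.cosine_similarity(stored_segment_features,
                               target_text.unsqueeze(0), dim=-1)    # (S,)
    return int(sims.argmax())    # index of the target video segment
```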
Of course, after the real-time bullet screen corresponding to the target video is obtained, the computer device may also input the initial text features corresponding to the real-time bullet screen and the fusion video features corresponding to the video segment to be matched into the trained video text matching model, and the model outputs the matching degree of the real-time bullet screen and the video segment to be matched. And determining a target video clip corresponding to the real-time bullet screen from the multiple video clips to be matched based on the matching degree output by the model.
In the embodiment, the video text matching model can be obtained through quick training in a supervised training mode.
In one embodiment, the method further comprises:
carrying out video medium replacement detection on the target video to obtain a detection result; when the detection result is that the target video is subjected to video medium replacement, updating the video segments of the target video to obtain all updated video segments corresponding to the target video; calculating the matching degree of each bullet screen corresponding to the target video and each updated video segment to obtain a plurality of updated matching degrees; and updating the association relation corresponding to each bullet screen based on each updating matching degree.
The video medium replacement detection is used for detecting whether medium replacement occurs in the video. The video medium replacement means that the content of the video changes, for example, the video is subjected to medium replacement due to insertion of an advertisement, the video is subjected to medium replacement due to deletion of a part of a shot, and the like.
Specifically, medium replacement of a video may make the duration of the new video inconsistent with the original video, and further cause the bullet screens of the original video to be mismatched with the new video. Therefore, in order to ensure that the bullet screens and the video segments always match, the computer device can perform video medium replacement detection on the target video; if the detection result indicates that the target video has undergone video medium replacement, the computer device can re-match all the bullet screens of the target video with the video segments to update the target video segment corresponding to each bullet screen. If the target video has undergone medium replacement, the computer device can update the video segments of the target video by performing video segmentation on the target video again to obtain the updated video segments corresponding to the target video. The computer device can then recalculate the matching degree between each bullet screen of the target video and each updated video segment to obtain a plurality of updated matching degrees, and update the association relation corresponding to each bullet screen based on the updated matching degrees. It can be understood that the bullet screens corresponding to the target video include historical bullet screens and current or future real-time bullet screens. A historical bullet screen is a real-time bullet screen received by the computer device in the past; the computer device can update the target video segment corresponding to the historical bullet screen based on the updated matching degree, update the association relation corresponding to the historical bullet screen, establish the association relation between the historical bullet screen and the newly determined target video segment, and recalibrate the bullet screen time of the historical bullet screen. For current and future real-time bullet screens, the computer device may determine the target video segment corresponding to each real-time bullet screen from the updated video segments and then establish the association relation between the two.
It is understood that the manner of calculating the matching degree between the bullet screen and the video clip can refer to the content of the foregoing related embodiments.
In one embodiment, whether medium replacement has occurred in a video may be determined by detecting whether the VID (video identification) of the video has changed. If the VID of a video changes, re-matching of the bullet screens and the video segments is triggered.
In the above embodiment, when the target video undergoes medium replacement, video segmentation is performed again and the bullet screens are re-matched with the video segments, so that the bullet screen time of each bullet screen remains accurate and each bullet screen is always played synchronously with its matched video segment.
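As an illustration only, the following sketch shows how such a VID-based check might trigger re-segmentation and re-matching; the helper callables (fetch_current_vid, resegment_video, rematch_all_bullets) are hypothetical interfaces, not functions defined in this application.

```python
def check_and_recalibrate(video_id, stored_vid, fetch_current_vid,
                          resegment_video, rematch_all_bullets):
    """Hypothetical helpers are injected as callables; only the control flow
    follows the description above."""
    current_vid = fetch_current_vid(video_id)
    if current_vid == stored_vid:
        return stored_vid  # no medium replacement detected, associations stay valid
    # Medium replacement detected: re-split the video into updated segments and
    # re-match every bullet screen (historical and future) against them.
    updated_segments = resegment_video(video_id)
    rematch_all_bullets(video_id, updated_segments)
    return current_vid
```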
In an embodiment, as shown in fig. 8, a video bullet screen matching method is provided, which is described by taking the method as an example of being applied to the play terminal in fig. 1, and includes the following steps:
Step S802, a real-time bullet screen corresponding to the target video is obtained.
In one embodiment, an application is installed on the playing terminal, such as a video application or an instant messaging application. Referring to fig. 9A, a user can watch a video in an audio/video application, and each bullet screen of the video is displayed as a sliding subtitle. The user may also watch videos and publish bullet screens in an instant messaging application. Referring to fig. 9B, within the instant messaging application a user can record and browse short videos published by others on a creation platform; the platform supports interactions such as likes and comments, and content can also be forwarded to a circle of friends, shared in a chat scene, or shared with friends. The platform provides a "floating comment" function: if the user enables it, the users' comments on a short video scroll across the video picture in a manner similar to bullet screens.
Step S804, sending the real-time barrage to a server so that the server extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from a target video, acquires fusion video features corresponding to the video segments to be matched, calculates the matching degree of the real-time barrage and each video segment to be matched respectively based on the initial text features and the fusion video features, determines the target video segments from the video segments to be matched based on the matching degree, and establishes the incidence relation between the real-time barrage and the target video segments; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video segment to be matched.
Specifically, after acquiring a real-time bullet screen published by a viewer of the target video, the playing terminal sends the real-time bullet screen to the server, so that the server calibrates the bullet screen time, determines the target video segment corresponding to the real-time bullet screen, and makes the bullet screen appear in the video segment whose content matches the bullet screen.
It is understood that the data processing procedure of the server may refer to the contents of the foregoing embodiments, and is not described herein again.
Step S806, acquiring the association relationship returned by the server, and synchronously playing the real-time barrage when the target video clip is played based on the association relationship.
Specifically, after determining the target video segment corresponding to the real-time bullet screen, the server may establish an association relationship between the real-time bullet screen and the corresponding target video segment, and send the association relationship to the playing terminal, so that the playing terminal may play the real-time bullet screen synchronously while playing the target video segment based on the association relationship. For example, the playing terminal may determine the starting playing time of the target video segment, and synchronously play the corresponding real-time barrage when the starting video frame of the target video segment starts playing.
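A minimal sketch of the terminal-side synchronization follows, assuming the association relation is delivered as bullet-to-segment pairs and that each segment carries start and end times in seconds; this data layout is an assumption made for illustration only.

```python
def bullets_for_position(associations, segments, position_s):
    """associations: list of (bullet_text, segment_id) pairs;
    segments: dict mapping segment_id -> (start_s, end_s);
    position_s: current playback position in seconds."""
    visible = []
    for text, seg_id in associations:
        start_s, end_s = segments[seg_id]
        # A bullet screen is played while playback is inside its target segment.
        if start_s <= position_s < end_s:
            visible.append(text)
    return visible

# Example: a bullet screen associated with a segment spanning 120-150 s is
# rendered as soon as playback reaches 120 s.
```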
In the above video bullet screen matching method, the playing terminal sends a real-time bullet screen corresponding to the target video to the server. The server extracts initial text features corresponding to the real-time bullet screen, determines a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, and acquires the initial video frame feature sequence corresponding to each video segment to be matched, where the initial video frame feature sequence includes the initial video frame features corresponding to the video frames in the same video segment to be matched. The server performs feature extraction on each initial video frame feature sequence to obtain the target video frame feature sequence corresponding to each video segment to be matched, performs feature fusion on each target video frame feature sequence to obtain the fusion video feature corresponding to each video segment to be matched, calculates the matching degree of the real-time bullet screen with each video segment to be matched based on the initial text features and the fusion video features, determines the target video segment from the video segments to be matched based on the matching degree, and establishes the association relationship between the real-time bullet screen and the target video segment. The server sends the association relationship to the playing terminal, and the playing terminal synchronously plays the real-time bullet screen when playing the target video segment based on the association relationship. In this way, whenever the playing terminal acquires a newly published real-time bullet screen while the user watches the target video, it sends the bullet screen to the server; the server can determine the target video segment whose content matches the real-time bullet screen through the matching degree calculated from the text features of the real-time bullet screen and the video features of the video segments, and the playing terminal can then synchronously play the content-matched real-time bullet screen when playing the target video segment. This improves the matching accuracy between bullet screens and video segments and ensures that the bullet screens and video segments of the target video are always accurately matched during playback.
In a specific embodiment, referring to fig. 10, the video bullet screen matching method of the present application includes the following steps:
First, data preparation
1. Video segment segmentation
For any one video, the server divides the video into a plurality of video segments. For example, the server may perform video black frame detection and scene change detection on the video, determine a plurality of split video frames from a complete video according to the detection result, and split the video into a plurality of video segments according to the split video frames.
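A rough sketch of this segmentation step using OpenCV is given below; the grayscale and HSV thresholds are illustrative assumptions rather than values disclosed in this application.

```python
import cv2
import numpy as np

def find_split_frames(video_path, dark_thresh=10.0, change_thresh=40.0):
    """Return indices of candidate split video frames."""
    cap = cv2.VideoCapture(video_path)
    splits, prev_hsv, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        if gray.mean() < dark_thresh:
            # Black-frame detection: overall brightness below a first threshold.
            splits.append(idx)
        elif prev_hsv is not None and np.abs(hsv - prev_hsv).mean() > change_thresh:
            # Scene-change detection: mean pixel change between adjacent frames
            # above a second threshold.
            splits.append(idx)
        prev_hsv = hsv
        idx += 1
    cap.release()
    return splits
```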
2. Video feature extraction
For any video segment, the server performs feature extraction on the video frames in the segment to obtain the initial video frame feature corresponding to each video frame, and sorts the initial video frame features by time to obtain an initial video frame feature sequence, where each initial video frame feature includes video frame sub-features corresponding to a plurality of feature channels. The server shifts some of the feature channels along the time dimension to exchange information between adjacent frames, and then applies a two-dimensional convolution to improve the expressive power of the features. The two-dimensional convolution yields a target video frame feature sequence composed of the target video frame features corresponding to the video frames.
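The channel shift can be sketched as follows; this is a simplified PyTorch illustration in the spirit of the description above, and the proportion of shifted channels and the zero padding used for the start and end frames are assumptions, not values disclosed in this application.

```python
import torch

def temporal_shift(frame_feats, shift_div=8):
    """frame_feats: tensor of shape (T, C, H, W), one feature map per frame,
    ordered by timestamp."""
    c = frame_feats.shape[1]
    fold = c // shift_div
    out = torch.zeros_like(frame_feats)
    # First group of channels: each frame receives the previous frame's
    # sub-features; the starting frame keeps a preset (zero) sub-feature.
    out[1:, :fold] = frame_feats[:-1, :fold]
    # Second group: each frame receives the next frame's sub-features; the
    # ending frame keeps a preset (zero) sub-feature.
    out[:-1, fold:2 * fold] = frame_feats[1:, fold:2 * fold]
    # Remaining channels are left unshifted.
    out[:, 2 * fold:] = frame_feats[:, 2 * fold:]
    # A per-frame 2D convolution (e.g. torch.nn.Conv2d(c, c, 3, padding=1))
    # applied to `out` then yields the target video frame features.
    return out
```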
Further, the server can perform feature fusion on the target video frame feature sequence, fuse the video frame level data into video level data, and obtain the fusion video features corresponding to the video segments.
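A simplified sketch of such frame-to-clip fusion, consistent with the cluster-centre and attention based fusion described later for the feature fusion module, is shown below; the shapes and the linear attention layer are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_frames(frame_feats, centers, attn_layer):
    """frame_feats: (T, D) target video frame features of one segment;
    centers: (K, D) cluster centre features, one per video-frame topic;
    attn_layer: e.g. torch.nn.Linear(D, 1) producing per-frame attention."""
    dists = torch.cdist(frame_feats, centers)      # (T, K) distances to centres
    nearest = dists.argmin(dim=1)                  # target centre per frame
    residuals = frame_feats - centers[nearest]     # target feature distances
    weights = F.softmax(attn_layer(frame_feats).squeeze(-1), dim=0)  # attention
    return (weights.unsqueeze(-1) * residuals).sum(dim=0)  # fused clip feature
```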
3. Text feature extraction
The server performs feature extraction on the bullet screen to obtain the initial text features corresponding to the bullet screen.
Second, training phase
The server trains a video text matching model for predicting the matching degree between a video segment and a bullet screen. The video text matching model includes two branches: one branch takes the fusion video feature of a video segment as input, the other takes the initial text feature of a bullet screen as input, and each branch performs further feature extraction through several fully connected layers. During training, the features of the training video segments and training bullet screens in the training samples are input into the video text matching model; the two branches output the target text feature and the target video feature respectively, the cosine similarity between the target text feature and the target video feature is calculated, and the matching degree between the bullet screen and the video segment is predicted through a fully connected layer and a Sigmoid activation function to obtain a prediction label. The training loss is calculated based on the training label and the prediction label corresponding to each training sample, and the model parameters of the video text matching model are adjusted based on the training loss until a convergence condition is met, yielding the trained video text matching model. It can be understood that if the training labels are binary labels (match or no match), the training loss is a binary classification loss.
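The two-branch model and its loss might look roughly as follows in PyTorch; the layer sizes and feature dimensions are placeholders, not values disclosed in this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatcher(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, hidden=256):
        super().__init__()
        # Text branch: further feature extraction through fully connected layers.
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden))
        # Video branch: same structure over the fusion video feature.
        self.video_branch = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, hidden))
        # Fully connected layer over the cosine similarity, followed by Sigmoid.
        self.head = nn.Linear(1, 1)

    def forward(self, text_feat, video_feat):
        t = self.text_branch(text_feat)     # target text feature
        v = self.video_branch(video_feat)   # target video feature
        sim = F.cosine_similarity(t, v, dim=-1).unsqueeze(-1)
        return torch.sigmoid(self.head(sim)).squeeze(-1)  # predicted matching degree

# Binary cross-entropy against match / no-match training labels.
model = VideoTextMatcher()
pred = model(torch.randn(4, 768), torch.randn(4, 1024))
loss = nn.BCELoss()(pred, torch.tensor([1.0, 0.0, 1.0, 0.0]))
loss.backward()
```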
Third, off-line stage
Once a video has been uploaded to the server, medium replacement does not occur frequently, so video segment segmentation and video feature extraction can be triggered automatically after the video is uploaded, and the fusion video features corresponding to the video segments are input into the trained video text matching model to obtain the target video features. The server may then store the target video features corresponding to the respective video segments in a database, for example a hard disk or an in-memory database.
Fourth, on-line stage
Every time the playing terminal acquires a real-time bullet screen published by the user while watching the target video, it sends the real-time bullet screen to the server. After acquiring the real-time bullet screen corresponding to the target video, the server automatically triggers text feature extraction and inputs the initial text feature corresponding to the real-time bullet screen into the trained video text matching model to obtain the target text feature. The server then performs nearest neighbor retrieval in the video feature library corresponding to the target video based on the target text feature; the video segment corresponding to the retrieved target video feature is taken as the video segment that best matches the real-time bullet screen, the real-time bullet screen is considered to belong to that segment, and an association relationship between the real-time bullet screen and the video segment is established. The association relationship is used to play the real-time bullet screen synchronously when that target video segment is played.
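The retrieval step can be sketched as a simple nearest neighbour search over the stored target video features; the cosine similarity and dictionary layout here are assumptions for illustration, and a production system might instead use an approximate nearest neighbour index.

```python
import numpy as np

def nearest_segment(text_feat, segment_feats):
    """text_feat: (D,) target text feature of the real-time bullet screen;
    segment_feats: dict mapping segment_id -> (D,) target video feature."""
    best_id, best_sim = None, -np.inf
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    for seg_id, v in segment_feats.items():
        sim = float(t @ (v / (np.linalg.norm(v) + 1e-8)))
        if sim > best_sim:
            best_id, best_sim = seg_id, sim
    return best_id, best_sim  # segment the bullet screen is associated with
```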
If the video undergoes medium replacement, the server needs to perform video segment segmentation, video feature extraction and related processing on the video again, perform matching prediction again for all bullet screens of the video, and recalibrate the bullet screen times.
It will be appreciated that the prediction phase in FIG. 10 includes the offline phase and the online phase described above.
With the above video bullet screen matching method, bullet screens on a video platform whose text does not match the picture can be continuously calibrated so that they appear in the corresponding video segments, improving the matching accuracy between bullet screens and video segments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a video bullet screen matching device for realizing the video bullet screen matching method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so that specific limitations in one or more embodiments of the video bullet screen matching device provided below can be referred to the limitations of the video bullet screen matching method in the foregoing, and details are not repeated herein.
In one embodiment, as shown in fig. 11, there is provided a video bullet screen matching device 1100, which specifically includes: a bullet screen processing module 1102, a video feature obtaining module 1104, a matching degree calculating module 1106, a target video segment determining module 1108 and an association relationship establishing module 1110, wherein:
The bullet screen processing module 1102 is configured to acquire a real-time bullet screen corresponding to the target video and extract the initial text features corresponding to the real-time bullet screen.
The video feature acquisition module 1104 is configured to determine a plurality of to-be-matched video segments corresponding to the real-time barrage from the target video, and acquire a fusion video feature corresponding to each to-be-matched video segment; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video segment to be matched.
The matching degree calculation module 1106 is configured to calculate, based on the initial text features and the fusion video features, the matching degree of the real-time bullet screen with each video segment to be matched.
The target video segment determining module 1108 is configured to determine the target video segment from the video segments to be matched based on the matching degree.
The association relationship establishing module 1110 is configured to establish an association relationship between the real-time barrage and the target video segment, where the association relationship is used to play the real-time barrage synchronously when the target video segment is played.
With the above video bullet screen matching device, every time a real-time bullet screen published by the user while watching the target video is acquired, the target video segment whose content matches the real-time bullet screen is determined through a matching degree calculated from the text features of the real-time bullet screen and the video features of the video segments, and the real-time bullet screen is then played synchronously when the target video segment is played. This improves the matching accuracy between bullet screens and video segments and ensures that the bullet screens and video segments of the target video are always accurately matched during playback.
In one embodiment, the video feature acquisition module comprises:
the device comprises a to-be-matched video clip determining unit, a to-be-matched video clip determining unit and a matching unit, wherein the to-be-matched video clip determining unit is used for determining split video frames from all target video frames based on pixel information corresponding to all target video frames of a target video; performing video segmentation on a target video based on a segmented video frame to obtain a plurality of initial video segments; and determining a plurality of video clips to be matched corresponding to the real-time bullet screens from each initial video clip.
In one embodiment, the to-be-matched video segment determining unit is further configured to obtain a first pixel value of each pixel point in each target video frame in a first color space; counting first pixel values corresponding to the same target video frame to obtain pixel information corresponding to each target video frame; and taking the target video frame with the pixel information smaller than the first threshold value as a segmentation video frame.
In one embodiment, the to-be-matched video segment determining unit is further configured to obtain a second pixel value of each pixel point in the same target video frame in a second color space, so as to obtain pixel information corresponding to each target video frame; calculating pixel change information between the adjacent target video frames based on second pixel values corresponding to the matched pixel points in the adjacent target video frames; and determining the split video frame from the adjacent target video frames with the pixel change information larger than the second threshold value.
In one embodiment, the video feature acquisition module comprises:
the fusion video feature acquisition unit is used for respectively extracting features of each video clip to be matched to obtain an initial video frame feature sequence corresponding to each video clip to be matched; the initial video frame feature sequence is obtained by sequencing initial video frame features respectively corresponding to each video frame in the same video clip to be matched according to the video frame time stamp, wherein the initial video frame features comprise video frame sub-features respectively corresponding to a plurality of feature channels; in an initial video frame feature sequence corresponding to the same video clip to be matched, performing feature shift on video frame sub-features corresponding to a target feature channel based on the sequencing information of the initial video frame features to obtain an intermediate video frame feature sequence corresponding to each video clip to be matched; respectively carrying out two-dimensional convolution processing on each intermediate video frame characteristic sequence to obtain a target video frame characteristic sequence corresponding to each current video clip to be matched; and respectively carrying out feature fusion on the feature sequences of the target video frames to obtain fusion video features corresponding to the current video segments to be matched.
In one embodiment, the fused video feature obtaining unit is further configured to take, in an initial video frame feature sequence corresponding to the current video segment to be matched, a video frame sub-feature corresponding to a target feature channel in each initial video frame feature as a target sub-feature; aiming at the initial video frame characteristics, updating the target sub-characteristics corresponding to the adjacent video frames based on the target sub-characteristics corresponding to the current video frame to obtain the intermediate video frame characteristics corresponding to each video frame of the current video clip to be matched; and sequencing the characteristics of the intermediate video frames according to the video frame time stamps to obtain an intermediate video frame characteristic sequence corresponding to the current video clip to be matched.
In an embodiment, the fused video feature obtaining unit is further configured to update a target sub-feature corresponding to a next video frame based on a target sub-feature corresponding to a current video frame, and configure the target sub-feature corresponding to a starting video frame as a preset sub-feature.
In one embodiment, the target sub-features include a first sub-feature corresponding to a first feature channel in the target feature channel and a second sub-feature corresponding to other feature channels in the target feature channel. The fused video feature obtaining unit is further used for updating a first sub-feature corresponding to a next video frame based on a first sub-feature corresponding to a current video frame, and configuring the first sub-feature corresponding to the starting video frame as a preset sub-feature; and updating a second sub-feature corresponding to the last video frame based on the second sub-feature corresponding to the current video frame, and configuring a second sub-feature corresponding to the ending video frame as a preset sub-feature.
In one embodiment, the video bullet screen matching device 1100 includes:
the characteristic fusion module is used for acquiring a plurality of clustering center characteristics; each clustering center feature corresponds to different video frame topics; determining a target center feature from each clustering center feature based on the target video frame feature corresponding to the same video frame and the distance between each clustering center feature aiming at a target video frame feature sequence corresponding to a current video clip to be matched to obtain the target center feature corresponding to each video frame of the current video clip to be matched; obtaining target characteristic distances corresponding to all video frames of the current video clip to be matched based on the distance between the target video frame characteristic and the target center characteristic corresponding to the same video frame; performing attention distribution on the characteristics of each target video frame corresponding to the current video clip to be matched to obtain attention weights corresponding to each video frame of the current video clip to be matched; and fusing the distances of the target features based on the attention weight to obtain fused video features corresponding to the current video clip to be matched.
In one embodiment, the video bullet screen matching device 1100 includes:
the characteristic processing module is used for inputting the current video clip to be matched into the video characteristic extraction model; the video feature extraction model comprises a first feature extraction layer, a second feature extraction layer and a feature fusion layer; performing feature extraction on the current video clip to be matched through a first feature extraction layer to obtain an initial video frame feature sequence corresponding to the current video clip to be matched; performing characteristic shift on video frame sub-characteristics corresponding to a target characteristic channel in an initial video frame characteristic sequence through a second characteristic extraction layer based on the sequencing information of the initial video frame characteristics to obtain an intermediate video frame characteristic sequence corresponding to the current video clip to be matched; performing two-dimensional convolution processing on the intermediate video frame feature sequence through a second feature extraction layer to obtain a target video frame feature sequence corresponding to the current video clip to be matched; and performing feature fusion on the target video frame feature sequence through the feature fusion layer to obtain fusion video features corresponding to the current video segment to be matched.
In one embodiment, the matching degree calculation module is further configured to input the real-time bullet screen into the text processing model to obtain an initial text feature; inputting the initial text features into a text feature extraction network in a trained video text matching model to obtain target text features corresponding to the real-time bullet screen; inputting the fused video features into a video feature extraction network in a video text matching model to obtain target video features; and obtaining the matching degree of the real-time bullet screen and each video clip to be matched respectively based on the similarity between the target text characteristic and the target video characteristic.
In one embodiment, the video bullet screen matching device 1100 includes:
the model training module is used for acquiring training samples and training labels corresponding to the training samples; the training sample comprises a training video clip and a training bullet screen; inputting the fusion video features corresponding to the training video clips and the initial text features corresponding to the training barrage into a video text matching model to be trained to obtain a prediction label; the prediction label is obtained by extracting the similarity between the text feature output by the network and the video feature output by the network based on the text feature; and calculating training loss based on the training labels and the prediction labels, and adjusting model parameters of the video text matching model to be trained based on the training loss until a convergence condition is met to obtain the trained video text matching model.
In an embodiment, the target video segment determining module is further configured to obtain, from the matching degrees, a video segment to be matched corresponding to the maximum matching degree as the target video segment.
In one embodiment, the video bullet screen matching device 1100 includes:
the video medium replacement detection module is used for carrying out video medium replacement detection on the target video to obtain a detection result; when the detection result is that the target video is subjected to video medium replacement, updating the video segments of the target video to obtain all updated video segments corresponding to the target video; calculating the matching degree of each bullet screen corresponding to the target video and each updated video segment to obtain a plurality of updated matching degrees; and updating the association relation corresponding to each bullet screen based on each updating matching degree.
In one embodiment, as shown in fig. 12, there is provided a video bullet screen matching apparatus 1200, which specifically includes: bullet screen acquisition module 1202, data matching module 1204 and bullet screen play module 1206, wherein:
The bullet screen obtaining module 1202 is configured to obtain a real-time bullet screen corresponding to the target video.
The data matching module 1204 is configured to send the real-time bullet screen to the server, so that the server extracts initial text features corresponding to the real-time bullet screen, determines a plurality of to-be-matched video segments corresponding to the real-time bullet screen from the target video, obtains fusion video features corresponding to the to-be-matched video segments, calculates matching degrees of the real-time bullet screen and the to-be-matched video segments respectively based on the initial text features and the fusion video features, determines the target video segments from the to-be-matched video segments based on the matching degrees, and establishes an association relationship between the real-time bullet screen and the target video segments; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video segment to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video segment to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video segment to be matched.
The bullet screen playing module 1206 is configured to obtain the association relationship returned by the server and, based on the association relationship, synchronously play the real-time bullet screen when the target video segment is played.
With the above video bullet screen matching device, every time the playing terminal acquires a real-time bullet screen published by the user while watching the target video, it sends the bullet screen to the server. The server can determine the target video segment whose content matches the real-time bullet screen through a matching degree calculated from the text features of the real-time bullet screen and the video features of the video segments, and the playing terminal can then synchronously play the content-matched real-time bullet screen when playing the target video segment. This improves the matching accuracy between bullet screens and video segments and ensures that the bullet screens and video segments of the target video are always accurately matched during playback.
All or part of the modules in the video bullet screen matching device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as target videos, fusion video characteristics, target video characteristics and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video bullet screen matching method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 14. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video bullet screen matching method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structures shown in fig. 13 and fig. 14 are merely block diagrams of partial structures relevant to the solution of the present application and do not constitute a limitation on the computer devices to which the solution of the present application is applied; a specific computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (20)

1. A video bullet screen matching method is characterized by comprising the following steps:
acquiring a real-time bullet screen corresponding to a target video, and extracting initial text characteristics corresponding to the real-time bullet screen;
determining a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, and acquiring fusion video characteristics corresponding to the video segments to be matched; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video clip to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video clip to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video clip to be matched;
calculating the matching degree of the real-time bullet screen and each video clip to be matched respectively based on the initial text features and the fusion video features;
determining a target video clip from the video clips to be matched based on the matching degree;
and establishing an incidence relation between the real-time bullet screen and the target video clip, wherein the incidence relation is used for synchronously playing the real-time bullet screen when the target video clip is played.
2. The method according to claim 1, wherein the determining a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video comprises:
determining a segmentation video frame from each target video frame based on pixel information corresponding to each target video frame of the target video;
performing video segmentation on the target video based on the segmented video frame to obtain a plurality of initial video segments;
and determining a plurality of video clips to be matched corresponding to the real-time bullet screen from each initial video clip.
3. The method according to claim 2, wherein the determining a sliced video frame from each target video frame of the target video based on the pixel information corresponding to each target video frame comprises:
acquiring a first pixel value of each pixel point in each target video frame in a first color space;
counting first pixel values corresponding to the same target video frame to obtain pixel information corresponding to each target video frame;
and taking the target video frame with the pixel information smaller than the first threshold value as the segmentation video frame.
4. The method according to claim 2, wherein the determining a sliced video frame from each target video frame of the target video based on the pixel information corresponding to each target video frame comprises:
acquiring a second pixel value of each pixel point in the same target video frame in a second color space to obtain pixel information corresponding to each target video frame;
calculating pixel change information between the adjacent target video frames based on second pixel values corresponding to the matched pixel points in the adjacent target video frames;
determining the sliced video frame from the adjacent target video frames with the pixel change information larger than a second threshold value.
5. The method according to claim 1, wherein the obtaining of the fused video features corresponding to the video segments to be matched comprises:
respectively extracting the characteristics of each video clip to be matched to obtain an initial video frame characteristic sequence corresponding to each video clip to be matched; the initial video frame feature sequence is obtained by sequencing initial video frame features respectively corresponding to each video frame in the same video clip to be matched according to video frame time stamps, wherein the initial video frame features comprise video frame sub-features respectively corresponding to a plurality of feature channels;
in an initial video frame feature sequence corresponding to the same video clip to be matched, performing feature shift on video frame sub-features corresponding to a target feature channel based on the sequencing information of the initial video frame features to obtain an intermediate video frame feature sequence corresponding to each video clip to be matched;
respectively carrying out two-dimensional convolution processing on each intermediate video frame feature sequence to obtain a target video frame feature sequence corresponding to each current video clip to be matched;
and respectively carrying out feature fusion on the feature sequences of the target video frames to obtain fusion video features corresponding to the current video segments to be matched.
6. The method according to claim 5, wherein the performing feature shift on the video frame sub-features corresponding to the target feature channel based on the ordering information of the initial video frame features in the initial video frame feature sequence corresponding to the same video segment to be matched to obtain the intermediate video frame feature sequence corresponding to each video segment to be matched comprises:
in an initial video frame feature sequence corresponding to a current video clip to be matched, taking a video frame sub-feature corresponding to a target feature channel in each initial video frame feature as a target sub-feature;
aiming at the initial video frame characteristics, updating the target sub-characteristics corresponding to the adjacent video frames based on the target sub-characteristics corresponding to the current video frame to obtain the intermediate video frame characteristics corresponding to each video frame of the current video clip to be matched;
and sequencing the characteristics of the intermediate video frames according to the video frame time stamps to obtain an intermediate video frame characteristic sequence corresponding to the current video clip to be matched.
7. The method of claim 6, wherein the updating the target sub-feature corresponding to the neighboring video frame based on the target sub-feature corresponding to the current video frame comprises:
and updating the target sub-feature corresponding to the next video frame based on the target sub-feature corresponding to the current video frame, and configuring the target sub-feature corresponding to the starting video frame as a preset sub-feature.
8. The method of claim 6, wherein the target sub-features comprise first sub-features corresponding to a first one of the target feature channels and second sub-features corresponding to other ones of the target feature channels;
the updating of the target sub-feature corresponding to the adjacent video frame based on the target sub-feature corresponding to the current video frame includes:
updating a first sub-feature corresponding to a next video frame based on the first sub-feature corresponding to the current video frame, and configuring the first sub-feature corresponding to the starting video frame as a preset sub-feature;
and updating a second sub-feature corresponding to the last video frame based on the second sub-feature corresponding to the current video frame, and configuring a second sub-feature corresponding to the ending video frame as a preset sub-feature.
9. The method of claim 1, wherein the feature fusion comprises the steps of:
acquiring a plurality of clustering center features; each clustering center feature corresponds to different video frame topics;
determining a target center feature from each clustering center feature based on the distance between the target video frame feature corresponding to the same video frame and each clustering center feature for a target video frame feature sequence corresponding to a current video clip to be matched to obtain the target center feature corresponding to each video frame of the current video clip to be matched;
obtaining target characteristic distances corresponding to all video frames of the current video clip to be matched based on the distance between the target video frame characteristic and the target center characteristic corresponding to the same video frame;
performing attention distribution on the characteristics of each target video frame corresponding to the current video clip to be matched to obtain attention weights corresponding to each video frame of the current video clip to be matched;
and fusing the distances of all the target features based on the attention weight to obtain fused video features corresponding to the current video clip to be matched.
10. The method according to any one of claims 1 to 9, wherein the generation process of the fused video feature comprises the following steps:
inputting a current video clip to be matched into a video feature extraction model; the video feature extraction model comprises a first feature extraction layer, a second feature extraction layer and a feature fusion layer;
performing feature extraction on the current video clip to be matched through the first feature extraction layer to obtain an initial video frame feature sequence corresponding to the current video clip to be matched;
performing feature shift on video frame sub-features corresponding to a target feature channel in an initial video frame feature sequence based on the sequencing information of the initial video frame features through the second feature extraction layer to obtain an intermediate video frame feature sequence corresponding to the current video clip to be matched;
performing two-dimensional convolution processing on the intermediate video frame feature sequence through the second feature extraction layer to obtain a target video frame feature sequence corresponding to the current video clip to be matched;
and performing feature fusion on the feature sequence of the target video frame through the feature fusion layer to obtain fusion video features corresponding to the current video clip to be matched.
11. The method according to any one of claims 1 to 9, wherein the calculating the matching degree of the real-time barrage with each video segment to be matched respectively based on the initial text feature and the fused video feature comprises:
inputting the real-time barrage into a text processing model to obtain the initial text characteristics;
inputting the initial text features into a text feature extraction network in a trained video text matching model to obtain target text features corresponding to the real-time bullet screen;
inputting the fusion video features into a video feature extraction network in the video text matching model to obtain target video features;
and obtaining the matching degree of the real-time bullet screen and each video clip to be matched respectively based on the similarity between the target text characteristic and the target video characteristic.
12. The method of claim 11, wherein the training process of the video text matching model comprises the following steps:
acquiring a training sample and a training label corresponding to the training sample; the training sample comprises a training video clip and a training bullet screen;
inputting the fusion video features corresponding to the training video clips and the initial text features corresponding to the training barrage into a video text matching model to be trained to obtain a prediction label; the prediction label is obtained based on the similarity between the text feature output by the text feature extraction network and the video feature output by the video feature extraction network;
and calculating training loss based on the training labels and the prediction labels, and adjusting model parameters of the video text matching model to be trained based on the training loss until a convergence condition is met to obtain the trained video text matching model.
13. The method according to any one of claims 1 to 9, wherein the determining a target video segment from the video segments to be matched based on the matching degree comprises:
and acquiring the video clip to be matched corresponding to the maximum matching degree from all the matching degrees as a target video clip.
14. The method according to any one of claims 1 to 9, further comprising:
carrying out video medium replacement detection on the target video to obtain a detection result;
when the detection result indicates that the target video is subjected to video medium replacement, updating the video segments of the target video to obtain all updated video segments corresponding to the target video;
calculating the matching degree between each bullet screen corresponding to the target video and each updated video segment to obtain a plurality of updated matching degrees;
and updating the association relation corresponding to each bullet screen based on each updating matching degree.
15. A video bullet screen matching method is characterized by comprising the following steps:
acquiring a real-time bullet screen corresponding to a target video;
sending the real-time barrage to a server so that the server extracts initial text features corresponding to the real-time barrage, determines a plurality of video segments to be matched corresponding to the real-time barrage from the target video, acquires fusion video features corresponding to the video segments to be matched, calculates the matching degree of the real-time barrage with each video segment to be matched based on the initial text features and the fusion video features, determines target video segments from the video segments to be matched based on the matching degree, and establishes the incidence relation between the real-time barrage and the target video segments; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video clip to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video clip to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video clip to be matched;
and acquiring the incidence relation returned by the server, and synchronously playing the real-time barrage when the target video clip is played based on the incidence relation.
16. A video bullet screen matching device, characterized in that said device comprises:
the bullet screen processing module is used for acquiring a real-time bullet screen corresponding to the target video and extracting initial text characteristics corresponding to the real-time bullet screen;
the video feature acquisition module is used for determining a plurality of video segments to be matched corresponding to the real-time barrage from the target video and acquiring fusion video features corresponding to the video segments to be matched; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video clip to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video clip to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video clip to be matched;
the matching degree calculation module is used for calculating the matching degree of the real-time bullet screen and each video clip to be matched respectively based on the initial text features and the fusion video features;
the target video clip determining module is used for determining a target video clip from the video clips to be matched based on the matching degree;
and the incidence relation establishing module is used for establishing the incidence relation between the real-time bullet screen and the target video clip, and the incidence relation is used for synchronously playing the real-time bullet screen when the target video clip is played.
17. A video bullet screen matching device, characterized in that said device comprises:
the bullet screen acquisition module is used for acquiring a real-time bullet screen corresponding to the target video;
the data matching module is used for sending the real-time bullet screen to a server so that the server extracts initial text features corresponding to the real-time bullet screen, determining a plurality of video segments to be matched corresponding to the real-time bullet screen from the target video, acquiring fusion video features corresponding to the video segments to be matched, calculating the matching degree of the real-time bullet screen and each video segment to be matched based on the initial text features and the fusion video features, determining target video segments from the video segments to be matched based on the matching degree, and establishing the association relationship between the real-time bullet screen and the target video segments; the fusion video features are obtained by performing feature fusion on a target video frame feature sequence corresponding to a video clip to be matched, the target video frame feature sequence is obtained by performing feature extraction on each video frame in the video clip to be matched, and the target video frame feature sequence comprises target video frame features corresponding to each video frame in the same video clip to be matched;
and the bullet screen playing module is used for acquiring the incidence relation returned by the server and synchronously playing the real-time bullet screen when the target video clip is played based on the incidence relation.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 14 or 15.
19. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 14 or 15.
20. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 14 or 15 when executed by a processor.
CN202111494410.2A 2021-12-08 2021-12-08 Video bullet screen matching method, device, computer equipment and storage medium Active CN114339362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494410.2A CN114339362B (en) 2021-12-08 2021-12-08 Video bullet screen matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111494410.2A CN114339362B (en) 2021-12-08 2021-12-08 Video bullet screen matching method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114339362A true CN114339362A (en) 2022-04-12
CN114339362B CN114339362B (en) 2023-06-13

Family

ID=81049929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494410.2A Active CN114339362B (en) 2021-12-08 2021-12-08 Video bullet screen matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114339362B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105228013A (en) * 2015-09-28 2016-01-06 百度在线网络技术(北京)有限公司 Barrage information processing method, device and barrage video player
CN106921891A (en) * 2015-12-24 2017-07-04 北京奇虎科技有限公司 The methods of exhibiting and device of a kind of video feature information
US20180267970A1 (en) * 2016-06-03 2018-09-20 Tencent Technology (Shenzhen) Company Limited Bullet screen content processing method, application server, and user terminal
WO2019101038A1 (en) * 2017-11-22 2019-05-31 腾讯科技(深圳)有限公司 Bullet screen content control method, computer device and storage medium
CN111526374A (en) * 2019-02-01 2020-08-11 广州虎牙信息科技有限公司 Live broadcast-based bullet screen processing method, stream pulling method and device
US20200366965A1 (en) * 2019-05-17 2020-11-19 Shanghai Bilibili Technology Co., Ltd. Method of displaying comment information, computing device, and readable storage medium
CN110933511A (en) * 2019-11-29 2020-03-27 维沃移动通信有限公司 Video sharing method, electronic device and medium
CN111614986A (en) * 2020-04-03 2020-09-01 威比网络科技(上海)有限公司 Bullet screen generation method, system, equipment and storage medium based on online education
CN111708915A (en) * 2020-06-12 2020-09-25 腾讯科技(深圳)有限公司 Content recommendation method and device, computer equipment and storage medium
CN111984823A (en) * 2020-09-01 2020-11-24 咪咕文化科技有限公司 Video searching method, electronic device and storage medium
CN112533051A (en) * 2020-11-27 2021-03-19 腾讯科技(深圳)有限公司 Bullet screen information display method and device, computer equipment and storage medium
CN113691838A (en) * 2021-08-24 2021-11-23 北京快乐茄信息技术有限公司 Audio bullet screen processing method and device, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174957A (en) * 2022-06-27 2022-10-11 咪咕文化科技有限公司 Bullet screen calling method and device, computer equipment and readable storage medium
CN115174957B (en) * 2022-06-27 2023-08-15 咪咕文化科技有限公司 Barrage calling method and device, computer equipment and readable storage medium
CN115361595A (en) * 2022-07-28 2022-11-18 华中科技大学 Video bullet screen generation method
CN115361595B (en) * 2022-07-28 2024-04-26 华中科技大学 Video barrage generation method
CN117649477A (en) * 2024-01-30 2024-03-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114339362B (en) 2023-06-13

Similar Documents

Publication Title
CN109874053B (en) Short video recommendation method based on video content understanding and user dynamic interest
CN110795657B (en) Article pushing and model training method and device, storage medium and computer equipment
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN112749608A (en) Video auditing method and device, computer equipment and storage medium
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
CN110166802B (en) Bullet screen processing method and device and storage medium
CN114339362B (en) Video bullet screen matching method, device, computer equipment and storage medium
CN113095346A (en) Data labeling method and data labeling device
CN112533051A (en) Bullet screen information display method and device, computer equipment and storage medium
CN111372141B (en) Expression image generation method and device and electronic equipment
CN111860353A (en) Video behavior prediction method, device and medium based on double-flow neural network
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
CN112149642A (en) Text image recognition method and device
CN111639230B (en) Similar video screening method, device, equipment and storage medium
CN115640449A (en) Media object recommendation method and device, computer equipment and storage medium
CN111597361B (en) Multimedia data processing method, device, storage medium and equipment
CN113572981A (en) Video dubbing method and device, electronic equipment and storage medium
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN115379290A (en) Video processing method, device, equipment and storage medium
CN113821677A (en) Method, device and equipment for generating cover image and storage medium
US20220207864A1 (en) Dynamic media content categorization method
CN113573043B (en) Video noise point identification method, storage medium and equipment
CN113810751B (en) Video processing method and device, electronic device and server
CN113395584B (en) Video data processing method, device, equipment and medium
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40072023; Country of ref document: HK)
GR01 Patent grant