WO2023115838A1 - Video text tracking method and electronic device - Google Patents

Video text tracking method and electronic device

Info

Publication number
WO2023115838A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
text
video frame
video
fusion
Prior art date
Application number
PCT/CN2022/097877
Other languages
French (fr)
Chinese (zh)
Inventor
李壮
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2023115838A1 publication Critical patent/WO2023115838A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a video text tracking method and an electronic device.
  • a video contains a lot of text, which can serve as an objective description of the targets in the video and a subjective summary of the scene.
  • the text in a video carries information such as start and end time, position change, and text content; correctly tracking the text in a video is therefore a key step in video understanding.
  • the disclosure provides a video text tracking method and an electronic device, which achieve a higher text tracking accuracy rate and reduce the consumption of computing resources during video text tracking.
  • the technical solution of the disclosure is as follows:
  • a video text tracking method, including:
  • extracting a plurality of video frames from a video; acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area; obtaining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the acquiring of the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area includes:
  • for each video frame, determining the text area from the video frame; encoding the position information of the text area to obtain the position feature; extracting the image feature from the video frame based on the position information of the text area; and decoding the image feature to obtain the character sequence feature.
  • the determining of the text area from the video frame includes:
  • performing feature extraction on the video frame to obtain a feature map corresponding to the video frame; determining a text area heat map corresponding to the feature map, where the text area heat map represents the text area and non-text area in the feature map;
  • performing connected domain analysis on the text area heat map to obtain the text area.
  • the extracting of the image feature from the video frame based on the position information of the text region includes: mapping the position information of the text region onto the feature map to obtain a mapping region corresponding to the text region; and extracting the image located in the mapping region from the feature map to obtain the image feature.
  • the obtaining of the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame includes: for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and adjusting the size of the fusion feature to obtain the fusion feature descriptor.
  • the reference tracking trajectory includes a reference text area
  • the plurality of video frames are continuous on the time axis
  • the updating of the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video includes:
  • traversing each video frame in sequence and, for each traversed current video frame, performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain a target tracking track corresponding to the current video frame; and determining the target tracking track as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking trajectory of the text in the video.
  • the updating of the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track includes:
  • when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking trajectory;
  • when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
  • the method is executed based on a text tracking network, and the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • a video text tracking device including:
  • An extraction module configured to extract a plurality of video frames from the video
  • the feature acquisition module is configured to obtain the text sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the corresponding image feature of the text area;
  • the descriptor acquisition module is configured to obtain the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature and the image feature of each video frame;
  • the tracking trajectory determination module is configured to update the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to characterize the position information of the text in the video.
  • the feature acquisition module includes:
  • a text area determination unit configured to, for each video frame, determine the text area from the video frame
  • a location feature acquisition unit configured to encode the location information of the text area to obtain the location feature
  • an image feature extraction unit configured to extract the image feature from the video frame based on the location information of the text region
  • the character sequence feature acquisition unit is configured to decode the image feature to obtain the character sequence feature.
  • the text area determination unit includes:
  • the feature map determination subunit is configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
  • the text area heat map determination subunit is configured to determine the text area heat map corresponding to the feature map; the text area heat map represents the text area and non-text area in the feature map;
  • the connected domain analysis subunit is configured to perform connected domain analysis on the text area heat map to obtain the text area.
  • the image feature extraction unit includes:
  • the mapping subunit is configured to map the position information of the text region to the feature map to obtain a mapping region corresponding to the text region;
  • the image feature extraction subunit is configured to extract the image located in the mapping area from the feature map to obtain the image feature.
  • the descriptor acquisition module includes:
  • the splicing unit is configured to, for each video frame, splice the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
  • the adjustment unit is configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
  • the reference tracking track includes a reference text area
  • the plurality of video frames are a plurality of video frames continuous on the time axis
  • the tracking track determination module includes:
  • the similarity matching unit is configured to traverse each video frame in sequence and, when traversing each video frame, perform the following operations: perform similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain a similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, update the reference tracking track to obtain the target tracking track corresponding to the current video frame; and determine the target tracking track as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the tracking track determination unit is configured to determine the target tracking track corresponding to the last video frame as the tracking track of the text in the video.
  • the similarity matching unit is configured to update the reference text area based on the current text area when the similarity matching result is smaller than a preset similarity threshold, and based on the The current fusion feature descriptor updates the reference fusion feature descriptor to obtain the target tracking trajectory.
  • the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • an electronic device including:
  • a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the processor is configured to execute the instructions to implement the following steps:
  • the image feature is decoded to obtain the character sequence feature.
  • the processor is configured to execute the instructions to implement the following steps:
  • the text area heat map represents the text area and non-text area in the feature map
  • Connected domain analysis is performed on the text area heat map to obtain the text area.
  • the processor is configured to execute the instructions to implement the following steps:
  • the processor is configured to execute the instructions to implement the following steps:
  • the character sequence feature, the position feature and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame;
  • the reference tracking track includes a reference text area
  • the plurality of video frames are continuous on the time axis
  • the processor is configured to execute the instructions to implement the following steps:
  • the current fusion feature descriptor corresponding to the traversed current video frame is matched for similarity against the reference fusion feature descriptor corresponding to the reference text area to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame; and the target tracking track is determined as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the video.
  • the processor is configured to execute the instructions to implement the following steps:
  • the processor is configured to execute the instructions to implement the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • a computer-readable storage medium is provided, and when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is made to perform the following steps:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • a computer program product including a computer program, and when the computer program is executed by a processor, the following steps are implemented:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • in the embodiments of the present disclosure, a plurality of video frames are extracted from the video, and the fusion feature descriptor corresponding to each video frame is obtained according to the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area;
  • the stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video.
  • since the determination of the tracking trajectory of the text in the video fully considers the character sequence features, position features, and image features of the text, the accuracy of text tracking is relatively high; in addition, since the disclosure updates the reference tracking trajectory according to the fusion feature descriptor to obtain the tracking track of the text in the video, it avoids processing each frame of image through multiple models (text detection, text recognition, and other models), reducing the consumption of computing resources during video text tracking.
  • Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment.
  • Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
  • Fig. 3 is a flow chart of determining character sequence features, location features and image features according to an exemplary embodiment.
  • Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment.
  • Fig. 5 is a schematic diagram of a character tracking network according to an exemplary embodiment.
  • Fig. 6 is a flow chart of obtaining fusion feature descriptors through a character tracking network according to an exemplary embodiment.
  • Fig. 7 is a flow chart of updating a reference tracking track to obtain a tracking track of text in a video according to an exemplary embodiment.
  • Fig. 8 is a block diagram of a video text tracking device according to an exemplary embodiment.
  • Fig. 9 is a block diagram showing an electronic device for video text tracking according to an exemplary embodiment.
  • the method, device, and storage medium involved in the present disclosure can obtain relevant information of the user.
  • the video is usually processed into multiple frames of images, and then text detection and text recognition are performed on each frame of images one by one to obtain video-level text tracking results.
  • as a result, the methods in the related art yield a lower accuracy rate for video text tracking.
  • in contrast, the embodiments of the present disclosure fully consider character sequence features, position features, and image features of the text, thereby improving the accuracy of text tracking.
  • Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes at least a terminal 01 and a server 02.
  • the terminal 01 is used to collect video to be processed.
  • the terminal 01 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle terminal, a smart TV, a smart voice interaction device, etc., but is not limited thereto.
  • the terminal 01 and the server 02 are directly or indirectly connected through wired or wireless communication, which is not limited in this disclosure.
  • the server 02 is used to receive the video sent by the terminal 01, extract a plurality of video frames from the video, and obtain the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area;
  • the fusion feature descriptor corresponding to each video frame is determined according to the above-mentioned character sequence feature, position feature, and image feature of each video frame, and the stored reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the above video.
  • the server 02 is an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
  • FIG. 1 is only an implementation environment of the video text tracking method provided by the embodiment of the present application, and other implementation environments may also be included in practical applications.
  • Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
  • the method is used in an implementation environment including the terminal and the server shown in Fig. 1: as described above, after the terminal collects the video to be processed, the server executes the video text tracking method on the video to be processed.
  • the method includes the following steps S11, S13, S15 and S17.
  • the plurality of video frames are consecutive video frames on the time axis. That is, the video includes multiple video frames arranged sequentially on the time axis, and the multiple video frames extracted by the server from the video are adjacent on the time axis. For example, the server selects a time period from the video and extracts the multiple video frames within that time period, that is, multiple consecutive video frames.
  • multiple video frames are extracted from the video in various ways, which are not specifically limited in the present disclosure.
  • FFmpeg is used to extract consecutive video frames from the video.
  • the video to be processed is divided into n temporally continuous image frames, denoted as Frame-1, ..., Frame-t, ..., Frame-n, where Frame-t represents the video frame whose temporal position in the video is t, and n is a positive integer greater than 1.
  • FFmpeg is a set of open-source computer programs that can record and convert digital audio and video and turn them into streams; it provides a complete solution for recording, converting, and streaming audio and video.
  • the video is processed into a plurality of continuous video frames on the time axis through other video stream decoding technologies.
  • the continuous video frames are Frame-1, ... Frame-t, ..., Frame-n
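  • As an illustrative sketch only (not part of the disclosure), the frame extraction described above can be reproduced with the FFmpeg command-line tool; the input path, output directory, and PNG format below are hypothetical choices.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str) -> None:
    """Decode a video into temporally ordered frames Frame-1 ... Frame-n."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # ffmpeg numbers the output images sequentially, preserving time-axis order.
    subprocess.run(
        ["ffmpeg", "-i", video_path, f"{out_dir}/Frame-%d.png"],
        check=True,
    )

extract_frames("input.mp4", "frames")  # hypothetical paths
```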
  • the character sequence features, position features and image features corresponding to each of the multiple consecutive video frames are obtained in parallel.
  • the text area where the text is located is a text area at the text line level; that is, the text area contains a line of text, and the characters in the line are ordered. Therefore, the character sequence feature represents the order of the text in the text area, and is obtained from a feature map intercepted from the corresponding video frame through the coordinates of the text.
  • the location feature of the text area where the text is located is the position coordinates of the text area.
  • the position coordinates of the text area are the coordinates of the upper left corner, the lower left corner, the upper right corner and the lower right corner of the text area.
  • the image feature corresponding to the text area is the feature of the image corresponding to the text area in the video frame.
  • Fig. 3 is a flowchart showing a method for determining character sequence features, location features and image features according to an exemplary embodiment.
  • the above-mentioned acquisition of the character sequence features of the text in each video frame, the positional features of the text area where the text is located, and the image features corresponding to the above text area include the following steps S1301, S1303, S1305 and S1307:
  • the above-mentioned text area is determined from each of the above-mentioned video frames.
  • the text area where the text in each video frame is located is determined from each video frame sequentially, according to the order of the video frames on the time axis. For example, if the continuous video frames are Frame-1, ..., Frame-t, ..., Frame-n, then the text area where the text in Frame-1 is located is determined from Frame-1, ..., the text area where the text in Frame-t is located is determined from Frame-t, ..., and the text area where the text in Frame-n is located is determined from Frame-n.
  • the text regions where the texts in the video frames are respectively located are determined from the video frames in parallel.
  • the text area is a text box corresponding to the text in the video frame.
  • the corresponding text area is determined from each video frame by a text frame recognition tool.
  • the corresponding text area is determined from each video frame through a text tracking network.
  • Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment. As shown in FIG. 4, in the above S1301, the above-mentioned determination of the above-mentioned text area from each of the above-mentioned video frames includes the following steps S13011, S13013 and S13015:
  • the above steps S13011 , S13013 and S13015 are executed sequentially for each video frame according to the order of each video frame on the time axis.
  • the above steps S13011, S13013 and S13015 are executed in parallel for multiple video frames.
  • Fig. 5 is a schematic diagram of a character tracking network according to an exemplary embodiment.
  • the text tracking network includes a detection branch network, which includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network; the feature detection sub-network further includes two convolutional layers and a global average pooling layer (global avg pooling).
  • the basic feature extraction is performed on each video frame through the feature extraction sub-network (for example, a classic convolutional network based on resnet18) to obtain the basic feature of each video frame, which can be understood as a feature map of size Batch*64*W*H.
  • the batch size is a hyperparameter used to define the number of samples to be processed before updating the internal model parameters
  • W is the width
  • H is the height.
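  • A minimal sketch of such a feature extraction sub-network, assuming a torchvision resnet18 backbone truncated after its first residual stage so the output keeps 64 channels; the truncation point and input size are assumptions, as the disclosure only names resnet18 as an example.

```python
import torch
from torch import nn
from torchvision.models import resnet18

resnet = resnet18(weights=None)
# Truncate after layer1 so the output keeps 64 channels, matching the
# Batch*64*W*H basic feature described above (spatial size is downsampled 4x).
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
)

frames = torch.randn(2, 3, 256, 448)   # a batch of 2 RGB video frames
basic_features = backbone(frames)
print(basic_features.shape)             # torch.Size([2, 64, 64, 112])
```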
  • the first feature fusion sub-network (for example, two stacked Feature Pyramid Enhancement Modules (FPEM)) performs multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to each video frame.
  • FPEM is used to enhance the features extracted by the convolutional neural network, in order to make the extracted features more robust.
  • through the two convolutional layers and the global average pooling layer in the feature detection sub-network, the feature map of each video frame is processed into a text area heat map that indicates whether each region is text or not.
  • the text area heat map includes two parts: one is the text area and the other is the non-text area.
  • the connected domain analysis is performed on the text area heat map to obtain the position of the text area, and then the position of the text area is fitted to a rotated rectangle to obtain the text area corresponding to each video frame.
  • the connected domain refers to the adjacent regions with the same pixel value in the image.
  • Connected domain analysis refers to finding and marking the connected domains in the image.
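  • A sketch of this post-processing step using OpenCV connected-component analysis and rotated-rectangle fitting; the single-channel probability heat map and the 0.5 binarization threshold are assumptions.

```python
import cv2
import numpy as np

def heatmap_to_text_boxes(heatmap: np.ndarray) -> list[np.ndarray]:
    """Binarize a text-area heat map, find connected domains, and fit each
    one with a rotated rectangle (box-1 ... box-m)."""
    binary = (heatmap > 0.5).astype(np.uint8)          # text vs. non-text
    num_labels, labels = cv2.connectedComponents(binary)
    boxes = []
    for label in range(1, num_labels):                 # label 0 is background
        ys, xs = np.where(labels == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(points)                 # fit a rotated rectangle
        boxes.append(cv2.boxPoints(rect))              # its 4 corner coordinates
    return boxes
```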
  • At least one text area can be detected in each video frame through the above method, denoted as box-1 ... box-m, for a total of m text areas (m is a positive integer greater than or equal to 1):
  • Frame-1 detects multiple boxes
  • ... Frame-t detects multiple boxes
  • Frame-n detects multiple boxes.
  • the text area can be accurately determined from each video frame, and the determination process of the text area does not need to rely on multiple models (text detection, text recognition, other models, etc.), which can reduce the consumption of computing resources during the calculation process of the text area.
  • the position information of the text area in each video frame is sequentially encoded.
  • the location information of text regions in multiple video frames is encoded in parallel.
  • position encoding is performed on the position information of the text area (that is, the coordinates of the upper left, lower left, upper right, and lower right corners of the text area) to obtain the position feature of the text area; the position feature is 1*128-dimensional.
  • the position encoding includes but not limited to: cosine position encoding (cos encoding), sine position encoding (sin encoding) and the like.
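  • A minimal sketch of sinusoidal position encoding over the four corner coordinates: encoding 8 scalar coordinates with 8 sin/cos frequency pairs each yields the 1*128-dimensional position feature mentioned above. The frequency schedule is an assumption.

```python
import numpy as np

def encode_box_position(corners: np.ndarray) -> np.ndarray:
    """corners: (4, 2) array of (x, y) for the upper-left, lower-left,
    upper-right and lower-right corners. Returns a 1*128 position feature."""
    coords = corners.reshape(-1)                       # 8 scalar coordinates
    freqs = 1.0 / (10000.0 ** (np.arange(8) / 8.0))    # 8 frequencies (assumed)
    angles = coords[:, None] * freqs[None, :]          # (8, 8)
    # sin and cos per coordinate/frequency pair: 8 * 8 * 2 = 128 dimensions
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).reshape(1, 128)

box = np.array([[10, 20], [10, 60], [200, 20], [200, 60]], dtype=np.float32)
print(encode_box_position(box).shape)   # (1, 128)
```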
  • since each video frame includes at least one text area, for the text areas belonging to the same video frame, the location features corresponding to the text areas are determined sequentially according to the order of the text areas in the video frame. For example, if the text areas of a certain video frame are box-1 ... box-m, the location feature of box-1 is determined first, ..., and the location feature of box-m is determined last.
  • the image features corresponding to each text area are extracted from each video frame according to the position information of the text area.
  • the image features of the respective corresponding text regions are extracted from each video frame according to the position information of the text regions in each video frame.
  • the image features of the corresponding text regions are extracted from the multiple video frames in parallel.
  • since each video frame includes at least one text area, for the text areas belonging to the same video frame, the image features corresponding to the text areas are determined sequentially according to the order of the text areas in the video frame.
  • the above-mentioned image features are extracted from each of the above-mentioned video frames based on the position information of the above-mentioned text area, including:
  • the position mapping result is the mapping area corresponding to the text area.
  • the position information of the text area of each video frame is mapped onto the feature map corresponding to that video frame to obtain the position mapping result (i.e., a mapping area) of the text area; the image located in the position mapping result is intercepted from the feature map of each video frame to obtain the image feature corresponding to the text area of each video frame.
  • the above location features and image features are extracted through the above text tracking network.
  • the above-mentioned text tracking network also includes a recognition branch network, through which the corresponding location feature and image feature are extracted from each video frame according to the location information of the text region.
  • the determination process of image features does not need to rely on multiple models (text detection, text recognition, other models, etc.), which reduces the consumption of computing resources in the calculation process of image features.
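  • One way to realize this mapping and interception, sketched with torchvision's roi_align; treating the mapping as an axis-aligned RoIAlign crop at the feature map's 4x downsampling scale is an assumption, since the disclosure only states that the position information is mapped onto the feature map and the covered image is intercepted.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 64, 64, 112)      # Batch*64*H*W feature map
# One text box per row: (batch_index, x1, y1, x2, y2) in frame coordinates.
boxes = torch.tensor([[0, 40.0, 80.0, 800.0, 240.0]])
# spatial_scale maps frame coordinates onto the 4x-downsampled feature map.
image_feature = roi_align(feature_map, boxes, output_size=(8, 32), spatial_scale=0.25)
print(image_feature.shape)                      # torch.Size([1, 64, 8, 32])
```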
  • each text area of each video frame is a text area at the text line level; that is, a line of text is included in the text area, and each character has a corresponding position and order. Therefore, after the image features corresponding to the text areas of each video frame are obtained, the image features are decoded to obtain the character sequence features corresponding to the text areas of each video frame.
  • the image features corresponding to each text area in each video frame are decoded sequentially to obtain the text sequence corresponding to each text area in each video frame feature.
  • the image features corresponding to the text regions corresponding to the plurality of video frames are decoded in parallel to obtain the character sequence features corresponding to the text regions in each video frame.
  • the above-mentioned image features are decoded through the above-mentioned character tracking network to obtain the above-mentioned character sequence features.
  • the above-mentioned text tracking network also includes a recognition branch network, which decodes the image features intercepted for each text region of each video frame (for example, by connectionist temporal classification decoding (CTC decoding)), so as to obtain the character sequence features of each text area of each video frame.
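  • A sketch of greedy CTC decoding (take the best class per time step, collapse repeats, drop blanks), assuming the recognition branch outputs per-time-step class probabilities; the tiny alphabet is hypothetical.

```python
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """probs: (T, C) per-time-step class probabilities over blank + alphabet."""
    best = probs.argmax(axis=1)                # best class per time step
    decoded, prev = [], blank
    for label in best:
        if label != blank and label != prev:   # collapse repeats, skip blanks
            decoded.append(alphabet[label - 1])
        prev = label
    return "".join(decoded)

# 5 time steps over blank + "ab": reads "ab" after collapsing.
probs = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(probs, "ab"))   # -> "ab"
```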
  • the above-mentioned text area is determined from each video frame, and the position feature, image feature, and character sequence feature of the text area are determined according to the position information of the text area, which can improve the determination accuracy of the position feature, image feature, and character sequence feature, thereby improving the determination accuracy of the fusion feature descriptor and, in turn, the accuracy of video text tracking; in addition, the determination process of the position features, image features, and character sequence features does not need to rely on multiple models (text detection, text recognition, and other models), which reduces the consumption of computing resources in determining these features.
  • the character sequence features, position features, and image features corresponding to the text areas in each video frame are fused sequentially to obtain the fusion feature descriptors corresponding to the text areas in each video frame.
  • the text sequence features corresponding to each text area in multiple video frames, the above-mentioned position features, and the above-mentioned image features are fused in parallel to obtain a fusion feature descriptor corresponding to each text area in each video frame .
  • the features of each video frame are fused in various ways to obtain the fusion feature descriptor corresponding to each video frame, which is not specifically limited in the present disclosure.
  • the above-mentioned three features are fused through the above-mentioned character tracking network to obtain a fused feature descriptor corresponding to each video frame.
  • the above text tracking network further includes a multi-information fusion descriptor branch network
  • the multi-information fusion descriptor branch network further includes a second feature fusion sub-network and a feature size adjustment sub-network.
  • FIG. 6 is a flow chart of obtaining the fusion feature descriptor through the text tracking network according to an exemplary embodiment.
  • the above-mentioned fusion feature descriptor corresponding to each video frame is obtained according to the above-mentioned character sequence feature, the above-mentioned position feature and the above-mentioned image feature of each video frame, including the following steps S1501 and S1503:
  • the above-mentioned character sequence feature, the above-mentioned position feature and the above-mentioned image feature are concatenated to obtain the fusion feature corresponding to the video frame.
  • the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the character sequence features, position features, and image features corresponding to the text areas of each video frame are spliced to obtain the fusion feature corresponding to each text area of each video frame.
  • the dimension of the fusion feature is 3*128 dimensions.
  • the output size of the fusion feature corresponding to each text region of each video frame is adjusted by a feature size adjustment sub-network (for example, two multilayer perceptron (MLP) layers), for example from 3*128 dimensions to 1*128 dimensions, to obtain the fusion feature descriptor corresponding to each text area of each video frame.
  • the multilayer perceptron is a feedforward artificial neural network that includes an input layer, an output layer, and multiple hidden layers.
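  • A minimal sketch of the splicing and size adjustment, assuming each of the three features has already been reduced to 1*128 dimensions; the hidden width of the two perceptron layers is an assumption.

```python
import torch
from torch import nn

class FusionDescriptorHead(nn.Module):
    """Splice character sequence, position and image features (3*128),
    then adjust the size to a 1*128 fusion feature descriptor."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),   # two MLP layers (assumed widths)
            nn.Linear(dim, dim),
        )

    def forward(self, seq_feat, pos_feat, img_feat):
        fused = torch.cat([seq_feat, pos_feat, img_feat], dim=-1)  # 3*128 -> 384
        return self.mlp(fused)                                     # -> 1*128

head = FusionDescriptorHead()
f = torch.randn(1, 128)
print(head(f, f, f).shape)   # torch.Size([1, 128])
```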
  • the multi-information fusion descriptor branch network also includes a position feature extraction sub-network, an image feature extraction sub-network, and a character sequence feature extraction sub-network; the character sequence features are extracted through the character sequence feature extraction sub-network, the location features through the location feature extraction sub-network, and the image features through the image feature extraction sub-network.
  • feature extraction, feature recognition, and feature fusion are performed through the above-mentioned end-to-end text tracking network to obtain the fusion feature descriptor, so that the tracking track of the video text is obtained through the fusion feature descriptor and the video text tracking trajectory is acquired through a single model; the determination accuracy and robustness of the video text tracking trajectory are high, and the computing resource consumption is small.
  • each video frame corresponds to at least one text region
  • the fusion feature descriptors corresponding to the text regions in each video frame are determined sequentially according to the order of the text regions in the video frame. For example, if the text areas of a certain video frame are box-1 ... box-m, the fusion feature descriptor corresponding to box-1 is determined first, ..., and the fusion feature descriptor corresponding to box-m is determined last.
  • the fusion feature is obtained by concatenating the character sequence features, position features, and image features of each text area of each video frame, and the size of the fusion feature is adjusted to obtain the fusion feature descriptor, so that the determination of the fusion feature descriptor fully considers the character sequence features, position features, and image features of the text; the determination accuracy of the fusion feature descriptor is therefore high, which improves the accuracy of text tracking. In addition, the determination of the fusion feature descriptor does not need to rely on multiple models (text detection, text recognition, and other models), reducing the consumption of computing resources in determining the fusion feature descriptor.
  • the stored reference tracking track is updated to obtain the tracking track of the text in the video; the tracking track is used to represent the position information of the text.
  • a reference tracking track used to characterize the position information of the text is set, and after obtaining the fusion feature descriptor of the text area of each video frame, the fusion feature descriptor of the text area in each video frame is used , update the reference fusion feature descriptor in the reference tracking track, and obtain the tracking track of the text in the video.
  • the above update includes but not limited to: modification, replacement, addition, deletion and so on.
  • Fig. 7 is a flow chart of updating the reference tracking track to obtain the tracking track of the text in the video according to an exemplary embodiment.
  • the reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame above, and the tracking track of the text in the above video is obtained, including the following steps S1701 and S1703:
  • the target tracking track corresponding to the last video frame is used as the tracking track of the text in the above-mentioned video.
  • the video frames and the text in the video frames are sequential, so the updates are performed sequentially.
  • the continuous video frames are Frame-1, ... Frame-t, ..., Frame-n, and each video frame corresponds to a plurality of text areas box-1, ... box-t, ... box-m.
  • Each text region corresponds to a fused feature descriptor.
  • the reference tracking trajectory is empty during initialization.
  • when Frame-1 is the current video frame, since the reference tracking track is empty, the multiple text areas box-1, ..., box-t, ..., box-m of Frame-1 and their respective fusion feature descriptors are directly initialized as the reference tracking trajectory; target tracking trajectory 1 corresponding to Frame-1 is thus obtained, and target tracking trajectory 1 is determined as the updated reference tracking trajectory.
  • when Frame-2 is the current video frame, since the multiple text regions of Frame-1 and their corresponding fusion feature descriptors have already been stored in tracking track 1, the similarities between the fusion feature descriptors corresponding to the multiple text regions of Frame-2 and the existing fusion feature descriptors of tracking track 1 are calculated to obtain the similarity matching result corresponding to Frame-2.
  • based on the similarity matching result, the current fusion feature descriptors corresponding to the multiple text regions of Frame-2, and the multiple text regions of Frame-2, tracking track 1 is updated to obtain target tracking trajectory 2 corresponding to Frame-2, and target tracking trajectory 2 is determined as the updated reference tracking trajectory.
  • the fusion feature descriptors of the text regions of each video frame are sequentially matched against the reference fusion feature descriptors in the reference tracking trajectory, and the reference fusion feature descriptors in the reference tracking trajectory are sequentially updated according to the similarity results of each video frame, so that the final tracking trajectory fully considers the character sequence features, position features, and image features of the text, and the accuracy of text tracking is high; in addition, the determination of the tracking trajectory is an end-to-end process (for example, text tracking is achieved using an end-to-end text tracking network), which does not need to rely on multiple models (text detection, text recognition, and other models), reducing the computing resources consumed in determining the tracking trajectory.
  • performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor in the reference tracking track to obtain the similarity matching result corresponding to the current video frame includes:
  • calculating a similarity confusion matrix from the current fusion feature descriptor and the reference fusion feature descriptor to obtain the similarity matching result.
  • the similarity confusion matrix is calculated to obtain the similarities, and the matching with the highest similarity is then selected through the Hungarian algorithm to obtain the similarity matching result.
  • each video frame corresponds to multiple text areas box-1, ... box-t, ... box-m.
  • Each text region corresponds to a fused feature descriptor.
  • the reference tracking trajectory is empty during initialization.
  • when Frame-1 is the current video frame, since the reference tracking track is empty, the multiple text areas box-1, box-2, and box-3 of Frame-1 and their corresponding fusion feature descriptors are directly added to the reference tracking trajectory to obtain target tracking trajectory 1 corresponding to Frame-1, and target tracking trajectory 1 is determined as the updated reference tracking trajectory.
  • when Frame-2 is the current video frame, assuming that Frame-2 includes box-1 and box-2, the similarity confusion matrix is calculated between the fusion feature descriptors of box-1 and box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1. This yields the similarities between the fusion feature descriptor of box-1 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, 3 similarities for box-1 of Frame-2), and the similarities between the fusion feature descriptor of box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, 3 similarities for box-2 of Frame-2).
  • the Hungarian matching algorithm is used to select the highest of the three similarities of box-1 of Frame-2 as the similarity matching result of box-1 of Frame-2, and likewise to select the highest of the three similarities of box-2 of Frame-2 as the similarity matching result of box-2 of Frame-2.
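  • A sketch of the similarity confusion matrix plus Hungarian matching using scipy's linear_sum_assignment (which minimizes cost, so the negated similarity matrix is passed); cosine similarity is an assumption, as the disclosure does not name the similarity measure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_descriptors(current: np.ndarray, reference: np.ndarray):
    """current: (m, 128) descriptors of the current frame's text areas;
    reference: (k, 128) descriptors stored in the reference tracking track.
    Returns (current, reference) index pairs and the confusion matrix."""
    cur = current / np.linalg.norm(current, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = cur @ ref.T                         # (m, k) confusion matrix
    rows, cols = linear_sum_assignment(-similarity)  # best one-to-one pairing
    return list(zip(rows, cols)), similarity

pairs, sim = match_descriptors(np.random.rand(2, 128), np.random.rand(3, 128))
print(pairs)
```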
  • updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame includes:
  • updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame further includes:
  • adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
  • the text area in the current video frame may be a part of the existing track, or it may be a new text area different from the existing track.
  • the similarity matching result corresponding to the current video frame is compared with a preset similarity threshold.
  • if the similarity matching result is smaller than the preset similarity threshold, the reference fusion feature descriptor is updated with the current fusion feature descriptor, so that the current text region and the current fusion feature descriptor are updated into the existing trajectory; the text region belonging to the existing trajectory is thereby accurately determined, improving the determination precision of the tracking trajectory of the text in the video.
  • if the similarity matching result corresponding to the current video frame is greater than or equal to the preset similarity threshold, it indicates that the current text area is not part of the existing track but is new text; the current text area and the current fusion feature descriptor are then added to the reference tracking trajectory to update it, thereby accurately determining the text area that does not belong to the existing trajectory and further improving the determination accuracy of the text tracking trajectory in the video.
  • through the similarity confusion matrix, the similarities between the fusion feature descriptor of box-1 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 are obtained (that is, box-1 of Frame-2 corresponds to 3 similarities), as are the similarities between the fusion feature descriptor of box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, box-2 of Frame-2 corresponds to 3 similarities).
  • the highest similarity selected by the Hungarian matching algorithm from the three similarities corresponding to box-1 of Frame-2 is the similarity obtained by matching against box-1 in tracking track 1, and this similarity is greater than or equal to the preset similarity threshold, indicating that box-1 of Frame-2 is part of the existing track (box-1 in tracking track 1);
  • the current fusion feature descriptor of box-1 of Frame-2 is therefore used to update the reference fusion feature descriptor of box-1 in tracking trajectory 1, so that box-1 of Frame-2 and its corresponding fusion feature descriptor are updated into tracking track 1.
  • the highest similarity selected by the Hungarian matching algorithm from the three similarities corresponding to box-2 of Frame-2 is the similarity obtained by matching against box-2 in tracking track 1, and this similarity is less than the preset similarity threshold, indicating that box-2 of Frame-2 is not part of the existing trajectory but is new text; box-2 of Frame-2 and its corresponding fusion feature descriptor are therefore initialized into tracking trajectory 1.
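  • A sketch of the update step following the worked example above, where a matched similarity at or above the threshold refreshes an existing track entry and anything unmatched starts a new entry; the track data structure and threshold value are hypothetical, and note that the claim language earlier in this document states the threshold comparison in the opposite sense.

```python
SIM_THRESHOLD = 0.7  # preset similarity threshold (assumed value)

def update_reference_track(track, boxes, descriptors, pairs, similarity):
    """track: list of {'box': ..., 'descriptor': ...} entries (hypothetical
    structure); pairs and similarity come from match_descriptors above."""
    matched = set()
    for cur_idx, ref_idx in pairs:
        if similarity[cur_idx, ref_idx] >= SIM_THRESHOLD:
            # Part of an existing trajectory: refresh its region and descriptor.
            track[ref_idx]["box"] = boxes[cur_idx]
            track[ref_idx]["descriptor"] = descriptors[cur_idx]
            matched.add(cur_idx)
    for cur_idx in range(len(boxes)):
        if cur_idx not in matched:
            # New text: initialize it into the reference tracking trajectory.
            track.append({"box": boxes[cur_idx], "descriptor": descriptors[cur_idx]})
    return track  # the target tracking track, i.e. the updated reference track
```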
  • Fig. 8 is a block diagram of a video text tracking device according to an exemplary embodiment.
  • the device includes an extraction module 21 , a feature acquisition module 23 , a descriptor acquisition module 25 and a track determination module 27 .
  • the extraction module 21 is configured to extract a plurality of video frames from the video
  • the feature acquisition module 23 is configured to acquire the character sequence feature of the text in each video frame, the position feature of the text area where the above text is located, and the image feature corresponding to the above text area;
  • the descriptor acquisition module 25 is configured to obtain the fusion feature descriptor corresponding to each of the video frames according to the above-mentioned character sequence features, the above-mentioned position features, and the above-mentioned image features of each of the above-mentioned video frames;
  • the tracking trajectory determination module 27 is configured to update the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to characterize the position information of the text in the video.
  • the feature acquisition module 23 includes:
  • a text area determination unit configured to, for each video frame, determine the text area from the video frame
  • the location feature acquisition unit is configured to encode the location information of the above-mentioned text area to obtain the above-mentioned location feature;
  • the image feature extraction unit is configured to extract the above image features from the above video frame based on the position information of the above text area;
  • the character sequence feature acquisition unit is configured to decode the above-mentioned image feature to obtain the above-mentioned character sequence feature.
  • the above text area determination unit includes:
  • the feature map determination subunit is configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
  • the text area heat map determination subunit is configured to determine the text area heat map corresponding to the feature map; the text area heat map represents the text area and non-text area in the feature map;
  • the connected domain analysis subunit is configured to perform connected domain analysis on the heat map of the text area to obtain the text area.
  • the above-mentioned image feature extraction unit includes:
  • the mapping subunit is configured to map the position information of the above-mentioned text area to the above-mentioned feature map, and obtain the mapping area corresponding to the above-mentioned text area;
  • the image feature extraction subunit is configured to extract the image located in the above-mentioned mapping area from the above-mentioned feature map to obtain the above-mentioned image features.
  • the above-mentioned descriptor acquisition module 25 includes:
  • the splicing unit is configured to, for each video frame, splice the above-mentioned text sequence feature, the above-mentioned position feature, and the above-mentioned image feature of the above-mentioned video frame to obtain the fusion feature corresponding to the above-mentioned video frame;
  • the adjustment unit is configured to adjust the size of the above-mentioned fusion feature to obtain the above-mentioned fusion feature descriptor.
  • the above-mentioned reference tracking track includes a reference text area
  • the above-mentioned multiple video frames are multiple video frames continuous on the time axis
  • the above-mentioned tracking track determination module 27 includes:
  • the similarity matching unit is configured to traverse each of the above video frames in sequence and, when traversing each video frame, perform the following operations: perform similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, update the reference tracking trajectory to obtain the target tracking trajectory corresponding to the current video frame; and determine the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis;
  • the tracking track determination unit is configured to determine the target tracking track corresponding to the last video frame as the tracking track of the text in the video.
  • the similarity matching unit is configured to, when the similarity matching result is smaller than the preset similarity threshold, update the reference text area based on the current text area and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
  • the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
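Read together, the two branches above suggest that the similarity matching result behaves like a distance (smaller means a closer match). A hedged sketch of one traversal step under that reading follows; the cosine-distance metric and the threshold value are assumptions, as the disclosure does not name a specific metric.

```python
import numpy as np

def traversal_step(track: dict, cur_box, cur_desc: np.ndarray, thresh: float = 0.3) -> dict:
    """Update the reference tracking trajectory with one video frame's result.

    track holds parallel lists "boxes" and "descriptors"; the last entries play
    the role of the reference text area and reference fusion feature descriptor.
    """
    ref_desc = track["descriptors"][-1]
    cos = float(cur_desc @ ref_desc) / (np.linalg.norm(cur_desc) * np.linalg.norm(ref_desc) + 1e-8)
    result = 1.0 - cos  # distance-style similarity matching result
    if result < thresh:   # close match: refresh the reference entry
        track["boxes"][-1] = cur_box
        track["descriptors"][-1] = cur_desc
    else:                 # otherwise: append as a new entry of the trajectory
        track["boxes"].append(cur_box)
        track["descriptors"].append(cur_desc)
    return track
```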
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character region where the characters are located, and the corresponding image features of the character region;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
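The following PyTorch sketch illustrates one plausible shape for such a detection branch: per-frame backbone features are fused across scales and reduced to a single-channel text/non-text heat map. The channel counts and the bilinear fusion scheme are assumptions for illustration only, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionBranch(nn.Module):
    """Backbone features -> multi-scale fusion -> text-region heat map (toy version)."""

    def __init__(self, in_channels=(64, 128, 256), fused_channels: int = 64):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, fused_channels, 1) for c in in_channels)
        self.head = nn.Conv2d(fused_channels, 1, 3, padding=1)

    def forward(self, feats):
        # feats: list of backbone maps ordered finest-first; fuse at the finest scale.
        size = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(lat(f), size=size, mode="bilinear", align_corners=False)
            for lat, f in zip(self.laterals, feats)
        )
        return torch.sigmoid(self.head(fused))  # per-pixel text probability
```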
  • an electronic device including a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of each video frame;
  • the stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps: for each video frame, the character sequence feature, the position feature, and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame; the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the above-mentioned reference tracking trajectory includes a reference text area
  • the above-mentioned multiple video frames are continuous on the time axis
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • the electronic device is a terminal, a server or a similar computing device.
  • FIG. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment.
  • the electronic device 30 may vary considerably due to differences in configuration or performance, and may include one or more Central Processing Units (CPUs) 31 (the central processing unit 31 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like), a memory 33 for storing data, and one or more storage media 32 for storing application programs 323 or data 322 (for example, one or more mass storage devices).
  • the memory 33 and the storage medium 32 may be temporary storage or persistent storage.
  • the program stored in the storage medium 32 may include one or more modules, and each module may include a series of instructions to operate on the electronic device. Further, the central processing unit 31 may be configured to communicate with the storage medium 32 , and execute a series of instruction operations in the storage medium 32 on the electronic device 30 .
  • the electronic device 30 can also include one or more power supplies 36, one or more wired or wireless network interfaces 35, one or more input and output interfaces 34, and/or, one or more operating systems 321, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the input/output interface 34 may be used to receive or send data via a network.
  • a specific example of the above network may include a wireless network provided by the communication provider of the electronic device 30.
  • in one example, the input/output interface 34 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • in another example, the input/output interface 34 may be a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • FIG. 9 is only for illustration, and it does not limit the structure of the above-mentioned electronic device.
  • electronic device 30 may also include more or fewer components than shown in FIG. 9 , or have a different configuration than that shown in FIG. 9 .
  • a computer-readable storage medium is also provided, and when instructions in the computer-readable storage medium are executed by a processor of the electronic device, the electronic device is made to perform the following steps:
  • based on the fusion feature descriptor corresponding to each video frame, the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps: for each video frame, the character sequence feature, the position feature, and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame; the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the above-mentioned reference tracking trajectory includes a reference text area, and the above-mentioned multiple video frames are continuous on the time axis; when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • a computer program product including a computer program, which implements the following steps when the computer program is executed by a processor:
  • the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of each video frame;
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the computer program implements the following steps when executed by a processor:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • the computer program implements the following steps when executed by a processor:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • the computer program implements the following steps when executed by a processor: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • the computer program implements the following steps when executed by a processor:
  • the above-mentioned reference tracking track includes a reference text area, and the above-mentioned multiple video frames are continuous on the time axis.
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • the computer program implements the following steps when executed by a processor:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • the computer program implements the following steps when executed by the processor:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

Abstract

The present invention relates to a video text tracking method and an electronic device. The method comprises: extracting a plurality of video frames from a video (S11); acquiring a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region (S13); respectively determining, according to the text sequence feature, the position feature and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame (S15); and updating a stored reference tracking trajectory on the basis of the fusion feature descriptor corresponding to each video frame, so as to obtain a tracking trajectory of the text in the video, the tracking trajectory being used for representing position information of the text in the video (S17). The present invention can improve the accuracy of video text tracking trajectory determination.

Description

Video text tracking method and electronic device

This disclosure is based on, and claims priority to, the Chinese patent application with application date December 24, 2021 and application number 202111601711.0, the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a video text tracking method and an electronic device.

Background

A video contains a large amount of text, which can serve as an objective description of the targets in the video and a subjective summary of the scene. The text in a video carries information such as start and end times, position changes, and text content; correctly tracking the text in a video is a key step in video understanding.

Summary

The present disclosure provides a video text tracking method and an electronic device, which achieve a high text tracking accuracy and reduce the consumption of computing resources in the video text tracking process. The technical scheme of the present disclosure is as follows:
According to an aspect of the embodiments of the present disclosure, a video text tracking method is provided, including:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area includes:
for each video frame, determining the text area from the video frame;
encoding position information of the text area to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text area; and
decoding the image feature to obtain the character sequence feature.
In some embodiments, determining the text area from the video frame includes:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
performing connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, extracting the image feature from the video frame based on the position information of the text area includes:
mapping the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
extracting the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame includes:
for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are continuous on the time axis, and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video includes:
traversing each video frame in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain the target tracking trajectory corresponding to the current video frame includes:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area, and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain the target tracking trajectory corresponding to the current video frame includes:
when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
The method is executed based on a text tracking network, the text tracking network including a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, a video text tracking apparatus is provided, including:
an extraction module configured to extract a plurality of video frames from a video;
a feature acquisition module configured to acquire a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
a descriptor acquisition module configured to obtain a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
a tracking trajectory determination module configured to update a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, the feature acquisition module includes:
a text area determination unit configured to, for each video frame, determine the text area from the video frame;
a position feature acquisition unit configured to encode position information of the text area to obtain the position feature;
an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text area; and
a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
In some embodiments, the text area determination unit includes:
a feature map determination subunit configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
a text area heat map determination subunit configured to determine a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
a connected domain analysis subunit configured to perform connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, the image feature extraction unit includes:
a mapping subunit configured to map the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
an image feature extraction subunit configured to extract the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, the descriptor acquisition module includes:
a splicing unit configured to, for each video frame, splice the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are a plurality of video frames continuous on the time axis, and the tracking trajectory determination module includes:
a similarity matching unit configured to traverse each video frame in order and, when traversing each video frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
a tracking trajectory determination unit configured to determine the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text area based on the current text area and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, determining the text area from the video frame;
encoding position information of the text area to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text area; and
decoding the image feature to obtain the character sequence feature.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
performing connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
mapping the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
extracting the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are continuous on the time axis, and the processor is configured to execute the instructions to implement the following steps:
traversing each video frame in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following step:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area, and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the processor is configured to execute the instructions to implement the following step:
when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
According to still another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program; when the computer program is executed by a processor, the following steps are implemented:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In the embodiments of the present disclosure, a plurality of video frames are extracted from a video; a fusion feature descriptor corresponding to each video frame is obtained according to the character sequence feature of the text in the video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and a stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video. Because the determination of the tracking trajectory of the text in the video fully considers the character sequence feature, position feature, and image feature of the text, the accuracy of text tracking is high. In addition, because the present disclosure obtains the tracking trajectory of the text in the video once the reference tracking trajectory has been updated according to the fusion feature descriptors, further processing of every frame by multiple models (text detection, text recognition, other models, and so on) is avoided, reducing the consumption of computing resources in the video text tracking process.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment.
Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
Fig. 3 is a flow chart of determining character sequence features, position features, and image features according to an exemplary embodiment.
Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a text tracking network according to an exemplary embodiment.
Fig. 6 is a flow chart of obtaining fusion feature descriptors through a text tracking network according to an exemplary embodiment.
Fig. 7 is a flow chart of updating a reference tracking trajectory to obtain the tracking trajectory of the text in a video according to an exemplary embodiment.
Fig. 8 is a block diagram of a video text tracking apparatus according to an exemplary embodiment.
Fig. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment.
Detailed Description
The acquisition of user information and information related to user accounts (including social relationship identity information and the like) described in the embodiments of the present disclosure has been authorized by the users; on the premise of obtaining the users' authorization, the methods, apparatuses, devices, and storage media involved in the present disclosure may acquire the relevant user information.
In the related art, a video is usually processed into multiple frames of images, and text detection and text recognition are then performed on each frame one by one to obtain a video-level text tracking result. However, the methods in the related art yield a low accuracy rate for video text tracking. The embodiments of the present disclosure, by contrast, fully consider the character sequence features, position features, and image features of the text, improving the accuracy of text tracking.
Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes at least a terminal 01 and a server 02.
The terminal 01 is used to collect the video to be processed. In some embodiments, the terminal 01 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart TV, a smart voice interaction device, or the like, but is not limited thereto. The terminal 01 and the server 02 are connected directly or indirectly through wired or wireless communication, which is not limited in the present disclosure.
The server 02 is used to receive the video sent by the terminal 01, extract a plurality of video frames from the video, acquire the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area, determine the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame, and update the reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video. In some embodiments, the server 02 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
It should be noted that Fig. 1 is only one implementation environment of the video text tracking method provided by the embodiments of the present application; in practical applications, other implementation environments, such as an environment containing only a terminal, may also be used.
Fig. 2 is a flowchart of a video text tracking method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the implementation environment of Fig. 1 that includes a terminal and a server: after the terminal captures the video to be processed, the server executes the video text tracking method on that video. The method includes the following steps S11, S13, S15, and S17.
In S11, multiple video frames are extracted from the video.
In some embodiments, the multiple video frames are consecutive on the time axis. That is, the video includes multiple video frames arranged in chronological order on the time axis, and the frames extracted by the server are adjacent on that axis. For example, the server selects a time period in the video and extracts the video frames within that period, which form a run of consecutive frames.
In the embodiments of the present disclosure, the multiple video frames may be extracted from the video in various ways, which are not specifically limited herein.
In some embodiments, FFmpeg is used to decompose the video into consecutive video frames. After frame extraction, the video to be processed is divided into n temporally consecutive image frames, denoted Frame-1, ..., Frame-t, ..., Frame-n, where Frame-t denotes the video frame at time position t in the video and n is a positive integer greater than 1. FFmpeg is an open-source suite of computer programs for recording and converting digital video and turning it into streams; it provides a complete solution for recording, converting, and streaming audio and video.
In other embodiments, the video is processed into multiple consecutive video frames on the time axis through other video stream decoding techniques.
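As an illustration of this step, the following is a minimal sketch of frame extraction with the FFmpeg command-line tool; the output file names and the fixed sampling rate are assumptions made for the example, not details fixed by the present disclosure.

```python
# A minimal sketch of S11: decode a video into temporally consecutive
# frames Frame-1 ... Frame-n using the FFmpeg command-line tool.
import os
import subprocess

def extract_frames(video_path: str, out_dir: str, fps: int = 25) -> None:
    """Write the video's frames as numbered PNG files (assumed layout)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,                          # video to be processed
            "-vf", f"fps={fps}",                       # sample at a fixed rate
            os.path.join(out_dir, "frame-%05d.png"),   # Frame-t -> frame-00001.png, ...
        ],
        check=True,
    )

extract_frames("input.mp4", "frames")
```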
In S13, the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region are acquired.
In some embodiments, after the consecutive video frames are extracted, the character sequence feature, position feature, and image feature of the text in each video frame are acquired frame by frame in the order of the frames on the time axis. For example, for consecutive video frames Frame-1, ..., Frame-t, ..., Frame-n, the character sequence feature, position feature, and image feature of Frame-1 are acquired first, ..., then those of Frame-t, ..., and finally those of Frame-n.
In other embodiments, after the consecutive video frames are extracted, the character sequence features, position features, and image features of the frames are acquired in parallel.
In some embodiments, the text region where the text is located is a text-line-level region, that is, the region contains one line of text, and the characters in that line are ordered. The character sequence feature therefore represents the order of the characters in the text region. Illustratively, the character sequence feature is a feature map cropped from the corresponding video frame using the coordinates of the text.
In some embodiments, the position feature of the text region where the text is located is the position coordinates of the text region. In an exemplary implementation, the position coordinates of the text region are the coordinates of its upper-left, lower-left, upper-right, and lower-right corners.
In some embodiments, the image feature corresponding to the text region is the feature of the image corresponding to that region in the video frame.
In some embodiments, Fig. 3 is a flowchart of determining the character sequence feature, the position feature, and the image feature according to an exemplary embodiment. As shown in Fig. 3, in S13, acquiring the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region includes the following steps S1301, S1303, S1305, and S1307:
In S1301, the text region is determined from each video frame.
In some embodiments, after the consecutive video frames are extracted, the text region where the text in each video frame is located is determined frame by frame in the order of the frames on the time axis. For example, for consecutive video frames Frame-1, ..., Frame-t, ..., Frame-n, the text region of Frame-1 is determined from Frame-1, ..., the text region of Frame-t from Frame-t, ..., and the text region of Frame-n from Frame-n.
In other embodiments, after the consecutive video frames are extracted, the text regions of the frames are determined in parallel.
Illustratively, the text region is the text box corresponding to the text in the video frame.
In the embodiments of the present disclosure, the text region may be determined in various ways, which are not specifically limited herein.
In some embodiments, the corresponding text region is determined from each video frame by a text box recognition tool.
In other embodiments, the corresponding text region is determined from each video frame through a text tracking network. Fig. 4 is a flowchart of determining the corresponding text region from each video frame through the text tracking network according to an exemplary embodiment. As shown in Fig. 4, in S1301, determining the text region from each video frame includes the following steps S13011, S13013, and S13015:
In S13011, feature extraction is performed on each video frame to obtain the feature map corresponding to that frame.
In S13013, a text region heat map corresponding to the feature map is determined; the text region heat map distinguishes text regions from non-text regions in the feature map.
In S13015, connected component analysis is performed on the text region heat map to obtain the text region.
In some embodiments, steps S13011, S13013, and S13015 are executed for each video frame in turn, in the order of the frames on the time axis.
In other embodiments, steps S13011, S13013, and S13015 are executed for multiple video frames in parallel.
Fig. 5 is a schematic diagram of a text tracking network according to an exemplary embodiment. As shown in Fig. 5, the text tracking network includes a detection branch network, which in turn includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network; the feature detection sub-network further includes two convolutional layers and a global average pooling layer.
Illustratively, in S13011, basic feature extraction is performed on each video frame through the feature extraction sub-network (for example, a classic convolutional network based on ResNet-18) to obtain the basic feature of the frame, which can be understood as a feature map of size Batch*64*H*W, where the batch size is a hyperparameter defining the number of samples processed before the internal model parameters are updated, H is the height, and W is the width. The extracted basic features are then passed through the first feature fusion sub-network (for example, two stacked Feature Pyramid Enhancement Modules (FPEM)) for multi-scale feature fusion, yielding the feature map corresponding to each video frame. FPEM enhances the features extracted by the convolutional neural network, making the extracted features more robust.
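The following is a minimal sketch of such a feature extraction sub-network in PyTorch, assuming a recent torchvision; keeping the ResNet-18 stages up to the last residual block and projecting to 64 channels with a 1*1 convolution are illustrative assumptions, not the exact layers of the disclosed network.

```python
# A minimal sketch of the detection branch's feature extraction step.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep everything up to the last residual stage, drop avgpool/fc
        self.stem = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, 64, kernel_size=1)  # -> Batch*64*H*W

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(self.stem(frames))

features = FeatureExtractor()(torch.randn(2, 3, 224, 224))
print(features.shape)  # torch.Size([2, 64, 7, 7])
```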
Illustratively, in S13013, after the feature map corresponding to each video frame is obtained, the feature map of each frame is processed, through the two convolutional layers and the global average pooling layer of the feature detection sub-network, into a text region heat map indicating whether each location is a text region. It can be understood that the text region heat map contains two kinds of content: text regions and non-text regions.
Illustratively, in S13015, connected component analysis is performed on the text region heat map to obtain the positions of the text regions, and each position is then fitted to a rotated rectangle, thereby obtaining the text regions corresponding to each video frame.
Here, a connected component refers to an adjacent region in an image whose pixels share the same value. Connected component analysis refers to finding and labeling the connected components in an image.
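As an illustration, a minimal sketch of S13015 with OpenCV follows, assuming the heat map has been binarized into a text/non-text mask; the threshold value of 0.5 is an assumption for the example.

```python
# A minimal sketch of connected component analysis plus rotated-rectangle
# fitting over a binarized text region heat map.
import cv2
import numpy as np

def heatmap_to_text_boxes(heatmap: np.ndarray, thresh: float = 0.5):
    """Return one rotated rectangle (box-1 ... box-m) per connected
    component of the binarized text region heat map."""
    mask = (heatmap > thresh).astype(np.uint8)
    num, labels = cv2.connectedComponents(mask)
    boxes = []
    for label in range(1, num):  # label 0 is the non-text background
        ys, xs = np.nonzero(labels == label)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)        # fit a rotated rectangle
        boxes.append(cv2.boxPoints(rect))  # its four corner points
    return boxes

boxes = heatmap_to_text_boxes(np.random.rand(64, 64))
```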
It should be noted that at least one text region can be detected in each video frame in this way, denoted box-1, ..., box-m, for a total of m text regions (m is a positive integer greater than or equal to 1). For example, Frame-1 yields multiple boxes, ..., Frame-t yields multiple boxes, ..., and Frame-n yields multiple boxes.
In the embodiments of the present disclosure, by performing feature extraction, heat map detection, and connected component analysis on each video frame, the text region can be determined accurately from each frame, and the determination does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the computing resources consumed in computing the text region.
In S1303, the position information of the text region is encoded to obtain the position feature.
In some embodiments, the position information of the text regions in each video frame is encoded frame by frame in the order of the frames on the time axis.
In other embodiments, the position information of the text regions in multiple video frames is encoded in parallel.
In an exemplary embodiment, position encoding is applied to the position information of the text region (that is, the coordinates of its upper-left, lower-left, upper-right, and lower-right corners) to obtain the position feature of the text region; the position feature is 1*128-dimensional. In some embodiments, the position encoding includes, but is not limited to, cosine position encoding (cos encoding), sine position encoding (sin encoding), and the like.
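A minimal sketch of such a position encoding follows, assuming a sinusoidal (sin/cos) encoding that spreads 16 dimensions over each of the 8 corner coordinates to reach the 1*128 output; the per-coordinate packing is an assumption, since the disclosure fixes only the output size.

```python
# A minimal sketch of S1303: sinusoidal encoding of the four corner
# coordinates (8 scalars) into a 1*128-dimensional position feature.
import numpy as np

def encode_box_position(corners: np.ndarray, dims_per_coord: int = 16) -> np.ndarray:
    """corners: shape (8,) = (x1, y1, x2, y2, x3, y3, x4, y4)."""
    freqs = 1.0 / (10000 ** (np.arange(dims_per_coord // 2) * 2.0 / dims_per_coord))
    angles = corners[:, None] * freqs[None, :]                       # (8, 8)
    feat = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (8, 16)
    return feat.reshape(1, -1)                                       # (1, 128)

pos_feature = encode_box_position(np.array([10, 20, 110, 20, 110, 44, 10, 44], float))
print(pos_feature.shape)  # (1, 128)
```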
It should be noted that since each video frame includes at least one text region, for the text regions belonging to the same video frame, the position feature of each region is determined in turn according to the order of the regions in the frame. For example, if the text regions of a video frame are box-1, ..., box-m, the position feature of box-1 is determined first, ..., and the position feature of box-m last.
In S1305, the image feature is extracted from each video frame based on the position information of the text region.
In the embodiments of the present disclosure, after the position information of each text region in each video frame is obtained, the image feature corresponding to each text region is extracted from the frame according to that position information.
In some embodiments, the image features of the text regions are extracted from each video frame frame by frame, in the order of the frames on the time axis, according to the position information of the text regions in each frame.
In other embodiments, the image features of the corresponding text regions are extracted from multiple video frames in parallel according to the position information of the text regions.
It should be noted that since each video frame includes at least one text region, for the text regions belonging to the same video frame, the image feature of each region is likewise determined in turn according to the order of the regions in the frame.
In some embodiments, in S1305, extracting the image feature from each video frame based on the position information of the text region includes:
mapping the position information of the text region into the feature map to obtain a position mapping result;
extracting, from the feature map, the image located in the position mapping result to obtain the image feature, where the position mapping result is the mapping region corresponding to the text region.
In some embodiments, the position information of the text region of each video frame (that is, the coordinates of its upper-left, lower-left, upper-right, and lower-right corners) is mapped onto the feature map corresponding to the frame to obtain the position mapping result (that is, a mapping region) of the text region; the part of the feature map located in that mapping region is then cropped out, yielding the image feature corresponding to the text region of each video frame.
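A minimal sketch of this mapping-and-cropping follows, assuming the feature map is downscaled from the frame by an integer stride and the crop covers the box's axis-aligned extent; both choices are assumptions made for the example.

```python
# A minimal sketch of S1305: map box corners into feature map
# coordinates, then crop that region of the feature map.
import numpy as np

def crop_box_feature(feature_map: np.ndarray, corners: np.ndarray, stride: int = 4):
    """feature_map: (C, H, W); corners: (4, 2) box corners in frame pixels."""
    mapped = corners / stride                      # position mapping result
    x0, y0 = np.floor(mapped.min(axis=0)).astype(int)
    x1, y1 = np.ceil(mapped.max(axis=0)).astype(int)
    return feature_map[:, y0:y1, x0:x1]            # image feature of the box

fmap = np.random.rand(64, 56, 56)
box = np.array([[10, 20], [110, 20], [110, 44], [10, 44]], float)
print(crop_box_feature(fmap, box).shape)  # (64, 6, 26)
```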
Illustratively, in S1303 and S1305, the position feature and the image feature are extracted through the text tracking network. As shown in Fig. 5, the text tracking network further includes a recognition branch network, through which the corresponding position feature and image feature are extracted from each video frame according to the position information of the text region.
In the embodiments of the present disclosure, by mapping the position information of the text region into the feature map and cropping out the corresponding image feature, the image feature is matched precisely to the text region, which improves the precision of the image feature and thus of the fusion feature descriptor, and in turn the precision of video text tracking. Moreover, the computation of the image feature does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces its consumption of computing resources.
In S1307, the image feature is decoded to obtain the character sequence feature.
In the embodiments of the present disclosure, since each text region of each video frame is a text-line-level region, that is, the region contains one line of text in which each character has a definite position and order, the image feature corresponding to each text region is decoded, after it is obtained, to yield the character sequence feature corresponding to that region.
In some embodiments, the image features corresponding to the text regions in each video frame are decoded frame by frame, in the order of the frames on the time axis, to obtain the character sequence features corresponding to the text regions in each frame.
In other embodiments, the image features corresponding to the text regions of multiple video frames are decoded in parallel to obtain the character sequence features corresponding to the text regions in each frame.
Illustratively, in S1307, the image feature is decoded through the text tracking network to obtain the character sequence feature. As shown in Fig. 5, the recognition branch network of the text tracking network decodes the image feature cropped for each text region of each video frame (for example, by Connectionist Temporal Classification decoding (CTC decoding)), thereby obtaining the character sequence feature of each text region of each video frame.
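As an illustration of one possible decoding, a minimal sketch of greedy (best-path) CTC decoding follows; the alphabet and the greedy strategy are assumptions, since the disclosure names CTC decoding only as an example.

```python
# A minimal sketch of greedy CTC decoding over per-timestep character
# probabilities, as one possible realization of the decoding in S1307.
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """probs: (T, num_classes) distribution per horizontal position of the
    text-line feature; class 0 is the CTC blank."""
    best = probs.argmax(axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # collapse repeats, drop blanks
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)

probs = np.eye(4)[[1, 1, 0, 2, 2, 0, 3]]   # toy sequence of one-hot steps
print(ctc_greedy_decode(probs, "abc"))     # -> "abc"
```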
In the embodiments of the present disclosure, the text region is first determined from each video frame, and the position feature, image feature, and character sequence feature of the region are then determined from its position information. This improves the precision with which the position feature, image feature, and character sequence feature are determined, hence the precision of the fusion feature descriptor, and in turn the precision of video text tracking. Moreover, determining these features does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the consumption of computing resources.
In S15, the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of that frame.
In some embodiments, the character sequence feature, position feature, and image feature corresponding to each text region in each video frame are fused frame by frame, in the order of the frames on the time axis, to obtain the fusion feature descriptor corresponding to each text region in each frame.
In other embodiments, the character sequence features, position features, and image features corresponding to the text regions in multiple video frames are fused in parallel to obtain the fusion feature descriptor corresponding to each text region in each frame.
In the embodiments of the present disclosure, the three features of each video frame may be fused in various ways to obtain the fusion feature descriptor corresponding to the frame, which is not specifically limited herein.
In some embodiments, the three features are fused through the text tracking network to obtain the fusion feature descriptor corresponding to each video frame. Illustratively, as shown in Fig. 5, the text tracking network further includes a multi-information fusion descriptor branch network, which in turn includes a second feature fusion sub-network and a feature size adjustment sub-network.
In some embodiments, Fig. 6 is a flowchart of obtaining the fusion feature descriptor through the text tracking network according to an exemplary embodiment. As shown in Fig. 6, in S15, obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of the frame includes the following steps S1501 and S1503:
In S1501, for each video frame, the character sequence feature, the position feature, and the image feature are concatenated to obtain the fusion feature corresponding to the frame.
In S1503, the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
Illustratively, in S1501, the character sequence feature, position feature, and image feature corresponding to each text region of each video frame are concatenated through the second feature fusion sub-network to obtain the fusion feature corresponding to each text region of the frame; the dimension of the fusion feature is 3*128.
Illustratively, in S1503, the output size of the fusion feature corresponding to each text region of each video frame is adjusted (for example, from 3*128 dimensions to 1*128 dimensions) through the feature size adjustment sub-network (for example, two multilayer perceptrons (MLP)) to obtain the fusion feature descriptor corresponding to each text region of each video frame. A multilayer perceptron is a feedforward artificial neural network comprising an input layer, an output layer, and several hidden layers.
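A minimal sketch of S1501 and S1503 follows, assuming a PyTorch implementation in which each of the three features is 1*128-dimensional; the width of the two-layer MLP is an assumption for the example.

```python
# A minimal sketch of feature concatenation (S1501) and size adjustment
# (S1503): three 1*128 features -> 3*128 fusion feature -> 1*128 descriptor.
import torch
import torch.nn as nn

class FusionDescriptor(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(   # feature size adjustment: 3*128 -> 1*128
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, seq_feat, pos_feat, img_feat):
        fused = torch.cat([seq_feat, pos_feat, img_feat], dim=-1)  # (N, 384)
        return self.mlp(fused)                                     # (N, 128)

f = torch.randn(5, 128)  # five text boxes in one frame
descriptors = FusionDescriptor()(f, f, f)
print(descriptors.shape)  # torch.Size([5, 128])
```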
In some embodiments, the multi-information fusion descriptor branch network further includes a position feature extraction sub-network, an image feature extraction sub-network, and a character sequence feature extraction sub-network, which extract the position feature, the image feature, and the character sequence feature, respectively.
In the embodiments of the present disclosure, feature extraction, feature recognition, and feature fusion are performed through the above end-to-end text tracking network to obtain the fusion feature descriptor, from which the tracking track of the video text is derived. Video text tracking is thus achieved with a single model; the tracking track is determined with high accuracy and robustness, and the computing resource consumption is small.
It should be noted that since each video frame corresponds to at least one text region, the fusion feature descriptor of each text region in each frame is determined in turn according to the order of the regions in the frame. For example, if the text regions of a video frame are box-1, ..., box-m, it is first determined that box-1 corresponds to one fusion feature descriptor, ..., and finally that box-m corresponds to one fusion feature descriptor.
In the embodiments of the present disclosure, the character sequence feature, position feature, and image feature of each text region of each video frame are concatenated into a fusion feature, and the size of the fusion feature is adjusted to obtain the fusion feature descriptor. The determination of the fusion feature descriptor thus fully considers the character sequence, position, and image features of the text and achieves high accuracy, which in turn improves the accuracy of text tracking. Moreover, determining the fusion feature descriptor does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces its consumption of computing resources.
In S17, the stored reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video; the tracking track represents the position information of the text.
In the embodiments of the present disclosure, a reference tracking track representing the position information of the text is maintained. After the fusion feature descriptor of the text region of each video frame is obtained, the reference fusion feature descriptor in the reference tracking track is updated with the fusion feature descriptor of the text region in each frame, yielding the tracking track of the text in the video.
Illustratively, the update includes, but is not limited to, modification, replacement, addition, deletion, and the like.
Fig. 7 is a flowchart of updating the reference tracking track to obtain the tracking track of the text in the video according to an exemplary embodiment. As shown in Fig. 7, in S17, updating the reference tracking track based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video includes the following steps S1701 and S1703:
In S1701, the video frames are traversed in order, and for each traversed frame the following operations are performed: similarity matching is performed between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region in the reference tracking track, to obtain the similarity matching result corresponding to the current video frame; the reference tracking track is updated based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking track corresponding to the current video frame; and the target tracking track is determined as the updated reference tracking track. The order represents the order of the video frames on the time axis.
In S1703, the target tracking track corresponding to the last video frame in the order is taken as the tracking track of the text in the video.
In some embodiments, since the embodiments of the present disclosure analyze the tracking track of video text, and the video frames and the text within them are ordered, the track must be updated frame by frame in the order of the frames on the time axis.
Illustratively, the consecutive video frames are Frame-1, ..., Frame-t, ..., Frame-n, and each frame corresponds to multiple text regions box-1, ..., box-t, ..., box-m. Each text region corresponds to one fusion feature descriptor.
The reference tracking track is empty at initialization.
Frame-1 is traversed first, becoming the current video frame. Since the reference tracking track is empty, the text regions box-1, ..., box-t, ..., box-m of Frame-1 and their fusion feature descriptors are directly initialized into the reference tracking track, yielding target tracking track 1 corresponding to Frame-1, which is determined as the updated reference tracking track.
Frame-2 is traversed next, becoming the current video frame. Since tracking track 1 already stores the text regions of Frame-1 and their fusion feature descriptors, the similarity between the fusion feature descriptor of each text region of Frame-2 and the fusion feature descriptors already in tracking track 1 is computed in turn, yielding the similarity matching result corresponding to Frame-2. Tracking track 1 is then updated based on this result, the current fusion feature descriptors of the text regions of Frame-2, and those text regions, yielding target tracking track 2 corresponding to Frame-2, which is determined as the updated reference tracking track.
The subsequent traversal of Frame-3 through Frame-n proceeds in the same way as for Frame-2 and is not repeated here. After the last frame Frame-n is traversed, the target tracking track n corresponding to Frame-n is output; it is the tracking track of the text in the video.
In the embodiments of the present disclosure, the fusion feature descriptor of each text region of each video frame is matched for similarity against the reference fusion feature descriptors in the reference tracking track, and the reference descriptors are updated in turn according to each frame's similarity results, so that the final tracking track fully considers the character sequence, position, and image features of the text and text tracking accuracy is high. Moreover, the determination of the tracking track is an end-to-end process (for example, text tracking is implemented with one end-to-end text tracking network) that does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the computing resources consumed in determining the track.
In some embodiments, performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor in the reference tracking track to obtain the similarity matching result corresponding to the current video frame includes:
computing a similarity confusion matrix between the current fusion feature descriptor and the reference fusion feature descriptor to obtain the similarity matching result.
In some embodiments, to improve the precision of the similarity computation, the similarity confusion matrix is computed to obtain the similarities, and the Hungarian algorithm is then used to select the higher similarities, yielding the similarity matching result.
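A minimal sketch of this matching step follows, using cosine similarity for the confusion matrix and SciPy's Hungarian solver; cosine similarity is an assumption, since the disclosure does not fix the similarity measure.

```python
# A minimal sketch of the similarity confusion matrix plus Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_descriptors(current: np.ndarray, reference: np.ndarray):
    """current: (p, 128) descriptors of the current frame's boxes;
    reference: (q, 128) descriptors stored in the reference track.
    Returns (box index, track index, similarity) per matched pair."""
    cur = current / np.linalg.norm(current, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = cur @ ref.T                         # similarity confusion matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian: maximize similarity
    return [(r, c, sim[r, c]) for r, c in zip(rows, cols)]

matches = match_descriptors(np.random.rand(2, 128), np.random.rand(3, 128))
```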
Assume the consecutive video frames are Frame-1, ..., Frame-t, ..., Frame-n, and each frame corresponds to multiple text regions box-1, ..., box-t, ..., box-m, each with one fusion feature descriptor.
The reference tracking track is empty at initialization.
Frame-1 is traversed first, becoming the current video frame. Since the reference tracking track is empty, the text regions box-1, box-2, and box-3 of Frame-1 and their fusion feature descriptors are directly added to the reference tracking track, yielding target tracking track 1 corresponding to Frame-1, which is determined as the updated reference tracking track.
Frame-2 is traversed next, becoming the current video frame. Assume Frame-2 includes box-1 and box-2. A similarity confusion matrix is computed between the fusion feature descriptors of box-1 and box-2 of Frame-2 and the fusion feature descriptors of box-1, box-2, and box-3 in tracking track 1. This yields the similarities between the descriptor of box-1 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (that is, three similarities for box-1 of Frame-2), and the similarities between the descriptor of box-2 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (three similarities for box-2 of Frame-2). Finally, the Hungarian matching algorithm selects the highest of the three similarities of box-1 of Frame-2 as the similarity matching result for box-1 of Frame-2, and the highest of the three similarities of box-2 of Frame-2 as the similarity matching result for box-2 of Frame-2.
The subsequent determination of the similarity matching results for Frame-3 through Frame-n proceeds in the same way as for Frame-2 and is not repeated here.
In some embodiments, updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame includes:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking track.
In some embodiments, updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame further includes:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
In practical applications, a text region in the current video frame may be part of an existing track, or it may not belong to any existing track and instead be a new text region distinct from the existing tracks.
To accurately judge whether a text region in the current video frame belongs to an existing track, the similarity matching result corresponding to the current video frame is compared with the preset similarity threshold.
When the similarity matching result corresponding to the current video frame is smaller than the preset similarity threshold, indicating that the current text region is part of an existing track, the reference text region is updated with the current text region and the reference fusion feature descriptor with the current fusion feature descriptor, so that the current text region and its descriptor are merged into the existing track. The text regions belonging to an existing track are thereby determined accurately, improving the precision with which the tracking track of the text in the video is determined.
When the similarity matching result corresponding to the current video frame is greater than or equal to the preset similarity threshold, indicating that the current text region is not part of an existing track but new text, the current text region and the current fusion feature descriptor are added to the reference tracking track to update it. The text regions not belonging to any existing track are thereby determined accurately, further improving the precision with which the tracking track of the text in the video is determined.
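A minimal sketch of the resulting per-frame track update follows. It assumes the conventional reading in which a box whose best similarity reaches the threshold continues an existing track and any other box starts a new one; the threshold value and the dict-based track layout are likewise assumptions made for the example.

```python
# A minimal sketch of the per-frame track update in S1701, built on the
# (box index, track index, similarity) matches from the matching step.
def update_tracks(tracks, boxes, descriptors, matches, threshold=0.8):
    """tracks: list of {'box', 'descriptor'} entries (the reference track)."""
    matched = set()
    for b, t, sim in matches:
        if sim >= threshold:        # part of an existing track: update it
            tracks[t]["box"] = boxes[b]
            tracks[t]["descriptor"] = descriptors[b]
            matched.add(b)
    for b in range(len(boxes)):     # new text: add a new track entry
        if b not in matched:
            tracks.append({"box": boxes[b], "descriptor": descriptors[b]})
    return tracks

tracks = update_tracks([], ["box-1"], [[0.1] * 128], [])  # Frame-1 initializes
```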
Continuing with the above example:
For Frame-2, the computation of the similarity confusion matrix yields the similarities between the fusion feature descriptor of box-1 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (that is, three similarities for box-1 of Frame-2), and the similarities between the fusion feature descriptor of box-2 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (three similarities for box-2 of Frame-2).
Assume the highest similarity selected by the Hungarian matching algorithm from the three similarities of box-1 of Frame-2 is the one obtained by matching against box-1 in tracking track 1, and that this similarity is greater than or equal to the preset similarity threshold, indicating that box-1 of Frame-2 is part of an existing track (box-1 in tracking track 1). Then box-1 in tracking track 1 is updated with box-1 of Frame-2, and the reference fusion feature descriptor of box-1 in tracking track 1 is updated with the current fusion feature descriptor of box-1 of Frame-2, so that box-1 of Frame-2 and its descriptor are merged into tracking track 1.
Assume the highest similarity selected by the Hungarian matching algorithm from the three similarities of box-2 of Frame-2 is the one obtained by matching against box-2 in tracking track 1, and that this similarity is smaller than the preset similarity threshold, indicating that box-2 of Frame-2 is not part of an existing track but new text. Then box-2 of Frame-2 and its fusion feature descriptor are initialized into tracking track 1.
Fig. 8 is a block diagram of a video text tracking apparatus according to an exemplary embodiment. Referring to Fig. 8, the apparatus includes an extraction module 21, a feature acquisition module 23, a descriptor acquisition module 25, and a tracking track determination module 27.
The extraction module 21 is configured to extract multiple video frames from a video.
The feature acquisition module 23 is configured to acquire the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region.
The descriptor acquisition module 25 is configured to obtain the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of that frame.
The tracking track determination module 27 is configured to update a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video; the tracking track represents the position information of the text in the video.
In some embodiments, the feature acquisition module 23 includes:
a text region determination unit configured to, for each video frame, determine the text region from the video frame;
a position feature acquisition unit configured to encode the position information of the text region to obtain the position feature;
an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text region;
a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
In some embodiments, the text region determination unit includes:
a feature map determination subunit configured to perform feature extraction on the video frame to obtain the feature map corresponding to the video frame;
a text region heat map determination subunit configured to determine the text region heat map corresponding to the feature map, the text region heat map representing the text regions and non-text regions in the feature map;
a connected component analysis subunit configured to perform connected component analysis on the text region heat map to obtain the text region.
In some embodiments, the image feature extraction unit includes:
a mapping subunit configured to map the position information of the text region into the feature map to obtain the mapping region corresponding to the text region;
an image feature extraction subunit configured to extract, from the feature map, the image located in the mapping region to obtain the image feature.
In some embodiments, the descriptor acquisition module 25 includes:
a concatenation unit configured to, for each video frame, concatenate the character sequence feature, the position feature, and the image feature of the video frame to obtain the fusion feature corresponding to the frame;
an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking track includes a reference text region, the multiple video frames are consecutive on the time axis, and the tracking track determination module 27 includes:
a similarity matching unit configured to traverse the video frames in order and, for each traversed frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain the similarity matching result corresponding to the current video frame; updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking track corresponding to the current video frame; and determining the target tracking track as the updated reference tracking track, the order representing the order of the video frames on the time axis;
a tracking track determination unit configured to determine the target tracking track corresponding to the last video frame in the order as the tracking track of the text in the video.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text region based on the current text region and update the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking track.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is greater than or equal to the preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text region in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of that frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain the basic feature of the frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection sub-network is used to process the feature map of the video frame into a text region heat map, the text region heat map representing the text regions and non-text regions in the feature map, and to perform connected component analysis on the text region heat map to obtain the text region corresponding to the video frame.
The specific manner in which each module of the apparatus in the foregoing embodiments performs its operations has been described in detail in the embodiments of the method, and is not elaborated again here.
In an exemplary embodiment, an electronic device is further provided, including a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
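Sketched below is one way the four steps above could be realized for a single detected box: the box coordinates are encoded as frame-normalized values (position feature), an ROI-aligned crop of the feature map serves as the image feature, and a recurrent decoder reads the crop column by column into a character sequence feature. The normalization scheme, the 8x32 ROI size, the stride of 4, and the GRU decoder are all assumptions introduced for this sketch.

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

class RegionFeatureExtractor(nn.Module):
    def __init__(self, feat_channels: int = 64, num_classes: int = 97, stride: int = 4):
        super().__init__()
        self.stride = stride  # assumed downsampling factor of the feature map
        self.decoder = nn.GRU(feat_channels, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)  # per-column character logits

    def forward(self, feature_map: torch.Tensor, box, frame_hw):
        h, w = frame_hw
        x1, y1, x2, y2 = box
        # Position feature: the box coordinates encoded as frame-normalized values.
        pos_feat = torch.tensor([x1 / w, y1 / h, x2 / w, y2 / h])
        # Image feature: crop the box (mapped to feature-map scale) with ROI align.
        rois = torch.tensor([[0.0, float(x1), float(y1), float(x2), float(y2)]])
        img_feat = ops.roi_align(feature_map, rois, output_size=(8, 32),
                                 spatial_scale=1.0 / self.stride)  # (1, C, 8, 32)
        # Character sequence feature: decode the crop column by column.
        seq_in = img_feat.mean(dim=2).permute(0, 2, 1)  # (1, 32, C)
        seq_feat, _ = self.decoder(seq_in)               # (1, 32, 128)
        logits = self.classifier(seq_feat)               # (1, 32, num_classes)
        return seq_feat, pos_feat, img_feat, logits
```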
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
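The heat-map thresholding and connected domain analysis described here are commonly done with OpenCV, as in the sketch below; the 0.5 binarization threshold and the minimum-area filter are assumed values, not parameters from the disclosure.

```python
import cv2
import numpy as np

def heatmap_to_text_regions(heatmap: np.ndarray, thresh: float = 0.5, min_area: int = 16):
    """heatmap: (H, W) array in [0, 1]; returns a list of (x, y, w, h) boxes."""
    binary = (heatmap >= thresh).astype(np.uint8)  # text vs. non-text mask
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, num):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:  # drop tiny spurious components
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```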
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
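In the simplest case, mapping the text region into the feature map is a coordinate rescaling by the downsampling stride between the frame and the feature map; the uniform stride of 4 in the snippet below is an assumption of this sketch.

```python
def map_box_to_feature(box, stride: int = 4):
    """Map an image-space box (x1, y1, x2, y2) onto the feature map, assuming
    the feature map is a uniform `stride`-fold downsampling of the frame."""
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)

# The image feature is then the feature-map patch inside the mapped region:
# fx1, fy1, fx2, fy2 = map_box_to_feature(box)
# image_feature = feature_map[:, :, fy1:fy2, fx1:fx2]
```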
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
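The concatenation and size adjustment can be as simple as flattening the three features, concatenating them, and projecting the result to a fixed length; the linear projection and the 256-dimensional output below are assumptions of this sketch. A fixed-length output is what makes descriptors from different frames directly comparable in the similarity matching step.

```python
import torch
import torch.nn as nn

class FusionDescriptor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        # Size adjustment: project the concatenated fusion feature to a fixed length.
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, seq_feat, pos_feat, img_feat):
        # Concatenate the character sequence, position, and image features.
        fused = torch.cat([seq_feat.flatten(), pos_feat.flatten(), img_feat.flatten()])
        return self.proj(fused)  # fusion feature descriptor

# Example sizing, matching the shapes of the earlier sketch (assumed):
# in_dim = 1*32*128 (seq) + 4 (pos) + 1*64*8*32 (img) = 20484
```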
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and the processor is configured to execute the instructions to implement the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
The electronic device may be a terminal, a server, or a similar computing apparatus. Taking a server as an example, Fig. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment. The electronic device 30 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 31 (each CPU 31 may include, but is not limited to, a processing apparatus such as a microprocessor MCU or a programmable logic device FPGA), a memory 33 for storing data, and one or more storage media 32 (for example, one or more mass storage devices) storing application programs 323 or data 322. The memory 33 and the storage medium 32 may be transitory storage or persistent storage. The program stored in the storage medium 32 may include one or more modules, and each module may include a series of instruction operations for the electronic device. Further, the CPU 31 may be configured to communicate with the storage medium 32 and to execute, on the electronic device 30, the series of instruction operations in the storage medium 32. The electronic device 30 may further include one or more power supplies 36, one or more wired or wireless network interfaces 35, one or more input/output interfaces 34, and/or one or more operating systems 321, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The input/output interface 34 may be used to receive or send data via a network. A specific example of the network may include a wireless network provided by a communication provider of the electronic device 30. In one example, the input/output interface 34 includes a network interface controller (NIC), which may be connected to other network devices through a base station so as to communicate with the Internet. In an exemplary embodiment, the input/output interface 34 may be a radio frequency (RF) module configured to communicate with the Internet wirelessly.
Those of ordinary skill in the art can understand that the structure shown in Fig. 9 is merely illustrative and does not limit the structure of the electronic device. For example, the electronic device 30 may include more or fewer components than shown in Fig. 9, or have a configuration different from that shown in Fig. 9.
In an exemplary embodiment, a computer-readable storage medium is further provided. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
In an exemplary embodiment, a computer program product is further provided, including a computer program. When executed by a processor, the computer program implements the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, when executed by the processor, the computer program implements the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
In some embodiments, when executed by the processor, the computer program implements the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
In some embodiments, when executed by the processor, the computer program implements the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
In some embodiments, when executed by the processor, the computer program implements the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and, when executed by the processor, the computer program implements the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, when executed by the processor, the computer program implements the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, when executed by the processor, the computer program implements the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
All of the embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and all such implementations fall within the scope of protection claimed by the present disclosure.

Claims (32)

1. A video text tracking method, comprising:
    extracting a plurality of video frames from a video;
    obtaining a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
2. The video text tracking method according to claim 1, wherein obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region comprises:
    for each video frame, determining the text region from the video frame;
    encoding position information of the text region to obtain the position feature;
    extracting the image feature from the video frame based on the position information of the text region; and
    decoding the image feature to obtain the character sequence feature.
3. The video text tracking method according to claim 2, wherein determining the text region from the video frame comprises:
    performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    performing connected domain analysis on the text region heat map to obtain the text region.
4. The video text tracking method according to claim 3, wherein extracting the image feature from the video frame based on the position information of the text region comprises:
    mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    extracting an image located in the mapped region from the feature map to obtain the image feature.
5. The video text tracking method according to claim 1, wherein determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame comprises:
    for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    adjusting the size of the fusion feature to obtain the fusion feature descriptor.
6. The video text tracking method according to claim 1, wherein the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on a time axis, and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video comprises:
    traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis; and
    determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
7. The video text tracking method according to claim 6, wherein updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking trajectory corresponding to the current video frame comprises:
    when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
8. The video text tracking method according to claim 6, wherein updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking trajectory corresponding to the current video frame comprises:
    when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
9. The video text tracking method according to claim 1, wherein the method is performed based on a text tracking network, and the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region; and
    the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
10. The video text tracking method according to claim 9, wherein the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
    the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame; and
    the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
11. A video text tracking apparatus, comprising:
    an extraction module configured to extract a plurality of video frames from a video;
    a feature acquisition module configured to obtain a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    a descriptor acquisition module configured to obtain, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    a tracking trajectory determination module configured to update a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
12. The video text tracking apparatus according to claim 11, wherein the feature acquisition module comprises:
    a text region determination unit configured to, for each video frame, determine the text region from the video frame;
    a position feature acquisition unit configured to encode position information of the text region to obtain the position feature;
    an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text region; and
    a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
13. The video text tracking apparatus according to claim 12, wherein the text region determination unit comprises:
    a feature map determination subunit configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    a text region heat map determination subunit configured to determine a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    a connected domain analysis subunit configured to perform connected domain analysis on the text region heat map to obtain the text region.
14. The video text tracking apparatus according to claim 13, wherein the image feature extraction unit comprises:
    a mapping subunit configured to map the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    an image feature extraction subunit configured to extract an image located in the mapped region from the feature map to obtain the image feature.
15. The video text tracking apparatus according to claim 11, wherein the descriptor acquisition module comprises:
    a concatenation unit configured to, for each video frame, concatenate the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
16. The video text tracking apparatus according to claim 11, wherein the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on a time axis, and the tracking trajectory determination module comprises:
    a similarity matching unit configured to traverse each of the video frames in order and, when traversing each video frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis; and
    a tracking trajectory determination unit configured to determine the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
17. The video text tracking apparatus according to claim 16, wherein the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text region based on the current text region and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
18. The video text tracking apparatus according to claim 16, wherein the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
19. The video text tracking apparatus according to claim 11, wherein a text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region; and
    the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
20. The video text tracking apparatus according to claim 19, wherein the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
    the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame; and
    the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
21. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
    extracting a plurality of video frames from a video;
    obtaining a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
22. The electronic device according to claim 21, wherein the processor is configured to execute the instructions to implement the following steps:
    for each video frame, determining the text region from the video frame;
    encoding position information of the text region to obtain the position feature;
    extracting the image feature from the video frame based on the position information of the text region; and
    decoding the image feature to obtain the character sequence feature.
23. The electronic device according to claim 22, wherein the processor is configured to execute the instructions to implement the following steps:
    performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    performing connected domain analysis on the text region heat map to obtain the text region.
24. The electronic device according to claim 23, wherein the processor is configured to execute the instructions to implement the following steps:
    mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    extracting an image located in the mapped region from the feature map to obtain the image feature.
25. The electronic device according to claim 21, wherein the processor is configured to execute the instructions to implement the following steps:
    for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    adjusting the size of the fusion feature to obtain the fusion feature descriptor.
  26. 根据权利要求21所述的电子设备,其中,所述参考跟踪轨迹包括参考文字区域,所述多个视频帧在时间轴上连续,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 21, wherein the reference tracking track includes a reference text area, the plurality of video frames are continuous on the time axis, and the processor is configured to execute the instructions to achieve the following steps :
    按顺序依次遍历所述每个视频帧,并在遍历所述每个视频帧时,执行以下操作:将遍历到的当前视频帧对应的当前融合特征描述子,与所述参考文字区域对应的参考融合特征描述子进行相似度匹配,得到所述当前视频帧对应的相似度匹配结果;基于所述相似度匹配结果、所述当前融合特征描述子和所述当前视频帧对应的当前文字区域,更新所述参考跟踪轨迹,得到所述当前视频帧对应的目标跟踪轨迹;且将所述目标跟踪轨迹确定为更新后的所述参考跟踪轨迹;所述顺序表征所述每个视频帧在所述时间轴上的顺序;Traverse each video frame in sequence, and when traversing each video frame, perform the following operations: the current fusion feature descriptor corresponding to the current video frame traversed, the reference corresponding to the reference text area The fusion feature descriptor performs similarity matching to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor and the current text area corresponding to the current video frame, update The reference tracking track obtains the target tracking track corresponding to the current video frame; and determines the target tracking track as the updated reference tracking track; the sequence characterizes each video frame at the time the order on the axes;
    将排序最后的视频帧对应的目标跟踪轨迹,确定为所述视频中的文字的跟踪轨迹。The target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the video.
  27. 根据权利要求26所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 26, wherein the processor is configured to execute the instructions to implement the following steps:
    在所述相似度匹配结果小于预设相似度阈值的情况下,基于所述当前文字区域更新所述参考文字区域,并基于所述当前融合特征描述子更新所述参考融合特征描述子,得到所述目标跟踪轨迹。When the similarity matching result is less than a preset similarity threshold, update the reference text region based on the current text region, and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the the target tracking trajectory.
  28. 根据权利要求26所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 26, wherein the processor is configured to execute the instructions to implement the following steps:
    在所述相似度匹配结果大于或等于预设相似度阈值的情况下,将所述当前文字区域和所述当前融合特征描述子添加至所述参考跟踪轨迹中,得到所述目标跟踪轨迹。When the similarity matching result is greater than or equal to a preset similarity threshold, the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
29. The electronic device according to claim 21, wherein a text tracking network comprises a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the text sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region;
    the multi-information fusion descriptor branch network is configured to perform the step of determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, the fusion feature descriptor corresponding to each video frame.
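To make the division of labor in claim 29 concrete, here is a small skeleton showing only the data flow among the three branches. Every branch body is a placeholder supplied by the caller, and the class and method names are invented for illustration; only the wiring is taken from the claim.

```python
class TextTrackingNetwork:
    """Skeleton of the three-branch layout described in claim 29."""

    def __init__(self, detection_branch, recognition_branch, fusion_branch):
        self.detect = detection_branch        # frame -> list of text regions
        self.recognize = recognition_branch   # (frame, region) -> 3 features
        self.fuse = fusion_branch             # 3 features -> fusion descriptor

    def describe_frame(self, frame):
        # Detection branch: locate the text regions in the frame.
        regions = self.detect(frame)
        # Recognition branch: per-region text sequence, position, image feats.
        features = [self.recognize(frame, r) for r in regions]
        # Fusion branch: one fusion feature descriptor per region.
        return regions, [self.fuse(*f) for f in features]
```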
30. The electronic device according to claim 29, wherein the detection branch network comprises a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
    the feature extraction sub-network is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion sub-network is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
    the feature detection sub-network is configured to process the feature map of the video frame into a text region heat map, the text region heat map representing text regions and non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
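A sketch of the heat-map-to-regions step at the end of claim 30, using SciPy's connected-component labeling as one plausible form of "connected domain analysis"; the probability threshold and the box format are assumptions.

```python
import numpy as np
from scipy import ndimage

def regions_from_heatmap(heatmap: np.ndarray, prob_threshold: float = 0.5):
    """Turn a text-region heat map into boxes via connected components.

    heatmap: (H, W) array of per-pixel text probabilities at feature-map
             scale. The 0.5 threshold is an assumed value.
    """
    mask = heatmap > prob_threshold              # text vs. non-text pixels
    labeled, num = ndimage.label(mask)           # connected-domain analysis
    boxes = []
    for obj in ndimage.find_objects(labeled):    # one slice pair per component
        ys, xs = obj
        boxes.append((xs.start, ys.start, xs.stop, ys.stop))  # x1, y1, x2, y2
    return boxes
```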
31. A computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
    extracting a plurality of video frames from a video;
    obtaining a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region;
    determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, a fusion feature descriptor corresponding to each video frame;
    updating a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame, to obtain a tracking track of the text in the video, wherein the tracking track is used to represent position information of the text in the video.
32. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    extracting a plurality of video frames from a video;
    obtaining a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region;
    determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, a fusion feature descriptor corresponding to each video frame;
    updating a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame, to obtain a tracking track of the text in the video, wherein the tracking track is used to represent position information of the text in the video.
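Read end to end, claims 31 and 32 recite the same four-step pipeline as the method. The sketch below only wires those steps together; detector, recognizer, fuser, and matcher are placeholder callables (for instance, the helpers sketched under the earlier claims), not components defined by the patent.

```python
def track_video_text(frames, detector, recognizer, fuser, matcher):
    """Wire together the four claimed steps; all callables are placeholders.

    frames:     video frames already extracted from the video (step 1).
    detector:   frame -> list of text regions.
    recognizer: (frame, region) -> (seq_feat, pos_feat, img_feat) (step 2).
    fuser:      three features -> fusion feature descriptor (step 3).
    matcher:    (descriptor, region, reference) -> updated reference (step 4).
    """
    reference = None
    for frame in frames:
        for region in detector(frame):
            seq_f, pos_f, img_f = recognizer(frame, region)
            descriptor = fuser(seq_f, pos_f, img_f)
            reference = matcher(descriptor, region, reference)
    return reference  # the tracking track of the text in the video
```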
PCT/CN2022/097877 2021-12-24 2022-06-09 Video text tracking method and electronic device WO2023115838A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111601711.0 2021-12-24
CN202111601711.0A CN114463376B (en) 2021-12-24 2021-12-24 Video text tracking method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023115838A1 true WO2023115838A1 (en) 2023-06-29

Family

ID=81408054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097877 WO2023115838A1 (en) 2021-12-24 2022-06-09 Video text tracking method and electronic device

Country Status (2)

Country Link
CN (1) CN114463376B (en)
WO (1) WO2023115838A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463376B (en) * 2021-12-24 2023-04-25 北京达佳互联信息技术有限公司 Video text tracking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533474A (en) * 2008-03-12 2009-09-16 三星电子株式会社 Character and image recognition system based on video image and method thereof
CN107545210A (en) * 2016-06-27 2018-01-05 北京新岸线网络技术有限公司 A kind of method of video text extraction
US20200320308A1 (en) * 2019-04-08 2020-10-08 Nedelco, Incorporated Identifying and tracking words in a video recording of captioning session
CN112954455A (en) * 2021-02-22 2021-06-11 北京奇艺世纪科技有限公司 Subtitle tracking method and device and electronic equipment
CN113392689A (en) * 2020-12-25 2021-09-14 腾讯科技(深圳)有限公司 Video character tracking method, video processing method, device, equipment and medium
CN114463376A (en) * 2021-12-24 2022-05-10 北京达佳互联信息技术有限公司 Video character tracking method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136479A1 (en) * 2018-01-08 2019-07-11 The Regents Of The University Of California Surround vehicle tracking and motion prediction
CN110874553A (en) * 2018-09-03 2020-03-10 杭州海康威视数字技术股份有限公司 Recognition model training method and device
CN109800757B (en) * 2019-01-04 2022-04-19 西北工业大学 Video character tracking method based on layout constraint
US11211053B2 (en) * 2019-05-23 2021-12-28 International Business Machines Corporation Systems and methods for automated generation of subtitles
CN111931571B (en) * 2020-07-07 2022-05-17 华中科技大学 Video character target tracking method based on online enhanced detection and electronic equipment
CN112101344B (en) * 2020-08-25 2022-09-06 腾讯科技(深圳)有限公司 Video text tracking method and device
CN112364863B (en) * 2020-10-20 2022-10-28 苏宁金融科技(南京)有限公司 Character positioning method and system for license document
CN112418216B (en) * 2020-11-18 2024-01-05 湖南师范大学 Text detection method in complex natural scene image
CN112818985A (en) * 2021-01-28 2021-05-18 深圳点猫科技有限公司 Text detection method, device, system and medium based on segmentation

Also Published As

Publication number Publication date
CN114463376A (en) 2022-05-10
CN114463376B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US10997443B2 (en) User identity verification method, apparatus and system
US20210233319A1 (en) Context-aware tagging for augmented reality environments
WO2018141232A1 (en) Image processing method, computer storage medium, and computer device
US11003896B2 (en) Entity recognition from an image
US11450027B2 (en) Method and electronic device for processing videos
US10552471B1 (en) Determining identities of multiple people in a digital image
US11335127B2 (en) Media processing method, related apparatus, and storage medium
US10839006B2 (en) Mobile visual search using deep variant coding
Wei et al. Wide area localization and tracking on camera phones for mobile augmented reality systems
CN112270686A (en) Image segmentation model training method, image segmentation device and electronic equipment
CN111639968A (en) Trajectory data processing method and device, computer equipment and storage medium
WO2023115838A1 (en) Video text tracking method and electronic device
CN113221983B (en) Training method and device for transfer learning model, image processing method and device
EP2776981A2 (en) Methods and apparatuses for mobile visual search
WO2023123923A1 (en) Human body weight identification method, human body weight identification device, computer device, and medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN111985531B (en) Method, device, equipment and storage medium for determining abnormal resource demand cluster
WO2021068524A1 (en) Image matching method and apparatus, computer device, and storage medium
US10834385B1 (en) Systems and methods for encoding videos using reference objects
Chen et al. Context-aware discriminative vocabulary learning for mobile landmark recognition
Zhang et al. Self-supervised pre-training on the target domain for cross-domain person re-identification
CN115205202A (en) Video detection method, device, equipment and storage medium
CN113421182A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
CN113591656A (en) Image processing method, system, device, equipment and computer storage medium
CN114611565A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22909171

Country of ref document: EP

Kind code of ref document: A1