WO2023115838A1 - Video text tracking method and electronic device - Google Patents

Video text tracking method and electronic device

Info

Publication number
WO2023115838A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
text
video frame
video
fusion
Prior art date
Application number
PCT/CN2022/097877
Other languages
French (fr)
Chinese (zh)
Inventor
李壮
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2023115838A1 publication Critical patent/WO2023115838A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a video text tracking method and an electronic device.
  • a video contains a lot of text, which can serve as an objective description of the targets in the video and a subjective summary of the scene.
  • the text in a video carries information such as start and end time, position change, and text content; correctly tracking the text in a video is therefore a key step in video understanding.
  • the disclosure provides a video text tracking method and an electronic device, which achieve a higher text tracking accuracy rate and reduce the consumption of computing resources during video text tracking.
  • the technical solution of the disclosure is as follows:
  • a video text tracking method, including:
  • extracting a plurality of video frames from a video; acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area; obtaining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the acquiring of the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area includes:
  • for each video frame, determining the text area from the video frame; encoding the position information of the text area to obtain the position feature; extracting the image feature from the video frame based on the position information of the text area; and decoding the image feature to obtain the character sequence feature.
  • the determining of the text area from the video frame includes:
  • performing feature extraction on the video frame to obtain a feature map corresponding to the video frame; determining a text area heat map corresponding to the feature map, where the text area heat map represents the text area and non-text area in the feature map;
  • performing connected domain analysis on the text area heat map to obtain the text area.
  • the extracting of the image feature from the video frame based on the position information of the text region includes: mapping the position information of the text region onto the feature map to obtain a mapping region corresponding to the text region; and extracting the image located in the mapping region from the feature map to obtain the image feature.
  • the obtaining of the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame includes: for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and adjusting the size of the fusion feature to obtain the fusion feature descriptor.
  • the reference tracking trajectory includes a reference text area
  • the plurality of video frames are continuous on the time axis
  • the updating of the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video includes:
  • traversing each video frame in sequence and, for each traversed current video frame, performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain a target tracking track corresponding to the current video frame; and determining the target tracking track as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking trajectory of the text in the video.
  • the updating of the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track includes:
  • when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking trajectory;
  • when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
  • the method is executed based on a text tracking network, and the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • a video text tracking device including:
  • An extraction module configured to extract a plurality of video frames from the video
  • the feature acquisition module is configured to obtain the text sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the corresponding image feature of the text area;
  • the descriptor acquisition module is configured to obtain the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature and the image feature of each video frame;
  • the tracking trajectory determination module is configured to update the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to characterize the position information of the text in the video.
  • the feature acquisition module includes:
  • a text area determination unit configured to, for each video frame, determine the text area from the video frame
  • a location feature acquisition unit configured to encode the location information of the text area to obtain the location feature
  • an image feature extraction unit configured to extract the image feature from the video frame based on the location information of the text region
  • the character sequence feature acquisition unit is configured to decode the image feature to obtain the character sequence feature.
  • the text area determination unit includes:
  • the feature map determination subunit is configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
  • the text area heat map determination subunit is configured to determine the text area heat map corresponding to the feature map; the text area heat map represents the text area and non-text area in the feature map;
  • the connected domain analysis subunit is configured to perform connected domain analysis on the text area heat map to obtain the text area.
  • the image feature extraction unit includes:
  • the mapping subunit is configured to map the position information of the text region to the feature map to obtain a mapping region corresponding to the text region;
  • the image feature extraction subunit is configured to extract the image located in the mapping area from the feature map to obtain the image feature.
  • the descriptor acquisition module includes:
  • the splicing unit is configured to, for each video frame, splice the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
  • the adjustment unit is configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
  • the reference tracking track includes a reference text area
  • the plurality of video frames are a plurality of video frames continuous on the time axis
  • the tracking track determination module includes:
  • the similarity matching unit is configured to traverse each video frame in sequence and, when traversing each video frame, perform the following operations: perform similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain a similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, update the reference tracking track to obtain the target tracking track corresponding to the current video frame; and determine the target tracking track as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the tracking track determination unit is configured to determine the target tracking track corresponding to the last video frame as the tracking track of the text in the video.
  • the similarity matching unit is configured to update the reference text area based on the current text area when the similarity matching result is smaller than a preset similarity threshold, and based on the The current fusion feature descriptor updates the reference fusion feature descriptor to obtain the target tracking trajectory.
  • the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • an electronic device including:
  • a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the processor is configured to execute the instructions to implement the following steps:
  • the image feature is decoded to obtain the character sequence feature.
  • the processor is configured to execute the instructions to implement the following steps:
  • the text area heat map represents the text area and non-text area in the feature map
  • Connected domain analysis is performed on the text area heat map to obtain the text area.
  • the processor is configured to execute the instructions to implement the following steps:
  • the processor is configured to execute the instructions to implement the following steps:
  • the character sequence feature, the position feature and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame;
  • the reference tracking track includes a reference text area
  • the plurality of video frames are continuous on the time axis
  • the processor is configured to execute the instructions to implement the following steps:
  • the current fusion feature descriptor corresponding to the traversed current video frame is matched for similarity against the reference fusion feature descriptor corresponding to the reference text area to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame; and the target tracking track is determined as the updated reference tracking track; the sequence represents the order of each video frame on the time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the video.
  • the processor is configured to execute the instructions to implement the following steps:
  • the processor is configured to execute the instructions to implement the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the step of obtaining the character sequence features of the text in each video frame, the position features of the text region where the text is located, and the image features corresponding to the text region;
  • the multi-information fusion descriptor branch network is used to perform the step of obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, where the text area heat map represents a text area and a non-text area in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
  • a computer-readable storage medium is provided, and when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is made to perform the following steps:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • a computer program product including a computer program, and when the computer program is executed by a processor, the following steps are implemented:
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • in the embodiments of the present disclosure, a plurality of video frames are extracted from the video, and the fusion feature descriptor corresponding to each video frame is obtained according to the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area;
  • the stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video.
  • since the determination of the tracking trajectory of the text in the video fully considers the character sequence features, position features, and image features of the text, the accuracy of text tracking is relatively high; in addition, since the disclosure updates the reference tracking trajectory according to the fusion feature descriptor to obtain the tracking track of the text in the video, it avoids processing each frame of image through multiple models (text detection, text recognition, and other models), reducing the consumption of computing resources during video text tracking.
  • Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment.
  • Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
  • Fig. 3 is a flow chart of determining character sequence features, location features and image features according to an exemplary embodiment.
  • Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment.
  • Fig. 5 is a schematic diagram of a character tracking network according to an exemplary embodiment.
  • Fig. 6 is a flow chart of obtaining fusion feature descriptors through a character tracking network according to an exemplary embodiment.
  • Fig. 7 is a flow chart of updating a reference tracking track to obtain a tracking track of text in a video according to an exemplary embodiment.
  • Fig. 8 is a block diagram of a video text tracking device according to an exemplary embodiment.
  • Fig. 9 is a block diagram showing an electronic device for video text tracking according to an exemplary embodiment.
  • the method, device, and storage medium involved in the present disclosure can obtain relevant information of the user.
  • the video is usually processed into multiple frames of images, and then text detection and text recognition are performed on each frame of images one by one to obtain video-level text tracking results.
  • as a result, the methods in the related art yield a lower accuracy rate for video text tracking.
  • in contrast, the embodiments of the present disclosure fully consider character sequence features, position features, and image features of the text, thereby improving the accuracy of text tracking.
  • Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes at least a terminal 01 and a server 02.
  • the terminal 01 is used to collect video to be processed.
  • the terminal 01 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle terminal, a smart TV, a smart voice interaction device, etc., but is not limited thereto.
  • the terminal 01 and the server 02 are directly or indirectly connected through wired or wireless communication, which is not limited in this disclosure.
  • the server 02 is used to receive the video sent by the terminal 01, extract a plurality of video frames from the video, and obtain the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area;
  • the fusion feature descriptor corresponding to each video frame is determined according to the above-mentioned character sequence feature, position feature, and image feature of each video frame, and the stored reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the above video.
  • the server 02 is an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
  • FIG. 1 is only an implementation environment of the video text tracking method provided by the embodiment of the present application, and other implementation environments may also be included in practical applications.
  • Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
  • the method is used in an implementation environment including the terminal and the server shown in Fig. 1: as described above, after the terminal collects the video to be processed, the server executes the video text tracking method on the video to be processed.
  • the method includes the following steps S11, S13, S15 and S17.
  • the plurality of video frames are consecutive video frames on the time axis. That is, the video includes multiple video frames arranged sequentially on the time axis, and the multiple video frames extracted by the server from the video are adjacent on the time axis. For example, the server selects a time period from the video and extracts the multiple video frames within that time period, that is, multiple consecutive video frames.
  • multiple video frames are extracted from the video in various ways, which are not specifically limited in the present disclosure.
  • FFmpeg is used to extract consecutive video frames from the video.
  • the video to be processed is divided into n temporally continuous image frames, denoted as Frame-1, ..., Frame-t, ..., Frame-n, where Frame-t represents the video frame whose temporal position in the video is t, and n is a positive integer greater than 1.
  • FFmpeg is a set of open-source computer programs that can record and convert digital audio and video and turn them into streams; it provides a complete solution for recording, converting, and streaming audio and video.
  • the video is processed into a plurality of continuous video frames on the time axis through other video stream decoding technologies.
  • the continuous video frames are Frame-1, ... Frame-t, ..., Frame-n
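  • As an illustrative sketch only (not part of the disclosure), the frame extraction described above can be reproduced with the FFmpeg command-line tool; the input path, output directory, and PNG format below are hypothetical choices.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str) -> None:
    """Decode a video into temporally ordered frames Frame-1 ... Frame-n."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # ffmpeg numbers the output images sequentially, preserving time-axis order.
    subprocess.run(
        ["ffmpeg", "-i", video_path, f"{out_dir}/Frame-%d.png"],
        check=True,
    )

extract_frames("input.mp4", "frames")  # hypothetical paths
```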
  • the character sequence features, position features and image features corresponding to each of the multiple consecutive video frames are obtained in parallel.
  • the text area where the text is located is a text area at the text line level; that is, the text area contains a line of text, and the characters in the line are ordered. Therefore, the character sequence feature represents the order of the text in the text area, and is obtained from a feature map intercepted from the corresponding video frame through the coordinates of the text.
  • the location feature of the text area where the text is located is the position coordinates of the text area.
  • the position coordinates of the text area are the coordinates of the upper left corner, the lower left corner, the upper right corner and the lower right corner of the text area.
  • the image feature corresponding to the text area is the feature of the image corresponding to the text area in the video frame.
  • Fig. 3 is a flowchart showing a method for determining character sequence features, location features and image features according to an exemplary embodiment.
  • the above-mentioned acquisition of the character sequence features of the text in each video frame, the positional features of the text area where the text is located, and the image features corresponding to the above text area include the following steps S1301, S1303, S1305 and S1307:
  • the above-mentioned text area is determined from each of the above-mentioned video frames.
  • the text area where the text in each video frame is located is determined from each video frame sequentially, according to the order of the video frames on the time axis. For example, if the continuous video frames are Frame-1, ..., Frame-t, ..., Frame-n, then the text area where the text in Frame-1 is located is determined from Frame-1, ..., the text area where the text in Frame-t is located is determined from Frame-t, ..., and the text area where the text in Frame-n is located is determined from Frame-n.
  • the text regions where the texts in the video frames are respectively located are determined from the video frames in parallel.
  • the text area is a text box corresponding to the text in the video frame.
  • the corresponding text area is determined from each video frame by a text frame recognition tool.
  • the corresponding text area is determined from each video frame through a text tracking network.
  • Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment. As shown in FIG. 4, in the above S1301, the above-mentioned determination of the above-mentioned text area from each of the above-mentioned video frames includes the following steps S13011, S13013 and S13015:
  • the above steps S13011 , S13013 and S13015 are executed sequentially for each video frame according to the order of each video frame on the time axis.
  • the above steps S13011, S13013 and S13015 are executed in parallel for multiple video frames.
  • Fig. 5 is a schematic diagram of a character tracking network according to an exemplary embodiment.
  • the text tracking network includes a detection branch network, which includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network; the feature detection sub-network further includes two convolutional layers and a global average pooling layer (global avg pooling).
  • the basic feature extraction is performed on each video frame through the feature extraction sub-network (for example, a classic convolutional network based on resnet18) to obtain the basic feature of each video frame, which can be understood as a feature map of size Batch*64*W*H.
  • the batch size is a hyperparameter used to define the number of samples to be processed before updating the internal model parameters
  • W is the width
  • H is the height.
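  • A minimal sketch of such a feature extraction sub-network, assuming a torchvision resnet18 backbone truncated after its first residual stage so the output keeps 64 channels; the truncation point and input size are assumptions, as the disclosure only names resnet18 as an example.

```python
import torch
from torch import nn
from torchvision.models import resnet18

resnet = resnet18(weights=None)
# Truncate after layer1 so the output keeps 64 channels, matching the
# Batch*64*W*H basic feature described above (spatial size is downsampled 4x).
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
)

frames = torch.randn(2, 3, 256, 448)   # a batch of 2 RGB video frames
basic_features = backbone(frames)
print(basic_features.shape)             # torch.Size([2, 64, 64, 112])
```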
  • the first feature fusion sub-network (for example, two stacked Feature Pyramid Enhancement Modules (FPEM)) performs multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to each video frame.
  • FPEM is used to enhance the features extracted by the convolutional neural network, in order to make the extracted features more robust.
  • through the two convolutional layers and the global average pooling layer in the feature detection sub-network, the feature map of each video frame is processed into a text area heat map that indicates whether each region is text or not.
  • the text area heat map includes two parts: one is the text area and the other is the non-text area.
  • the connected domain analysis is performed on the text area heat map to obtain the position of the text area, and then the position of the text area is fitted to a rotated rectangle to obtain the text area corresponding to each video frame.
  • the connected domain refers to the adjacent regions with the same pixel value in the image.
  • Connected domain analysis refers to finding and marking the connected domains in the image.
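  • A sketch of this post-processing step using OpenCV connected-component analysis and rotated-rectangle fitting; the single-channel probability heat map and the 0.5 binarization threshold are assumptions.

```python
import cv2
import numpy as np

def heatmap_to_text_boxes(heatmap: np.ndarray) -> list[np.ndarray]:
    """Binarize a text-area heat map, find connected domains, and fit each
    one with a rotated rectangle (box-1 ... box-m)."""
    binary = (heatmap > 0.5).astype(np.uint8)          # text vs. non-text
    num_labels, labels = cv2.connectedComponents(binary)
    boxes = []
    for label in range(1, num_labels):                 # label 0 is background
        ys, xs = np.where(labels == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(points)                 # fit a rotated rectangle
        boxes.append(cv2.boxPoints(rect))              # its 4 corner coordinates
    return boxes
```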
  • At least one text area can be detected in each video frame through the above method, denoted as box-1 ... box-m, for a total of m text areas (m is a positive integer greater than or equal to 1):
  • Frame-1 detects multiple boxes
  • ... Frame-t detects multiple boxes
  • Frame-n detects multiple boxes.
  • the text area can be accurately determined from each video frame, and the determination process of the text area does not need to rely on multiple models (text detection, text recognition, other models, etc.), which can reduce the consumption of computing resources during the calculation process of the text area.
  • the position information of the text area in each video frame is sequentially encoded.
  • the location information of text regions in multiple video frames is encoded in parallel.
  • position encoding is performed on the position information of the text area (that is, the coordinates of the upper left, lower left, upper right, and lower right corners of the text area) to obtain the position feature of the text area; the position feature is 1*128-dimensional.
  • the position encoding includes but not limited to: cosine position encoding (cos encoding), sine position encoding (sin encoding) and the like.
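  • A minimal sketch of sinusoidal position encoding over the four corner coordinates: encoding 8 scalar coordinates with 8 sin/cos frequency pairs each yields the 1*128-dimensional position feature mentioned above. The frequency schedule is an assumption.

```python
import numpy as np

def encode_box_position(corners: np.ndarray) -> np.ndarray:
    """corners: (4, 2) array of (x, y) for the upper-left, lower-left,
    upper-right and lower-right corners. Returns a 1*128 position feature."""
    coords = corners.reshape(-1)                       # 8 scalar coordinates
    freqs = 1.0 / (10000.0 ** (np.arange(8) / 8.0))    # 8 frequencies (assumed)
    angles = coords[:, None] * freqs[None, :]          # (8, 8)
    # sin and cos per coordinate/frequency pair: 8 * 8 * 2 = 128 dimensions
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).reshape(1, 128)

box = np.array([[10, 20], [10, 60], [200, 20], [200, 60]], dtype=np.float32)
print(encode_box_position(box).shape)   # (1, 128)
```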
  • since each video frame includes at least one text area, for the text areas belonging to the same video frame, the location features corresponding to the text areas are determined sequentially according to the order of the text areas in the video frame. For example, if the text areas of a certain video frame are box-1 ... box-m, the location feature of box-1 is determined first, ..., and the location feature of box-m is determined last.
  • the image features corresponding to each text area are extracted from each video frame according to the position information of the text area.
  • the image features of the respective corresponding text regions are extracted from each video frame according to the position information of the text regions in each video frame.
  • the image features of the corresponding text regions are extracted from the multiple video frames in parallel.
  • since each video frame includes at least one text area, for the text areas belonging to the same video frame, the image features corresponding to the text areas are determined sequentially according to the order of the text areas in the video frame.
  • the above-mentioned image features are extracted from each of the above-mentioned video frames based on the position information of the above-mentioned text area, including:
  • the position mapping result is the mapping area corresponding to the text area.
  • the position information of the text area of each video frame is mapped onto the feature map corresponding to that video frame to obtain the position mapping result (i.e., a mapping area) of the text area; the image located in the position mapping result is intercepted from the feature map of each video frame to obtain the image feature corresponding to the text area of each video frame.
  • the above location features and image features are extracted through the above text tracking network.
  • the above-mentioned text tracking network also includes a recognition branch network, through which the corresponding location feature and image feature are extracted from each video frame according to the location information of the text region.
  • the determination process of image features does not need to rely on multiple models (text detection, text recognition, other models, etc.), which reduces the consumption of computing resources in the calculation process of image features.
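  • One way to realize this mapping and interception, sketched with torchvision's roi_align; treating the mapping as an axis-aligned RoIAlign crop at the feature map's 4x downsampling scale is an assumption, since the disclosure only states that the position information is mapped onto the feature map and the covered image is intercepted.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 64, 64, 112)      # Batch*64*H*W feature map
# One text box per row: (batch_index, x1, y1, x2, y2) in frame coordinates.
boxes = torch.tensor([[0, 40.0, 80.0, 800.0, 240.0]])
# spatial_scale maps frame coordinates onto the 4x-downsampled feature map.
image_feature = roi_align(feature_map, boxes, output_size=(8, 32), spatial_scale=0.25)
print(image_feature.shape)                      # torch.Size([1, 64, 8, 32])
```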
  • each text area of each video frame is a text area at the text line level; that is, a line of text is included in the text area, and each character has a corresponding position and order. Therefore, after the image features corresponding to the text areas of each video frame are obtained, the image features are decoded to obtain the character sequence features corresponding to the text areas of each video frame.
  • the image features corresponding to each text area in each video frame are decoded sequentially to obtain the text sequence corresponding to each text area in each video frame feature.
  • the image features corresponding to the text regions corresponding to the plurality of video frames are decoded in parallel to obtain the character sequence features corresponding to the text regions in each video frame.
  • the above-mentioned image features are decoded through the above-mentioned character tracking network to obtain the above-mentioned character sequence features.
  • the above-mentioned text tracking network also includes a recognition branch network, which decodes the image features intercepted for each text region of each video frame (for example, by connectionist temporal classification decoding (CTC decoding)), so as to obtain the character sequence features of each text area of each video frame.
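  • A sketch of greedy CTC decoding (take the best class per time step, collapse repeats, drop blanks), assuming the recognition branch outputs per-time-step class probabilities; the tiny alphabet is hypothetical.

```python
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """probs: (T, C) per-time-step class probabilities over blank + alphabet."""
    best = probs.argmax(axis=1)                # best class per time step
    decoded, prev = [], blank
    for label in best:
        if label != blank and label != prev:   # collapse repeats, skip blanks
            decoded.append(alphabet[label - 1])
        prev = label
    return "".join(decoded)

# 5 time steps over blank + "ab": reads "ab" after collapsing.
probs = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(probs, "ab"))   # -> "ab"
```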
  • the above-mentioned text area is determined from each video frame, and the position feature, image feature, and character sequence feature of the text area are determined according to the position information of the text area, which can improve the determination accuracy of the position feature, image feature, and character sequence feature, thereby improving the determination accuracy of the fusion feature descriptor and, in turn, the accuracy of video text tracking; in addition, the determination process of the position features, image features, and character sequence features does not need to rely on multiple models (text detection, text recognition, and other models), which reduces the consumption of computing resources in determining these features.
  • the character sequence features, position features, and image features corresponding to the text areas in each video frame are fused sequentially to obtain the fusion feature descriptors corresponding to the text areas in each video frame.
  • the text sequence features corresponding to each text area in multiple video frames, the above-mentioned position features, and the above-mentioned image features are fused in parallel to obtain a fusion feature descriptor corresponding to each text area in each video frame .
  • the features of each video frame are fused in various ways to obtain the fusion feature descriptor corresponding to each video frame, which is not specifically limited in the present disclosure.
  • the above-mentioned three features are fused through the above-mentioned character tracking network to obtain a fused feature descriptor corresponding to each video frame.
  • the above text tracking network further includes a multi-information fusion descriptor branch network
  • the multi-information fusion descriptor branch network further includes a second feature fusion sub-network and a feature size adjustment sub-network.
  • FIG. 6 is a flow chart of obtaining the fusion feature descriptor through the text tracking network according to an exemplary embodiment.
  • the above-mentioned fusion feature descriptor corresponding to each video frame is obtained according to the above-mentioned character sequence feature, the above-mentioned position feature and the above-mentioned image feature of each video frame, including the following steps S1501 and S1503:
  • the above-mentioned character sequence feature, the above-mentioned position feature and the above-mentioned image feature are concatenated to obtain the fusion feature corresponding to the video frame.
  • the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the character sequence features, position features, and image features corresponding to the text areas of each video frame are spliced to obtain the fusion feature corresponding to each text area of each video frame.
  • the dimension of the fusion feature is 3*128 dimensions.
  • the output size of the fusion feature corresponding to each text region of each video frame is adjusted by a feature size adjustment sub-network (for example, two multilayer perceptron (MLP) layers), for example from 3*128 dimensions to 1*128 dimensions, to obtain the fusion feature descriptor corresponding to each text area of each video frame.
  • the multilayer perceptron is a feedforward artificial neural network that includes an input layer, an output layer, and multiple hidden layers.
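  • A minimal sketch of the splicing and size adjustment, assuming each of the three features has already been reduced to 1*128 dimensions; the hidden width of the two perceptron layers is an assumption.

```python
import torch
from torch import nn

class FusionDescriptorHead(nn.Module):
    """Splice character sequence, position and image features (3*128),
    then adjust the size to a 1*128 fusion feature descriptor."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),   # two MLP layers (assumed widths)
            nn.Linear(dim, dim),
        )

    def forward(self, seq_feat, pos_feat, img_feat):
        fused = torch.cat([seq_feat, pos_feat, img_feat], dim=-1)  # 3*128 -> 384
        return self.mlp(fused)                                     # -> 1*128

head = FusionDescriptorHead()
f = torch.randn(1, 128)
print(head(f, f, f).shape)   # torch.Size([1, 128])
```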
  • the multi-information fusion descriptor branch network also includes a position feature extraction sub-network, an image feature extraction sub-network, and a character sequence feature extraction sub-network; the character sequence features are extracted through the character sequence feature extraction sub-network, the location features through the location feature extraction sub-network, and the image features through the image feature extraction sub-network.
  • feature extraction, feature recognition, and feature fusion are performed through the above-mentioned end-to-end text tracking network to obtain the fusion feature descriptor, so that the tracking track of the video text is obtained through the fusion feature descriptor and the video text tracking trajectory is acquired through a single model; the determination accuracy and robustness of the video text tracking trajectory are high, and the computing resource consumption is small.
  • each video frame corresponds to at least one text region
  • the fusion feature descriptors corresponding to the text regions in each video frame are determined sequentially according to the order of the text regions in the video frame. For example, if the text areas of a certain video frame are box-1 ... box-m, the fusion feature descriptor corresponding to box-1 is determined first, ..., and the fusion feature descriptor corresponding to box-m is determined last.
  • the fusion feature is obtained by concatenating the character sequence features, position features, and image features of each text area of each video frame, and the size of the fusion feature is adjusted to obtain the fusion feature descriptor, so that the determination of the fusion feature descriptor fully considers the character sequence features, position features, and image features of the text; the determination accuracy of the fusion feature descriptor is therefore high, which improves the accuracy of text tracking. In addition, the determination of the fusion feature descriptor does not need to rely on multiple models (text detection, text recognition, and other models), reducing the consumption of computing resources in determining the fusion feature descriptor.
  • the stored reference tracking track is updated to obtain the tracking track of the text in the video; the tracking track is used to represent the position information of the text.
  • a reference tracking track used to characterize the position information of the text is set, and after obtaining the fusion feature descriptor of the text area of each video frame, the fusion feature descriptor of the text area in each video frame is used , update the reference fusion feature descriptor in the reference tracking track, and obtain the tracking track of the text in the video.
  • the above update includes but not limited to: modification, replacement, addition, deletion and so on.
  • Fig. 7 is a flow chart of updating the reference tracking track to obtain the tracking track of the text in the video according to an exemplary embodiment.
  • the reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame above, and the tracking track of the text in the above video is obtained, including the following steps S1701 and S1703:
  • the target tracking track corresponding to the last video frame is used as the tracking track of the text in the above-mentioned video.
  • the video frames and the text in the video frames are sequential, so the updates are performed sequentially.
  • the continuous video frames are Frame-1, ... Frame-t, ..., Frame-n, and each video frame corresponds to a plurality of text areas box-1, ... box-t, ... box-m.
  • Each text region corresponds to a fused feature descriptor.
  • the reference tracking trajectory is empty during initialization.
  • when Frame-1 is the current video frame, since the reference tracking track is empty, the multiple text areas box-1, ..., box-t, ..., box-m of Frame-1 and their respective fusion feature descriptors are directly initialized as the reference tracking trajectory; target tracking trajectory 1 corresponding to Frame-1 is thus obtained, and target tracking trajectory 1 is determined as the updated reference tracking trajectory.
  • when Frame-2 is the current video frame, since the multiple text regions of Frame-1 and their corresponding fusion feature descriptors have already been stored in tracking track 1, the similarities between the fusion feature descriptors corresponding to the multiple text regions of Frame-2 and the existing fusion feature descriptors of tracking track 1 are calculated to obtain the similarity matching result corresponding to Frame-2.
  • based on the similarity matching result, the current fusion feature descriptors corresponding to the multiple text regions of Frame-2, and the multiple text regions of Frame-2, tracking track 1 is updated to obtain target tracking trajectory 2 corresponding to Frame-2, and target tracking trajectory 2 is determined as the updated reference tracking trajectory.
  • the fusion feature descriptors of the text regions of each video frame are sequentially matched against the reference fusion feature descriptors in the reference tracking trajectory, and the reference fusion feature descriptors in the reference tracking trajectory are sequentially updated according to the similarity results of each video frame, so that the final tracking trajectory fully considers the character sequence features, position features, and image features of the text, and the accuracy of text tracking is high; in addition, the determination of the tracking trajectory is an end-to-end process (for example, text tracking is achieved using an end-to-end text tracking network), which does not need to rely on multiple models (text detection, text recognition, and other models), reducing the computing resources consumed in determining the tracking trajectory.
  • performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor in the reference tracking track to obtain the similarity matching result corresponding to the current video frame includes:
  • calculating a similarity confusion matrix from the current fusion feature descriptor and the reference fusion feature descriptor to obtain the similarity matching result.
  • the similarity confusion matrix is calculated to obtain the similarities, and the matching with the highest similarity is then selected through the Hungarian algorithm to obtain the similarity matching result.
  • each video frame corresponds to multiple text areas box-1, ... box-t, ... box-m.
  • Each text region corresponds to a fused feature descriptor.
  • the reference tracking trajectory is empty during initialization.
  • when Frame-1 is the current video frame, since the reference tracking track is empty, the multiple text areas box-1, box-2, and box-3 of Frame-1 and their corresponding fusion feature descriptors are directly added to the reference tracking trajectory to obtain target tracking trajectory 1 corresponding to Frame-1, and target tracking trajectory 1 is determined as the updated reference tracking trajectory.
  • when Frame-2 is the current video frame, assuming that Frame-2 includes box-1 and box-2, the similarity confusion matrix is calculated between the fusion feature descriptors of box-1 and box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1. This yields the similarities between the fusion feature descriptor of box-1 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, 3 similarities for box-1 of Frame-2), and the similarities between the fusion feature descriptor of box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, 3 similarities for box-2 of Frame-2).
  • the Hungarian matching algorithm is used to select the highest of the three similarities of box-1 of Frame-2 as the similarity matching result of box-1 of Frame-2, and likewise to select the highest of the three similarities of box-2 of Frame-2 as the similarity matching result of box-2 of Frame-2.
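  • A sketch of the similarity confusion matrix plus Hungarian matching using scipy's linear_sum_assignment (which minimizes cost, so the negated similarity matrix is passed); cosine similarity is an assumption, as the disclosure does not name the similarity measure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_descriptors(current: np.ndarray, reference: np.ndarray):
    """current: (m, 128) descriptors of the current frame's text areas;
    reference: (k, 128) descriptors stored in the reference tracking track.
    Returns (current, reference) index pairs and the confusion matrix."""
    cur = current / np.linalg.norm(current, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = cur @ ref.T                         # (m, k) confusion matrix
    rows, cols = linear_sum_assignment(-similarity)  # best one-to-one pairing
    return list(zip(rows, cols)), similarity

pairs, sim = match_descriptors(np.random.rand(2, 128), np.random.rand(3, 128))
print(pairs)
```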
  • updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame includes:
  • updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame further includes:
  • adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
  • the text area in the current video frame may be a part of the existing track, or it may be a new text area different from the existing track.
  • the similarity matching result corresponding to the current video frame is compared with a preset similarity threshold.
  • if the similarity matching result is smaller than the preset similarity threshold, the reference fusion feature descriptor is updated with the current fusion feature descriptor, so that the current text region and the current fusion feature descriptor are updated into the existing trajectory; the text region belonging to the existing trajectory is thereby accurately determined, improving the determination precision of the tracking trajectory of the text in the video.
  • if the similarity matching result corresponding to the current video frame is greater than or equal to the preset similarity threshold, it indicates that the current text area is not part of the existing track but is new text; the current text area and the current fusion feature descriptor are then added to the reference tracking trajectory to update it, thereby accurately determining the text area that does not belong to the existing trajectory and further improving the determination accuracy of the text tracking trajectory in the video.
  • through the similarity confusion matrix, the similarities between the fusion feature descriptor of box-1 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 are obtained (that is, box-1 of Frame-2 corresponds to 3 similarities), as are the similarities between the fusion feature descriptor of box-2 of Frame-2 and the fusion feature descriptors corresponding to box-1, box-2, and box-3 in tracking track 1 (that is, box-2 of Frame-2 corresponds to 3 similarities).
  • the highest similarity selected by the Hungarian matching algorithm from the three similarities corresponding to box-1 of Frame-2 is the similarity obtained by matching against box-1 in tracking track 1, and this similarity is greater than or equal to the preset similarity threshold, indicating that box-1 of Frame-2 is part of the existing track (box-1 in tracking track 1);
  • the current fusion feature descriptor of box-1 of Frame-2 is therefore used to update the reference fusion feature descriptor of box-1 in tracking trajectory 1, so that box-1 of Frame-2 and its corresponding fusion feature descriptor are updated into tracking track 1.
  • the highest similarity selected by the Hungarian matching algorithm from the three similarities corresponding to box-2 of Frame-2 is the similarity obtained by matching against box-2 in tracking track 1, and this similarity is less than the preset similarity threshold, indicating that box-2 of Frame-2 is not part of the existing trajectory but is new text; box-2 of Frame-2 and its corresponding fusion feature descriptor are therefore initialized into tracking trajectory 1.
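  • A sketch of the update step following the worked example above, where a matched similarity at or above the threshold refreshes an existing track entry and anything unmatched starts a new entry; the track data structure and threshold value are hypothetical, and note that the claim language earlier in this document states the threshold comparison in the opposite sense.

```python
SIM_THRESHOLD = 0.7  # preset similarity threshold (assumed value)

def update_reference_track(track, boxes, descriptors, pairs, similarity):
    """track: list of {'box': ..., 'descriptor': ...} entries (hypothetical
    structure); pairs and similarity come from match_descriptors above."""
    matched = set()
    for cur_idx, ref_idx in pairs:
        if similarity[cur_idx, ref_idx] >= SIM_THRESHOLD:
            # Part of an existing trajectory: refresh its region and descriptor.
            track[ref_idx]["box"] = boxes[cur_idx]
            track[ref_idx]["descriptor"] = descriptors[cur_idx]
            matched.add(cur_idx)
    for cur_idx in range(len(boxes)):
        if cur_idx not in matched:
            # New text: initialize it into the reference tracking trajectory.
            track.append({"box": boxes[cur_idx], "descriptor": descriptors[cur_idx]})
    return track  # the target tracking track, i.e. the updated reference track
```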
  • Fig. 8 is a block diagram of a video text tracking device according to an exemplary embodiment.
  • the device includes an extraction module 21 , a feature acquisition module 23 , a descriptor acquisition module 25 and a track determination module 27 .
  • the extraction module 21 is configured to extract a plurality of video frames from the video
  • the feature acquisition module 23 is configured to acquire the character sequence feature of the text in each video frame, the position feature of the text area where the above text is located, and the image feature corresponding to the above text area;
  • the descriptor acquisition module 25 is configured to obtain the fusion feature descriptor corresponding to each of the video frames according to the above-mentioned character sequence features, the above-mentioned position features, and the above-mentioned image features of each of the above-mentioned video frames;
  • the tracking trajectory determination module 27 is configured to update the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to characterize the position information of the text in the video.
  • the feature acquisition module 23 includes:
  • a text area determination unit configured to, for each video frame, determine the text area from the video frame
  • the location feature acquisition unit is configured to encode the location information of the above-mentioned text area to obtain the above-mentioned location feature;
  • the image feature extraction unit is configured to extract the above image features from the above video frame based on the position information of the above text area;
  • the character sequence feature acquisition unit is configured to decode the above-mentioned image feature to obtain the above-mentioned character sequence feature.
  • the above text area determination unit includes:
  • the feature map determination subunit is configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
  • the text area heat map determination subunit is configured to determine the text area heat map corresponding to the feature map; the text area heat map represents the text area and non-text area in the feature map;
  • the connected domain analysis subunit is configured to perform connected domain analysis on the heat map of the text area to obtain the text area.
  • the above-mentioned image feature extraction unit includes:
  • the mapping subunit is configured to map the position information of the above-mentioned text area to the above-mentioned feature map, and obtain the mapping area corresponding to the above-mentioned text area;
  • the image feature extraction subunit is configured to extract the image located in the above-mentioned mapping area from the above-mentioned feature map to obtain the above-mentioned image features.
  • the above-mentioned descriptor acquisition module 25 includes:
  • the splicing unit is configured to, for each video frame, splice the above-mentioned text sequence feature, the above-mentioned position feature, and the above-mentioned image feature of the above-mentioned video frame to obtain the fusion feature corresponding to the above-mentioned video frame;
  • the adjustment unit is configured to adjust the size of the above-mentioned fusion feature to obtain the above-mentioned fusion feature descriptor.
  • the above-mentioned reference tracking track includes a reference text area
  • the above-mentioned multiple video frames are multiple video frames continuous on the time axis
  • the above-mentioned tracking track determination module 27 includes:
  • the similarity matching unit is configured to traverse each of the above video frames in sequence and, when traversing each video frame, perform the following operations: perform similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, update the reference tracking trajectory to obtain the target tracking trajectory corresponding to the current video frame; and determine the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis;
  • the tracking track determination unit is configured to determine the target tracking track corresponding to the last video frame as the tracking track of the text in the video.
  • the similarity matching unit is configured to, when the similarity matching result is smaller than the preset similarity threshold, update the reference text area based on the current text area and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
  • the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
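Read together, the two branches above suggest that the similarity matching result behaves like a distance (smaller means a closer match). A hedged sketch of one traversal step under that reading follows; the cosine-distance metric and the threshold value are assumptions, as the disclosure does not name a specific metric.

```python
import numpy as np

def traversal_step(track: dict, cur_box, cur_desc: np.ndarray, thresh: float = 0.3) -> dict:
    """Update the reference tracking trajectory with one video frame's result.

    track holds parallel lists "boxes" and "descriptors"; the last entries play
    the role of the reference text area and reference fusion feature descriptor.
    """
    ref_desc = track["descriptors"][-1]
    cos = float(cur_desc @ ref_desc) / (np.linalg.norm(cur_desc) * np.linalg.norm(ref_desc) + 1e-8)
    result = 1.0 - cos  # distance-style similarity matching result
    if result < thresh:   # close match: refresh the reference entry
        track["boxes"][-1] = cur_box
        track["descriptors"][-1] = cur_desc
    else:                 # otherwise: append as a new entry of the trajectory
        track["boxes"].append(cur_box)
        track["descriptors"].append(cur_desc)
    return track
```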
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to perform the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character region where the characters are located, and the corresponding image features of the character region;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the feature extraction sub-network is used to extract the basic features of the video frame for each video frame to obtain the basic features of the video frame;
  • the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
  • the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map; connected domain analysis is performed on the text area heat map to obtain the text region corresponding to the video frame.
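The following PyTorch sketch illustrates one plausible shape for such a detection branch: per-frame backbone features are fused across scales and reduced to a single-channel text/non-text heat map. The channel counts and the bilinear fusion scheme are assumptions for illustration only, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionBranch(nn.Module):
    """Backbone features -> multi-scale fusion -> text-region heat map (toy version)."""

    def __init__(self, in_channels=(64, 128, 256), fused_channels: int = 64):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, fused_channels, 1) for c in in_channels)
        self.head = nn.Conv2d(fused_channels, 1, 3, padding=1)

    def forward(self, feats):
        # feats: list of backbone maps ordered finest-first; fuse at the finest scale.
        size = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(lat(f), size=size, mode="bilinear", align_corners=False)
            for lat, f in zip(self.laterals, feats)
        )
        return torch.sigmoid(self.head(fused))  # per-pixel text probability
```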
  • an electronic device including a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of each video frame;
  • the stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps: for each video frame, the character sequence feature, the position feature, and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame; the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the above-mentioned reference tracking trajectory includes a reference text area
  • the above-mentioned multiple video frames are continuous on the time axis
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • the above-mentioned processor is configured to execute the above-mentioned instructions to implement the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • the electronic device is a terminal, a server or a similar computing device.
  • FIG. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment.
  • the electronic device 30 may vary considerably due to differences in configuration or performance, and may include one or more Central Processing Units (CPUs) 31 (the central processing unit 31 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like), a memory 33 for storing data, and one or more storage media 32 for storing application programs 323 or data 322 (for example, one or more mass storage devices).
  • the memory 33 and the storage medium 32 may be temporary storage or persistent storage.
  • the program stored in the storage medium 32 may include one or more modules, and each module may include a series of instructions to operate on the electronic device. Further, the central processing unit 31 may be configured to communicate with the storage medium 32 , and execute a series of instruction operations in the storage medium 32 on the electronic device 30 .
  • the electronic device 30 can also include one or more power supplies 36, one or more wired or wireless network interfaces 35, one or more input and output interfaces 34, and/or, one or more operating systems 321, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the input/output interface 34 may be used to receive or send data via a network.
  • a specific example of the above network may include a wireless network provided by the communication provider of the electronic device 30.
  • in one example, the input/output interface 34 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • in another example, the input/output interface 34 may be a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • FIG. 9 is only for illustration, and it does not limit the structure of the above-mentioned electronic device.
  • electronic device 30 may also include more or fewer components than shown in FIG. 9 , or have a different configuration than that shown in FIG. 9 .
  • a computer-readable storage medium is also provided, and when instructions in the computer-readable storage medium are executed by a processor of the electronic device, the electronic device is made to perform the following steps:
  • based on the fusion feature descriptor corresponding to each video frame, the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps: for each video frame, the character sequence feature, the position feature, and the image feature of the video frame are spliced to obtain the fusion feature corresponding to the video frame; the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
  • the above-mentioned reference tracking trajectory includes a reference text area, and the above-mentioned multiple video frames are continuous on the time axis; when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is made to perform the following steps:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • a computer program product including a computer program, which implements the following steps when the computer program is executed by a processor:
  • the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of each video frame;
  • the stored reference tracking trajectory is updated to obtain the tracking trajectory of the text in the video; the tracking trajectory is used to represent the position information of the text in the video.
  • the computer program implements the following steps when executed by a processor:
  • the above-mentioned image features are decoded to obtain the above-mentioned character sequence features.
  • the computer program implements the following steps when executed by a processor:
  • the above text area heat map represents the text area and non-text area in the above feature map
  • Connected domain analysis is performed on the above text area heat map to obtain the above text area.
  • the computer program implements the following steps when executed by a processor: the position information of the text area is mapped to the feature map to obtain the mapping area corresponding to the text area; the image located in the mapping area is extracted from the feature map to obtain the image features.
  • the computer program implements the following steps when executed by a processor:
  • the above-mentioned reference tracking track includes a reference text area, and the above-mentioned multiple video frames are continuous on the time axis.
  • the above-mentioned similarity matching result corresponding to the current video frame is obtained; based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, the reference tracking track is updated to obtain the target tracking track corresponding to the current video frame, and the target tracking track is determined as the updated reference tracking track;
  • the above-mentioned order represents the order of each of the above-mentioned video frames on the above-mentioned time axis;
  • the target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the above video.
  • the computer program implements the following steps when executed by a processor:
  • the reference text region is updated based on the current text region, and the reference fusion feature descriptor is updated based on the current fusion feature descriptor to obtain the target tracking trajectory.
  • the computer program implements the following steps when executed by the processor:
  • the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
  • the text tracking network includes a detection branch network, a recognition branch network and a multi-information fusion descriptor branch network;
  • the above detection branch network is used to detect text regions in each video frame
  • the recognition branch network is used to execute the steps of obtaining the character sequence features of the characters in each video frame, the positional features of the character regions where the characters are located, and the image features corresponding to the character regions;
  • the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the above-mentioned character sequence features, the above-mentioned position features and the above-mentioned image features of each of the above-mentioned video frames.
  • the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork and a feature detection subnetwork;
  • the above-mentioned feature extraction sub-network is used to extract the basic features of the above-mentioned video frames for each video frame to obtain the basic features of the above-mentioned video frames;
  • the above-mentioned first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the above-mentioned video frame;
  • the above-mentioned feature detection sub-network is used to process the feature map of the above-mentioned video frame into a text area heat map, and the above-mentioned text area heat map represents the text area and non-text area in the above-mentioned feature map; performing connected domain analysis on the above-mentioned text area heat map , to obtain the text area corresponding to the above video frame.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

Abstract

The present invention relates to a video text tracking method and an electronic device. The method comprises: extracting a plurality of video frames from a video (S11); acquiring a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region (S13); respectively determining, according to the text sequence feature, the position feature and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame (S15); and updating a stored reference tracking trajectory on the basis of the fusion feature descriptor corresponding to each video frame, so as to obtain a tracking trajectory of the text in the video, the tracking trajectory being used for representing position information of the text in the video (S17). The present invention can improve the accuracy of video text tracking trajectory determination.

Description

Video text tracking method and electronic device

This disclosure is based on, and claims priority to, the Chinese patent application with application date December 24, 2021 and application number 202111601711.0, the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a video text tracking method and an electronic device.

Background

A video contains a large amount of text, which can serve as an objective description of the targets in the video and a subjective summary of the scene. The text in a video carries information such as start and end times, position changes, and text content; correctly tracking the text in a video is a key step in video understanding.

Summary

The present disclosure provides a video text tracking method and an electronic device, which achieve a high text tracking accuracy and reduce the consumption of computing resources in the video text tracking process. The technical scheme of the present disclosure is as follows:
According to an aspect of the embodiments of the present disclosure, a video text tracking method is provided, including:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area includes:
for each video frame, determining the text area from the video frame;
encoding position information of the text area to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text area; and
decoding the image feature to obtain the character sequence feature.
In some embodiments, determining the text area from the video frame includes:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
performing connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, extracting the image feature from the video frame based on the position information of the text area includes:
mapping the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
extracting the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame includes:
for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are continuous on the time axis, and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video includes:
traversing each video frame in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain the target tracking trajectory corresponding to the current video frame includes:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area, and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame to obtain the target tracking trajectory corresponding to the current video frame includes:
when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
The method is executed based on a text tracking network, the text tracking network including a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, a video text tracking apparatus is provided, including:
an extraction module configured to extract a plurality of video frames from a video;
a feature acquisition module configured to acquire a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
a descriptor acquisition module configured to obtain a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
a tracking trajectory determination module configured to update a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, the feature acquisition module includes:
a text area determination unit configured to, for each video frame, determine the text area from the video frame;
a position feature acquisition unit configured to encode position information of the text area to obtain the position feature;
an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text area; and
a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
In some embodiments, the text area determination unit includes:
a feature map determination subunit configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
a text area heat map determination subunit configured to determine a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
a connected domain analysis subunit configured to perform connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, the image feature extraction unit includes:
a mapping subunit configured to map the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
an image feature extraction subunit configured to extract the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, the descriptor acquisition module includes:
a splicing unit configured to, for each video frame, splice the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are a plurality of video frames continuous on the time axis, and the tracking trajectory determination module includes:
a similarity matching unit configured to traverse each video frame in order and, when traversing each video frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
a tracking trajectory determination unit configured to determine the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text area based on the current text area and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, determining the text area from the video frame;
encoding position information of the text area to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text area; and
decoding the image feature to obtain the character sequence feature.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text area heat map corresponding to the feature map, the text area heat map representing the text areas and non-text areas in the feature map; and
performing connected domain analysis on the text area heat map to obtain the text area.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
mapping the position information of the text area onto the feature map to obtain a mapping area corresponding to the text area; and
extracting the image located in the mapping area from the feature map to obtain the image feature.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, splicing the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text area, the plurality of video frames are continuous on the time axis, and the processor is configured to execute the instructions to implement the following steps:
traversing each video frame in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text area, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text area corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory; the order represents the order of each video frame on the time axis; and
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following step:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text area based on the current text area, and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the processor is configured to execute the instructions to implement the following step:
when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text area and the current fusion feature descriptor to the reference tracking trajectory to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text area in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame; and
the feature detection sub-network is used to process the feature map of the video frame into a text area heat map, the text area heat map representing the text areas and non-text areas in the feature map, and to perform connected domain analysis on the text area heat map to obtain the text area corresponding to the video frame.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
According to still another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program; when the computer program is executed by a processor, the following steps are implemented:
extracting a plurality of video frames from a video;
acquiring a character sequence feature of the text in each video frame, a position feature of the text area where the text is located, and an image feature corresponding to the text area;
determining a fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame; and
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, the tracking trajectory being used to represent position information of the text in the video.
In the embodiments of the present disclosure, a plurality of video frames are extracted from a video; a fusion feature descriptor corresponding to each video frame is obtained according to the character sequence feature of the text in the video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area; and a stored reference tracking trajectory is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video. Because the determination of the tracking trajectory of the text in the video fully considers the character sequence feature, position feature, and image feature of the text, the accuracy of text tracking is high. In addition, because the present disclosure obtains the tracking trajectory of the text in the video once the reference tracking trajectory has been updated according to the fusion feature descriptors, further processing of every frame by multiple models (text detection, text recognition, other models, and so on) is avoided, reducing the consumption of computing resources in the video text tracking process.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment.
Fig. 2 is a flow chart of a video text tracking method according to an exemplary embodiment.
Fig. 3 is a flow chart of determining character sequence features, position features, and image features according to an exemplary embodiment.
Fig. 4 is a flow chart of determining a corresponding text area from each video frame through a text tracking network according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a text tracking network according to an exemplary embodiment.
Fig. 6 is a flow chart of obtaining fusion feature descriptors through a text tracking network according to an exemplary embodiment.
Fig. 7 is a flow chart of updating a reference tracking trajectory to obtain the tracking trajectory of the text in a video according to an exemplary embodiment.
Fig. 8 is a block diagram of a video text tracking apparatus according to an exemplary embodiment.
Fig. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment.
Detailed Description
The acquisition of user information and information related to user accounts (including social relationship identity information and the like) described in the embodiments of the present disclosure has been authorized by the users; on the premise of obtaining the users' authorization, the methods, apparatuses, devices, and storage media involved in the present disclosure may acquire the relevant user information.
In the related art, a video is usually processed into multiple frames of images, and text detection and text recognition are then performed on each frame one by one to obtain a video-level text tracking result. However, the methods in the related art yield a low accuracy rate for video text tracking. The embodiments of the present disclosure, by contrast, fully consider the character sequence features, position features, and image features of the text, improving the accuracy of text tracking.
Fig. 1 is a schematic diagram of an implementation environment of a video text tracking method according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes at least a terminal 01 and a server 02.
The terminal 01 is used to collect the video to be processed. In some embodiments, the terminal 01 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart TV, a smart voice interaction device, or the like, but is not limited thereto. The terminal 01 and the server 02 are connected directly or indirectly through wired or wireless communication, which is not limited in the present disclosure.
The server 02 is used to receive the video sent by the terminal 01, extract a plurality of video frames from the video, acquire the character sequence feature of the text in each video frame, the position feature of the text area where the text is located, and the image feature corresponding to the text area, determine the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of each video frame, and update the reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video. In some embodiments, the server 02 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
It should be noted that Fig. 1 is only one implementation environment of the video text tracking method provided by the embodiments of the present application; in practical applications, other implementation environments, such as an environment containing only a terminal, may also be used.
Fig. 2 is a flowchart of a video text tracking method according to an exemplary embodiment. As shown in Fig. 2, the method is used in the implementation environment of Fig. 1 that includes a terminal and a server: after the terminal captures the video to be processed, the server executes the video text tracking method on that video. The method includes the following steps S11, S13, S15, and S17.
In S11, multiple video frames are extracted from the video.
In some embodiments, the multiple video frames are consecutive on the time axis. That is, the video includes multiple video frames arranged in chronological order on the time axis, and the frames extracted by the server are adjacent on that axis. For example, the server selects a time period in the video and extracts the video frames within that period, which form a run of consecutive frames.
In the embodiments of the present disclosure, the multiple video frames may be extracted from the video in various ways, which are not specifically limited herein.
In some embodiments, FFmpeg is used to decompose the video into consecutive video frames. After frame extraction, the video to be processed is divided into n temporally consecutive image frames, denoted Frame-1, ..., Frame-t, ..., Frame-n, where Frame-t denotes the video frame at time position t in the video and n is a positive integer greater than 1. FFmpeg is an open-source suite of computer programs for recording and converting digital video and turning it into streams; it provides a complete solution for recording, converting, and streaming audio and video.
In other embodiments, the video is processed into multiple consecutive video frames on the time axis through other video stream decoding techniques.
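As an illustration of this step, the following is a minimal sketch of frame extraction with the FFmpeg command-line tool; the output file names and the fixed sampling rate are assumptions made for the example, not details fixed by the present disclosure.

```python
# A minimal sketch of S11: decode a video into temporally consecutive
# frames Frame-1 ... Frame-n using the FFmpeg command-line tool.
import os
import subprocess

def extract_frames(video_path: str, out_dir: str, fps: int = 25) -> None:
    """Write the video's frames as numbered PNG files (assumed layout)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,                          # video to be processed
            "-vf", f"fps={fps}",                       # sample at a fixed rate
            os.path.join(out_dir, "frame-%05d.png"),   # Frame-t -> frame-00001.png, ...
        ],
        check=True,
    )

extract_frames("input.mp4", "frames")
```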
In S13, the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region are acquired.
In some embodiments, after the consecutive video frames are extracted, the character sequence feature, position feature, and image feature of the text in each video frame are acquired frame by frame in the order of the frames on the time axis. For example, for consecutive video frames Frame-1, ..., Frame-t, ..., Frame-n, the character sequence feature, position feature, and image feature of Frame-1 are acquired first, ..., then those of Frame-t, ..., and finally those of Frame-n.
In other embodiments, after the consecutive video frames are extracted, the character sequence features, position features, and image features of the frames are acquired in parallel.
In some embodiments, the text region where the text is located is a text-line-level region, that is, the region contains one line of text, and the characters in that line are ordered. The character sequence feature therefore represents the order of the characters in the text region. Illustratively, the character sequence feature is a feature map cropped from the corresponding video frame using the coordinates of the text.
In some embodiments, the position feature of the text region where the text is located is the position coordinates of the text region. In an exemplary implementation, the position coordinates of the text region are the coordinates of its upper-left, lower-left, upper-right, and lower-right corners.
In some embodiments, the image feature corresponding to the text region is the feature of the image corresponding to that region in the video frame.
In some embodiments, Fig. 3 is a flowchart of determining the character sequence feature, the position feature, and the image feature according to an exemplary embodiment. As shown in Fig. 3, in S13, acquiring the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region includes the following steps S1301, S1303, S1305, and S1307:
In S1301, the text region is determined from each video frame.
In some embodiments, after the consecutive video frames are extracted, the text region where the text in each video frame is located is determined frame by frame in the order of the frames on the time axis. For example, for consecutive video frames Frame-1, ..., Frame-t, ..., Frame-n, the text region of Frame-1 is determined from Frame-1, ..., the text region of Frame-t from Frame-t, ..., and the text region of Frame-n from Frame-n.
In other embodiments, after the consecutive video frames are extracted, the text regions of the frames are determined in parallel.
Illustratively, the text region is the text box corresponding to the text in the video frame.
In the embodiments of the present disclosure, the text region may be determined in various ways, which are not specifically limited herein.
In some embodiments, the corresponding text region is determined from each video frame by a text box recognition tool.
In other embodiments, the corresponding text region is determined from each video frame through a text tracking network. Fig. 4 is a flowchart of determining the corresponding text region from each video frame through the text tracking network according to an exemplary embodiment. As shown in Fig. 4, in S1301, determining the text region from each video frame includes the following steps S13011, S13013, and S13015:
In S13011, feature extraction is performed on each video frame to obtain the feature map corresponding to that frame.
In S13013, a text region heat map corresponding to the feature map is determined; the text region heat map distinguishes text regions from non-text regions in the feature map.
In S13015, connected component analysis is performed on the text region heat map to obtain the text region.
In some embodiments, steps S13011, S13013, and S13015 are executed for each video frame in turn, in the order of the frames on the time axis.
In other embodiments, steps S13011, S13013, and S13015 are executed for multiple video frames in parallel.
Fig. 5 is a schematic diagram of a text tracking network according to an exemplary embodiment. As shown in Fig. 5, the text tracking network includes a detection branch network, which in turn includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network; the feature detection sub-network further includes two convolutional layers and a global average pooling layer.
Illustratively, in S13011, basic feature extraction is performed on each video frame through the feature extraction sub-network (for example, a classic convolutional network based on ResNet-18) to obtain the basic feature of the frame, which can be understood as a feature map of size Batch*64*H*W, where the batch size is a hyperparameter defining the number of samples processed before the internal model parameters are updated, H is the height, and W is the width. The extracted basic features are then passed through the first feature fusion sub-network (for example, two stacked Feature Pyramid Enhancement Modules (FPEM)) for multi-scale feature fusion, yielding the feature map corresponding to each video frame. FPEM enhances the features extracted by the convolutional neural network, making the extracted features more robust.
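The following is a minimal sketch of such a feature extraction sub-network in PyTorch, assuming a recent torchvision; keeping the ResNet-18 stages up to the last residual block and projecting to 64 channels with a 1*1 convolution are illustrative assumptions, not the exact layers of the disclosed network.

```python
# A minimal sketch of the detection branch's feature extraction step.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep everything up to the last residual stage, drop avgpool/fc
        self.stem = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, 64, kernel_size=1)  # -> Batch*64*H*W

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(self.stem(frames))

features = FeatureExtractor()(torch.randn(2, 3, 224, 224))
print(features.shape)  # torch.Size([2, 64, 7, 7])
```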
Illustratively, in S13013, after the feature map corresponding to each video frame is obtained, the feature map of each frame is processed, through the two convolutional layers and the global average pooling layer of the feature detection sub-network, into a text region heat map indicating whether each location is a text region. It can be understood that the text region heat map contains two kinds of content: text regions and non-text regions.
Illustratively, in S13015, connected component analysis is performed on the text region heat map to obtain the positions of the text regions, and each position is then fitted to a rotated rectangle, thereby obtaining the text regions corresponding to each video frame.
Here, a connected component refers to an adjacent region in an image whose pixels share the same value. Connected component analysis refers to finding and labeling the connected components in an image.
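As an illustration, a minimal sketch of S13015 with OpenCV follows, assuming the heat map has been binarized into a text/non-text mask; the threshold value of 0.5 is an assumption for the example.

```python
# A minimal sketch of connected component analysis plus rotated-rectangle
# fitting over a binarized text region heat map.
import cv2
import numpy as np

def heatmap_to_text_boxes(heatmap: np.ndarray, thresh: float = 0.5):
    """Return one rotated rectangle (box-1 ... box-m) per connected
    component of the binarized text region heat map."""
    mask = (heatmap > thresh).astype(np.uint8)
    num, labels = cv2.connectedComponents(mask)
    boxes = []
    for label in range(1, num):  # label 0 is the non-text background
        ys, xs = np.nonzero(labels == label)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)        # fit a rotated rectangle
        boxes.append(cv2.boxPoints(rect))  # its four corner points
    return boxes

boxes = heatmap_to_text_boxes(np.random.rand(64, 64))
```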
It should be noted that at least one text region can be detected in each video frame in this way, denoted box-1, ..., box-m, for a total of m text regions (m is a positive integer greater than or equal to 1). For example, Frame-1 yields multiple boxes, ..., Frame-t yields multiple boxes, ..., and Frame-n yields multiple boxes.
In the embodiments of the present disclosure, by performing feature extraction, heat map detection, and connected component analysis on each video frame, the text region can be determined accurately from each frame, and the determination does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the computing resources consumed in computing the text region.
In S1303, the position information of the text region is encoded to obtain the position feature.
In some embodiments, the position information of the text regions in each video frame is encoded frame by frame in the order of the frames on the time axis.
In other embodiments, the position information of the text regions in multiple video frames is encoded in parallel.
In an exemplary embodiment, position encoding is applied to the position information of the text region (that is, the coordinates of its upper-left, lower-left, upper-right, and lower-right corners) to obtain the position feature of the text region; the position feature is 1*128-dimensional. In some embodiments, the position encoding includes, but is not limited to, cosine position encoding (cos encoding), sine position encoding (sin encoding), and the like.
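A minimal sketch of such a position encoding follows, assuming a sinusoidal (sin/cos) encoding that spreads 16 dimensions over each of the 8 corner coordinates to reach the 1*128 output; the per-coordinate packing is an assumption, since the disclosure fixes only the output size.

```python
# A minimal sketch of S1303: sinusoidal encoding of the four corner
# coordinates (8 scalars) into a 1*128-dimensional position feature.
import numpy as np

def encode_box_position(corners: np.ndarray, dims_per_coord: int = 16) -> np.ndarray:
    """corners: shape (8,) = (x1, y1, x2, y2, x3, y3, x4, y4)."""
    freqs = 1.0 / (10000 ** (np.arange(dims_per_coord // 2) * 2.0 / dims_per_coord))
    angles = corners[:, None] * freqs[None, :]                       # (8, 8)
    feat = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (8, 16)
    return feat.reshape(1, -1)                                       # (1, 128)

pos_feature = encode_box_position(np.array([10, 20, 110, 20, 110, 44, 10, 44], float))
print(pos_feature.shape)  # (1, 128)
```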
It should be noted that since each video frame includes at least one text region, for the text regions belonging to the same video frame, the position feature of each region is determined in turn according to the order of the regions in the frame. For example, if the text regions of a video frame are box-1, ..., box-m, the position feature of box-1 is determined first, ..., and the position feature of box-m last.
In S1305, the image feature is extracted from each video frame based on the position information of the text region.
In the embodiments of the present disclosure, after the position information of each text region in each video frame is obtained, the image feature corresponding to each text region is extracted from the frame according to that position information.
In some embodiments, the image features of the text regions are extracted from each video frame frame by frame, in the order of the frames on the time axis, according to the position information of the text regions in each frame.
In other embodiments, the image features of the corresponding text regions are extracted from multiple video frames in parallel according to the position information of the text regions.
It should be noted that since each video frame includes at least one text region, for the text regions belonging to the same video frame, the image feature of each region is likewise determined in turn according to the order of the regions in the frame.
In some embodiments, in S1305, extracting the image feature from each video frame based on the position information of the text region includes:
mapping the position information of the text region into the feature map to obtain a position mapping result;
extracting, from the feature map, the image located in the position mapping result to obtain the image feature, where the position mapping result is the mapping region corresponding to the text region.
In some embodiments, the position information of the text region of each video frame (that is, the coordinates of its upper-left, lower-left, upper-right, and lower-right corners) is mapped onto the feature map corresponding to the frame to obtain the position mapping result (that is, a mapping region) of the text region; the part of the feature map located in that mapping region is then cropped out, yielding the image feature corresponding to the text region of each video frame.
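A minimal sketch of this mapping-and-cropping follows, assuming the feature map is downscaled from the frame by an integer stride and the crop covers the box's axis-aligned extent; both choices are assumptions made for the example.

```python
# A minimal sketch of S1305: map box corners into feature map
# coordinates, then crop that region of the feature map.
import numpy as np

def crop_box_feature(feature_map: np.ndarray, corners: np.ndarray, stride: int = 4):
    """feature_map: (C, H, W); corners: (4, 2) box corners in frame pixels."""
    mapped = corners / stride                      # position mapping result
    x0, y0 = np.floor(mapped.min(axis=0)).astype(int)
    x1, y1 = np.ceil(mapped.max(axis=0)).astype(int)
    return feature_map[:, y0:y1, x0:x1]            # image feature of the box

fmap = np.random.rand(64, 56, 56)
box = np.array([[10, 20], [110, 20], [110, 44], [10, 44]], float)
print(crop_box_feature(fmap, box).shape)  # (64, 6, 26)
```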
Illustratively, in S1303 and S1305, the position feature and the image feature are extracted through the text tracking network. As shown in Fig. 5, the text tracking network further includes a recognition branch network, through which the corresponding position feature and image feature are extracted from each video frame according to the position information of the text region.
In the embodiments of the present disclosure, by mapping the position information of the text region into the feature map and cropping out the corresponding image feature, the image feature is matched precisely to the text region, which improves the precision of the image feature and thus of the fusion feature descriptor, and in turn the precision of video text tracking. Moreover, the computation of the image feature does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces its consumption of computing resources.
In S1307, the image feature is decoded to obtain the character sequence feature.
In the embodiments of the present disclosure, since each text region of each video frame is a text-line-level region, that is, the region contains one line of text in which each character has a definite position and order, the image feature corresponding to each text region is decoded, after it is obtained, to yield the character sequence feature corresponding to that region.
In some embodiments, the image features corresponding to the text regions in each video frame are decoded frame by frame, in the order of the frames on the time axis, to obtain the character sequence features corresponding to the text regions in each frame.
In other embodiments, the image features corresponding to the text regions of multiple video frames are decoded in parallel to obtain the character sequence features corresponding to the text regions in each frame.
Illustratively, in S1307, the image feature is decoded through the text tracking network to obtain the character sequence feature. As shown in Fig. 5, the recognition branch network of the text tracking network decodes the image feature cropped for each text region of each video frame (for example, by Connectionist Temporal Classification decoding (CTC decoding)), thereby obtaining the character sequence feature of each text region of each video frame.
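As an illustration of one possible decoding, a minimal sketch of greedy (best-path) CTC decoding follows; the alphabet and the greedy strategy are assumptions, since the disclosure names CTC decoding only as an example.

```python
# A minimal sketch of greedy CTC decoding over per-timestep character
# probabilities, as one possible realization of the decoding in S1307.
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """probs: (T, num_classes) distribution per horizontal position of the
    text-line feature; class 0 is the CTC blank."""
    best = probs.argmax(axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # collapse repeats, drop blanks
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)

probs = np.eye(4)[[1, 1, 0, 2, 2, 0, 3]]   # toy sequence of one-hot steps
print(ctc_greedy_decode(probs, "abc"))     # -> "abc"
```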
In the embodiments of the present disclosure, the text region is first determined from each video frame, and the position feature, image feature, and character sequence feature of the region are then determined from its position information. This improves the precision with which the position feature, image feature, and character sequence feature are determined, hence the precision of the fusion feature descriptor, and in turn the precision of video text tracking. Moreover, determining these features does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the consumption of computing resources.
In S15, the fusion feature descriptor corresponding to each video frame is determined according to the character sequence feature, the position feature, and the image feature of that frame.
In some embodiments, the character sequence feature, position feature, and image feature corresponding to each text region in each video frame are fused frame by frame, in the order of the frames on the time axis, to obtain the fusion feature descriptor corresponding to each text region in each frame.
In other embodiments, the character sequence features, position features, and image features corresponding to the text regions in multiple video frames are fused in parallel to obtain the fusion feature descriptor corresponding to each text region in each frame.
In the embodiments of the present disclosure, the three features of each video frame may be fused in various ways to obtain the fusion feature descriptor corresponding to the frame, which is not specifically limited herein.
In some embodiments, the three features are fused through the text tracking network to obtain the fusion feature descriptor corresponding to each video frame. Illustratively, as shown in Fig. 5, the text tracking network further includes a multi-information fusion descriptor branch network, which in turn includes a second feature fusion sub-network and a feature size adjustment sub-network.
In some embodiments, Fig. 6 is a flowchart of obtaining the fusion feature descriptor through the text tracking network according to an exemplary embodiment. As shown in Fig. 6, in S15, obtaining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of the frame includes the following steps S1501 and S1503:
In S1501, for each video frame, the character sequence feature, the position feature, and the image feature are concatenated to obtain the fusion feature corresponding to the frame.
In S1503, the size of the fusion feature is adjusted to obtain the fusion feature descriptor.
Illustratively, in S1501, the character sequence feature, position feature, and image feature corresponding to each text region of each video frame are concatenated through the second feature fusion sub-network to obtain the fusion feature corresponding to each text region of the frame; the dimension of the fusion feature is 3*128.
Illustratively, in S1503, the output size of the fusion feature corresponding to each text region of each video frame is adjusted (for example, from 3*128 dimensions to 1*128 dimensions) through the feature size adjustment sub-network (for example, two multilayer perceptrons (MLP)) to obtain the fusion feature descriptor corresponding to each text region of each video frame. A multilayer perceptron is a feedforward artificial neural network comprising an input layer, an output layer, and several hidden layers.
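A minimal sketch of S1501 and S1503 follows, assuming a PyTorch implementation in which each of the three features is 1*128-dimensional; the width of the two-layer MLP is an assumption for the example.

```python
# A minimal sketch of feature concatenation (S1501) and size adjustment
# (S1503): three 1*128 features -> 3*128 fusion feature -> 1*128 descriptor.
import torch
import torch.nn as nn

class FusionDescriptor(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(   # feature size adjustment: 3*128 -> 1*128
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, seq_feat, pos_feat, img_feat):
        fused = torch.cat([seq_feat, pos_feat, img_feat], dim=-1)  # (N, 384)
        return self.mlp(fused)                                     # (N, 128)

f = torch.randn(5, 128)  # five text boxes in one frame
descriptors = FusionDescriptor()(f, f, f)
print(descriptors.shape)  # torch.Size([5, 128])
```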
In some embodiments, the multi-information fusion descriptor branch network further includes a position feature extraction sub-network, an image feature extraction sub-network, and a character sequence feature extraction sub-network, which extract the position feature, the image feature, and the character sequence feature, respectively.
In the embodiments of the present disclosure, feature extraction, feature recognition, and feature fusion are performed through the above end-to-end text tracking network to obtain the fusion feature descriptor, from which the tracking track of the video text is derived. Video text tracking is thus achieved with a single model; the tracking track is determined with high accuracy and robustness, and the computing resource consumption is small.
It should be noted that since each video frame corresponds to at least one text region, the fusion feature descriptor of each text region in each frame is determined in turn according to the order of the regions in the frame. For example, if the text regions of a video frame are box-1, ..., box-m, it is first determined that box-1 corresponds to one fusion feature descriptor, ..., and finally that box-m corresponds to one fusion feature descriptor.
In the embodiments of the present disclosure, the character sequence feature, position feature, and image feature of each text region of each video frame are concatenated into a fusion feature, and the size of the fusion feature is adjusted to obtain the fusion feature descriptor. The determination of the fusion feature descriptor thus fully considers the character sequence, position, and image features of the text and achieves high accuracy, which in turn improves the accuracy of text tracking. Moreover, determining the fusion feature descriptor does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces its consumption of computing resources.
In S17, the stored reference tracking track is updated based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video; the tracking track represents the position information of the text.
In the embodiments of the present disclosure, a reference tracking track representing the position information of the text is maintained. After the fusion feature descriptor of the text region of each video frame is obtained, the reference fusion feature descriptor in the reference tracking track is updated with the fusion feature descriptor of the text region in each frame, yielding the tracking track of the text in the video.
Illustratively, the update includes, but is not limited to, modification, replacement, addition, deletion, and the like.
Fig. 7 is a flowchart of updating the reference tracking track to obtain the tracking track of the text in the video according to an exemplary embodiment. As shown in Fig. 7, in S17, updating the reference tracking track based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video includes the following steps S1701 and S1703:
In S1701, the video frames are traversed in order, and for each traversed frame the following operations are performed: similarity matching is performed between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region in the reference tracking track, to obtain the similarity matching result corresponding to the current video frame; the reference tracking track is updated based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking track corresponding to the current video frame; and the target tracking track is determined as the updated reference tracking track. The order represents the order of the video frames on the time axis.
In S1703, the target tracking track corresponding to the last video frame in the order is taken as the tracking track of the text in the video.
In some embodiments, since the embodiments of the present disclosure analyze the tracking track of video text, and the video frames and the text within them are ordered, the track must be updated frame by frame in the order of the frames on the time axis.
Illustratively, the consecutive video frames are Frame-1, ..., Frame-t, ..., Frame-n, and each frame corresponds to multiple text regions box-1, ..., box-t, ..., box-m. Each text region corresponds to one fusion feature descriptor.
The reference tracking track is empty at initialization.
Frame-1 is traversed first, becoming the current video frame. Since the reference tracking track is empty, the text regions box-1, ..., box-t, ..., box-m of Frame-1 and their fusion feature descriptors are directly initialized into the reference tracking track, yielding target tracking track 1 corresponding to Frame-1, which is determined as the updated reference tracking track.
Frame-2 is traversed next, becoming the current video frame. Since tracking track 1 already stores the text regions of Frame-1 and their fusion feature descriptors, the similarity between the fusion feature descriptor of each text region of Frame-2 and the fusion feature descriptors already in tracking track 1 is computed in turn, yielding the similarity matching result corresponding to Frame-2. Tracking track 1 is then updated based on this result, the current fusion feature descriptors of the text regions of Frame-2, and those text regions, yielding target tracking track 2 corresponding to Frame-2, which is determined as the updated reference tracking track.
The subsequent traversal of Frame-3 through Frame-n proceeds in the same way as for Frame-2 and is not repeated here. After the last frame Frame-n is traversed, the target tracking track n corresponding to Frame-n is output; it is the tracking track of the text in the video.
In the embodiments of the present disclosure, the fusion feature descriptor of each text region of each video frame is matched for similarity against the reference fusion feature descriptors in the reference tracking track, and the reference descriptors are updated in turn according to each frame's similarity results, so that the final tracking track fully considers the character sequence, position, and image features of the text and text tracking accuracy is high. Moreover, the determination of the tracking track is an end-to-end process (for example, text tracking is implemented with one end-to-end text tracking network) that does not depend on multiple models (text detection, text recognition, other models, etc.), which reduces the computing resources consumed in determining the track.
In some embodiments, performing similarity matching between the current fusion feature descriptor corresponding to the current video frame and the reference fusion feature descriptor in the reference tracking track to obtain the similarity matching result corresponding to the current video frame includes:
computing a similarity confusion matrix between the current fusion feature descriptor and the reference fusion feature descriptor to obtain the similarity matching result.
In some embodiments, to improve the precision of the similarity computation, the similarity confusion matrix is computed to obtain the similarities, and the Hungarian algorithm is then used to select the higher similarities, yielding the similarity matching result.
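A minimal sketch of this matching step follows, using cosine similarity for the confusion matrix and SciPy's Hungarian solver; cosine similarity is an assumption, since the disclosure does not fix the similarity measure.

```python
# A minimal sketch of the similarity confusion matrix plus Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_descriptors(current: np.ndarray, reference: np.ndarray):
    """current: (p, 128) descriptors of the current frame's boxes;
    reference: (q, 128) descriptors stored in the reference track.
    Returns (box index, track index, similarity) per matched pair."""
    cur = current / np.linalg.norm(current, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = cur @ ref.T                         # similarity confusion matrix
    rows, cols = linear_sum_assignment(-sim)  # Hungarian: maximize similarity
    return [(r, c, sim[r, c]) for r, c in zip(rows, cols)]

matches = match_descriptors(np.random.rand(2, 128), np.random.rand(3, 128))
```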
Assume the consecutive video frames are Frame-1, ..., Frame-t, ..., Frame-n, and each frame corresponds to multiple text regions box-1, ..., box-t, ..., box-m, each with one fusion feature descriptor.
The reference tracking track is empty at initialization.
Frame-1 is traversed first, becoming the current video frame. Since the reference tracking track is empty, the text regions box-1, box-2, and box-3 of Frame-1 and their fusion feature descriptors are directly added to the reference tracking track, yielding target tracking track 1 corresponding to Frame-1, which is determined as the updated reference tracking track.
Frame-2 is traversed next, becoming the current video frame. Assume Frame-2 includes box-1 and box-2. A similarity confusion matrix is computed between the fusion feature descriptors of box-1 and box-2 of Frame-2 and the fusion feature descriptors of box-1, box-2, and box-3 in tracking track 1. This yields the similarities between the descriptor of box-1 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (that is, three similarities for box-1 of Frame-2), and the similarities between the descriptor of box-2 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (three similarities for box-2 of Frame-2). Finally, the Hungarian matching algorithm selects the highest of the three similarities of box-1 of Frame-2 as the similarity matching result for box-1 of Frame-2, and the highest of the three similarities of box-2 of Frame-2 as the similarity matching result for box-2 of Frame-2.
The subsequent determination of the similarity matching results for Frame-3 through Frame-n proceeds in the same way as for Frame-2 and is not repeated here.
In some embodiments, updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame includes:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking track.
In some embodiments, updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame to obtain the target tracking track corresponding to the current video frame further includes:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
In practical applications, a text region in the current video frame may be part of an existing track, or it may not belong to any existing track and instead be a new text region distinct from the existing tracks.
To accurately judge whether a text region in the current video frame belongs to an existing track, the similarity matching result corresponding to the current video frame is compared with the preset similarity threshold.
When the similarity matching result corresponding to the current video frame is smaller than the preset similarity threshold, indicating that the current text region is part of an existing track, the reference text region is updated with the current text region and the reference fusion feature descriptor with the current fusion feature descriptor, so that the current text region and its descriptor are merged into the existing track. The text regions belonging to an existing track are thereby determined accurately, improving the precision with which the tracking track of the text in the video is determined.
When the similarity matching result corresponding to the current video frame is greater than or equal to the preset similarity threshold, indicating that the current text region is not part of an existing track but new text, the current text region and the current fusion feature descriptor are added to the reference tracking track to update it. The text regions not belonging to any existing track are thereby determined accurately, further improving the precision with which the tracking track of the text in the video is determined.
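A minimal sketch of the resulting per-frame track update follows. It assumes the conventional reading in which a box whose best similarity reaches the threshold continues an existing track and any other box starts a new one; the threshold value and the dict-based track layout are likewise assumptions made for the example.

```python
# A minimal sketch of the per-frame track update in S1701, built on the
# (box index, track index, similarity) matches from the matching step.
def update_tracks(tracks, boxes, descriptors, matches, threshold=0.8):
    """tracks: list of {'box', 'descriptor'} entries (the reference track)."""
    matched = set()
    for b, t, sim in matches:
        if sim >= threshold:        # part of an existing track: update it
            tracks[t]["box"] = boxes[b]
            tracks[t]["descriptor"] = descriptors[b]
            matched.add(b)
    for b in range(len(boxes)):     # new text: add a new track entry
        if b not in matched:
            tracks.append({"box": boxes[b], "descriptor": descriptors[b]})
    return tracks

tracks = update_tracks([], ["box-1"], [[0.1] * 128], [])  # Frame-1 initializes
```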
Continuing with the above example:
For Frame-2, the computation of the similarity confusion matrix yields the similarities between the fusion feature descriptor of box-1 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (that is, three similarities for box-1 of Frame-2), and the similarities between the fusion feature descriptor of box-2 of Frame-2 and the descriptors of box-1, box-2, and box-3 in tracking track 1 (three similarities for box-2 of Frame-2).
Assume the highest similarity selected by the Hungarian matching algorithm from the three similarities of box-1 of Frame-2 is the one obtained by matching against box-1 in tracking track 1, and that this similarity is greater than or equal to the preset similarity threshold, indicating that box-1 of Frame-2 is part of an existing track (box-1 in tracking track 1). Then box-1 in tracking track 1 is updated with box-1 of Frame-2, and the reference fusion feature descriptor of box-1 in tracking track 1 is updated with the current fusion feature descriptor of box-1 of Frame-2, so that box-1 of Frame-2 and its descriptor are merged into tracking track 1.
Assume the highest similarity selected by the Hungarian matching algorithm from the three similarities of box-2 of Frame-2 is the one obtained by matching against box-2 in tracking track 1, and that this similarity is smaller than the preset similarity threshold, indicating that box-2 of Frame-2 is not part of an existing track but new text. Then box-2 of Frame-2 and its fusion feature descriptor are initialized into tracking track 1.
Fig. 8 is a block diagram of a video text tracking apparatus according to an exemplary embodiment. Referring to Fig. 8, the apparatus includes an extraction module 21, a feature acquisition module 23, a descriptor acquisition module 25, and a tracking track determination module 27.
The extraction module 21 is configured to extract multiple video frames from a video.
The feature acquisition module 23 is configured to acquire the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region.
The descriptor acquisition module 25 is configured to obtain the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of that frame.
The tracking track determination module 27 is configured to update a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame to obtain the tracking track of the text in the video; the tracking track represents the position information of the text in the video.
In some embodiments, the feature acquisition module 23 includes:
a text region determination unit configured to, for each video frame, determine the text region from the video frame;
a position feature acquisition unit configured to encode the position information of the text region to obtain the position feature;
an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text region;
a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
In some embodiments, the text region determination unit includes:
a feature map determination subunit configured to perform feature extraction on the video frame to obtain the feature map corresponding to the video frame;
a text region heat map determination subunit configured to determine the text region heat map corresponding to the feature map, the text region heat map representing the text regions and non-text regions in the feature map;
a connected component analysis subunit configured to perform connected component analysis on the text region heat map to obtain the text region.
In some embodiments, the image feature extraction unit includes:
a mapping subunit configured to map the position information of the text region into the feature map to obtain the mapping region corresponding to the text region;
an image feature extraction subunit configured to extract, from the feature map, the image located in the mapping region to obtain the image feature.
In some embodiments, the descriptor acquisition module 25 includes:
a concatenation unit configured to, for each video frame, concatenate the character sequence feature, the position feature, and the image feature of the video frame to obtain the fusion feature corresponding to the frame;
an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking track includes a reference text region, the multiple video frames are consecutive on the time axis, and the tracking track determination module 27 includes:
a similarity matching unit configured to traverse the video frames in order and, for each traversed frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain the similarity matching result corresponding to the current video frame; updating the reference tracking track based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking track corresponding to the current video frame; and determining the target tracking track as the updated reference tracking track, the order representing the order of the video frames on the time axis;
a tracking track determination unit configured to determine the target tracking track corresponding to the last video frame in the order as the tracking track of the text in the video.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text region based on the current text region and update the reference fusion feature descriptor based on the current fusion feature descriptor to obtain the target tracking track.
In some embodiments, the similarity matching unit is configured to, when the similarity matching result is greater than or equal to the preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking track to obtain the target tracking track.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is used to detect the text region in each video frame;
the recognition branch network is used to perform the step of acquiring the character sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is used to perform the step of determining the fusion feature descriptor corresponding to each video frame according to the character sequence feature, the position feature, and the image feature of that frame.
In some embodiments, the detection branch network includes a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
the feature extraction sub-network is used to perform, for each video frame, basic feature extraction on the video frame to obtain the basic feature of the frame;
the first feature fusion sub-network is used to perform multi-scale feature fusion processing on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection sub-network is used to process the feature map of the video frame into a text region heat map, the text region heat map representing the text regions and non-text regions in the feature map, and to perform connected component analysis on the text region heat map to obtain the text region corresponding to the video frame.
The specific manner in which each module of the apparatus in the foregoing embodiments performs its operations has been described in detail in the embodiments of the method, and is not elaborated again here.
In an exemplary embodiment, an electronic device is further provided, including a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
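Sketched below is one way the four steps above could be realized for a single detected box: the box coordinates are encoded as frame-normalized values (position feature), an ROI-aligned crop of the feature map serves as the image feature, and a recurrent decoder reads the crop column by column into a character sequence feature. The normalization scheme, the 8x32 ROI size, the stride of 4, and the GRU decoder are all assumptions introduced for this sketch.

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

class RegionFeatureExtractor(nn.Module):
    def __init__(self, feat_channels: int = 64, num_classes: int = 97, stride: int = 4):
        super().__init__()
        self.stride = stride  # assumed downsampling factor of the feature map
        self.decoder = nn.GRU(feat_channels, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)  # per-column character logits

    def forward(self, feature_map: torch.Tensor, box, frame_hw):
        h, w = frame_hw
        x1, y1, x2, y2 = box
        # Position feature: the box coordinates encoded as frame-normalized values.
        pos_feat = torch.tensor([x1 / w, y1 / h, x2 / w, y2 / h])
        # Image feature: crop the box (mapped to feature-map scale) with ROI align.
        rois = torch.tensor([[0.0, float(x1), float(y1), float(x2), float(y2)]])
        img_feat = ops.roi_align(feature_map, rois, output_size=(8, 32),
                                 spatial_scale=1.0 / self.stride)  # (1, C, 8, 32)
        # Character sequence feature: decode the crop column by column.
        seq_in = img_feat.mean(dim=2).permute(0, 2, 1)  # (1, 32, C)
        seq_feat, _ = self.decoder(seq_in)               # (1, 32, 128)
        logits = self.classifier(seq_feat)               # (1, 32, num_classes)
        return seq_feat, pos_feat, img_feat, logits
```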
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
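The heat-map thresholding and connected domain analysis described here are commonly done with OpenCV, as in the sketch below; the 0.5 binarization threshold and the minimum-area filter are assumed values, not parameters from the disclosure.

```python
import cv2
import numpy as np

def heatmap_to_text_regions(heatmap: np.ndarray, thresh: float = 0.5, min_area: int = 16):
    """heatmap: (H, W) array in [0, 1]; returns a list of (x, y, w, h) boxes."""
    binary = (heatmap >= thresh).astype(np.uint8)  # text vs. non-text mask
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, num):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:  # drop tiny spurious components
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```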
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
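In the simplest case, mapping the text region into the feature map is a coordinate rescaling by the downsampling stride between the frame and the feature map; the uniform stride of 4 in the snippet below is an assumption of this sketch.

```python
def map_box_to_feature(box, stride: int = 4):
    """Map an image-space box (x1, y1, x2, y2) onto the feature map, assuming
    the feature map is a uniform `stride`-fold downsampling of the frame."""
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)

# The image feature is then the feature-map patch inside the mapped region:
# fx1, fy1, fx2, fy2 = map_box_to_feature(box)
# image_feature = feature_map[:, :, fy1:fy2, fx1:fx2]
```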
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
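The concatenation and size adjustment can be as simple as flattening the three features, concatenating them, and projecting the result to a fixed length; the linear projection and the 256-dimensional output below are assumptions of this sketch. A fixed-length output is what makes descriptors from different frames directly comparable in the similarity matching step.

```python
import torch
import torch.nn as nn

class FusionDescriptor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        # Size adjustment: project the concatenated fusion feature to a fixed length.
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, seq_feat, pos_feat, img_feat):
        # Concatenate the character sequence, position, and image features.
        fused = torch.cat([seq_feat.flatten(), pos_feat.flatten(), img_feat.flatten()])
        return self.proj(fused)  # fusion feature descriptor

# Example sizing, matching the shapes of the earlier sketch (assumed):
# in_dim = 1*32*128 (seq) + 4 (pos) + 1*64*8*32 (img) = 20484
```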
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and the processor is configured to execute the instructions to implement the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, the processor is configured to execute the instructions to implement the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
The electronic device may be a terminal, a server, or a similar computing apparatus. Taking a server as an example, Fig. 9 is a block diagram of an electronic device for video text tracking according to an exemplary embodiment. The electronic device 30 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 31 (each CPU 31 may include, but is not limited to, a processing apparatus such as a microprocessor MCU or a programmable logic device FPGA), a memory 33 for storing data, and one or more storage media 32 (for example, one or more mass storage devices) storing application programs 323 or data 322. The memory 33 and the storage medium 32 may be transitory storage or persistent storage. The program stored in the storage medium 32 may include one or more modules, and each module may include a series of instruction operations for the electronic device. Further, the CPU 31 may be configured to communicate with the storage medium 32 and to execute, on the electronic device 30, the series of instruction operations in the storage medium 32. The electronic device 30 may further include one or more power supplies 36, one or more wired or wireless network interfaces 35, one or more input/output interfaces 34, and/or one or more operating systems 321, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The input/output interface 34 may be used to receive or send data via a network. A specific example of the network may include a wireless network provided by a communication provider of the electronic device 30. In one example, the input/output interface 34 includes a network interface controller (NIC), which may be connected to other network devices through a base station so as to communicate with the Internet. In an exemplary embodiment, the input/output interface 34 may be a radio frequency (RF) module configured to communicate with the Internet wirelessly.
Those of ordinary skill in the art can understand that the structure shown in Fig. 9 is merely illustrative and does not limit the structure of the electronic device. For example, the electronic device 30 may include more or fewer components than shown in Fig. 9, or have a configuration different from that shown in Fig. 9.
In an exemplary embodiment, a computer-readable storage medium is further provided. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is caused to perform the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
In an exemplary embodiment, a computer program product is further provided, including a computer program. When executed by a processor, the computer program implements the following steps:
extracting a plurality of video frames from a video;
obtaining a character sequence feature of the text in each video frame, a position feature of the text region in which the text is located, and an image feature corresponding to the text region;
determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame;
updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
In some embodiments, when executed by the processor, the computer program implements the following steps:
for each video frame, determining the text region from the video frame;
encoding the position information of the text region to obtain the position feature;
extracting the image feature from the video frame based on the position information of the text region;
decoding the image feature to obtain the character sequence feature.
In some embodiments, when executed by the processor, the computer program implements the following steps:
performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes the text regions from the non-text regions in the feature map;
performing connected domain analysis on the text region heat map to obtain the text region.
In some embodiments, when executed by the processor, the computer program implements the following steps:
mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region;
extracting the image located in the mapped region from the feature map to obtain the image feature.
In some embodiments, when executed by the processor, the computer program implements the following steps:
for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame;
adjusting the size of the fusion feature to obtain the fusion feature descriptor.
In some embodiments, the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on the time axis, and, when executed by the processor, the computer program implements the following steps:
traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis;
determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
In some embodiments, when executed by the processor, the computer program implements the following steps:
when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
In some embodiments, when executed by the processor, the computer program implements the following steps:
when the similarity matching result is greater than or equal to the preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
In some embodiments, the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
the detection branch network is configured to detect the text region in each video frame;
the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region;
the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
In some embodiments, the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
All of the embodiments of the present disclosure may be implemented independently or in combination with other embodiments, and all such implementations fall within the scope of protection claimed by the present disclosure.

Claims (32)

1. A video text tracking method, comprising:
    extracting a plurality of video frames from a video;
    obtaining a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
2. The video text tracking method according to claim 1, wherein obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region comprises:
    for each video frame, determining the text region from the video frame;
    encoding position information of the text region to obtain the position feature;
    extracting the image feature from the video frame based on the position information of the text region; and
    decoding the image feature to obtain the character sequence feature.
3. The video text tracking method according to claim 2, wherein determining the text region from the video frame comprises:
    performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    performing connected domain analysis on the text region heat map to obtain the text region.
4. The video text tracking method according to claim 3, wherein extracting the image feature from the video frame based on the position information of the text region comprises:
    mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    extracting an image located in the mapped region from the feature map to obtain the image feature.
5. The video text tracking method according to claim 1, wherein determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame comprises:
    for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    adjusting the size of the fusion feature to obtain the fusion feature descriptor.
6. The video text tracking method according to claim 1, wherein the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on a time axis, and updating the stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain the tracking trajectory of the text in the video comprises:
    traversing each of the video frames in order and, when traversing each video frame, performing the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis; and
    determining the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
7. The video text tracking method according to claim 6, wherein updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking trajectory corresponding to the current video frame comprises:
    when the similarity matching result is smaller than a preset similarity threshold, updating the reference text region based on the current text region and updating the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
8. The video text tracking method according to claim 6, wherein updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain the target tracking trajectory corresponding to the current video frame comprises:
    when the similarity matching result is greater than or equal to a preset similarity threshold, adding the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
9. The video text tracking method according to claim 1, wherein the method is performed based on a text tracking network, and the text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region; and
    the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
10. The video text tracking method according to claim 9, wherein the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
    the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame; and
    the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
11. A video text tracking apparatus, comprising:
    an extraction module configured to extract a plurality of video frames from a video;
    a feature acquisition module configured to obtain a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    a descriptor acquisition module configured to obtain, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    a tracking trajectory determination module configured to update a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
12. The video text tracking apparatus according to claim 11, wherein the feature acquisition module comprises:
    a text region determination unit configured to, for each video frame, determine the text region from the video frame;
    a position feature acquisition unit configured to encode position information of the text region to obtain the position feature;
    an image feature extraction unit configured to extract the image feature from the video frame based on the position information of the text region; and
    a character sequence feature acquisition unit configured to decode the image feature to obtain the character sequence feature.
13. The video text tracking apparatus according to claim 12, wherein the text region determination unit comprises:
    a feature map determination subunit configured to perform feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    a text region heat map determination subunit configured to determine a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    a connected domain analysis subunit configured to perform connected domain analysis on the text region heat map to obtain the text region.
14. The video text tracking apparatus according to claim 13, wherein the image feature extraction unit comprises:
    a mapping subunit configured to map the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    an image feature extraction subunit configured to extract an image located in the mapped region from the feature map to obtain the image feature.
15. The video text tracking apparatus according to claim 11, wherein the descriptor acquisition module comprises:
    a concatenation unit configured to, for each video frame, concatenate the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    an adjustment unit configured to adjust the size of the fusion feature to obtain the fusion feature descriptor.
16. The video text tracking apparatus according to claim 11, wherein the reference tracking trajectory includes a reference text region, the plurality of video frames are consecutive on a time axis, and the tracking trajectory determination module comprises:
    a similarity matching unit configured to traverse each of the video frames in order and, when traversing each video frame, perform the following operations: performing similarity matching between the current fusion feature descriptor corresponding to the traversed current video frame and the reference fusion feature descriptor corresponding to the reference text region, to obtain a similarity matching result corresponding to the current video frame; updating the reference tracking trajectory based on the similarity matching result, the current fusion feature descriptor, and the current text region corresponding to the current video frame, to obtain a target tracking trajectory corresponding to the current video frame; and determining the target tracking trajectory as the updated reference tracking trajectory, wherein the order represents the order of the video frames on the time axis; and
    a tracking trajectory determination unit configured to determine the target tracking trajectory corresponding to the last video frame in the order as the tracking trajectory of the text in the video.
17. The video text tracking apparatus according to claim 16, wherein the similarity matching unit is configured to, when the similarity matching result is smaller than a preset similarity threshold, update the reference text region based on the current text region and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the target tracking trajectory.
18. The video text tracking apparatus according to claim 16, wherein the similarity matching unit is configured to, when the similarity matching result is greater than or equal to a preset similarity threshold, add the current text region and the current fusion feature descriptor to the reference tracking trajectory, to obtain the target tracking trajectory.
19. The video text tracking apparatus according to claim 11, wherein a text tracking network includes a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the character sequence feature of the text in each video frame, the position feature of the text region in which the text is located, and the image feature corresponding to the text region; and
    the multi-information fusion descriptor branch network is configured to perform the step of determining, based on the character sequence feature, the position feature, and the image feature of each video frame, the fusion feature descriptor corresponding to each video frame.
20. The video text tracking apparatus according to claim 19, wherein the detection branch network includes a feature extraction subnetwork, a first feature fusion subnetwork, and a feature detection subnetwork;
    the feature extraction subnetwork is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion subnetwork is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame; and
    the feature detection subnetwork is configured to process the feature map of the video frame into a text region heat map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
21. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the following steps:
    extracting a plurality of video frames from a video;
    obtaining a character sequence feature of text in each video frame, a position feature of a text region in which the text is located, and an image feature corresponding to the text region;
    determining, based on the character sequence feature, the position feature, and the image feature of each video frame, a fusion feature descriptor corresponding to each video frame; and
    updating a stored reference tracking trajectory based on the fusion feature descriptor corresponding to each video frame to obtain a tracking trajectory of the text in the video, wherein the tracking trajectory represents position information of the text in the video.
22. The electronic device according to claim 21, wherein the processor is configured to execute the instructions to implement the following steps:
    for each video frame, determining the text region from the video frame;
    encoding position information of the text region to obtain the position feature;
    extracting the image feature from the video frame based on the position information of the text region; and
    decoding the image feature to obtain the character sequence feature.
23. The electronic device according to claim 22, wherein the processor is configured to execute the instructions to implement the following steps:
    performing feature extraction on the video frame to obtain a feature map corresponding to the video frame;
    determining a text region heat map corresponding to the feature map, wherein the text region heat map distinguishes text regions from non-text regions in the feature map; and
    performing connected domain analysis on the text region heat map to obtain the text region.
24. The electronic device according to claim 23, wherein the processor is configured to execute the instructions to implement the following steps:
    mapping the position information of the text region into the feature map to obtain a mapped region corresponding to the text region; and
    extracting an image located in the mapped region from the feature map to obtain the image feature.
25. The electronic device according to claim 21, wherein the processor is configured to execute the instructions to implement the following steps:
    for each video frame, concatenating the character sequence feature, the position feature, and the image feature of the video frame to obtain a fusion feature corresponding to the video frame; and
    adjusting the size of the fusion feature to obtain the fusion feature descriptor.
  26. 根据权利要求21所述的电子设备,其中,所述参考跟踪轨迹包括参考文字区域,所述多个视频帧在时间轴上连续,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 21, wherein the reference tracking track includes a reference text area, the plurality of video frames are continuous on the time axis, and the processor is configured to execute the instructions to achieve the following steps :
    按顺序依次遍历所述每个视频帧,并在遍历所述每个视频帧时,执行以下操作:将遍历到的当前视频帧对应的当前融合特征描述子,与所述参考文字区域对应的参考融合特征描述子进行相似度匹配,得到所述当前视频帧对应的相似度匹配结果;基于所述相似度匹配结果、所述当前融合特征描述子和所述当前视频帧对应的当前文字区域,更新所述参考跟踪轨迹,得到所述当前视频帧对应的目标跟踪轨迹;且将所述目标跟踪轨迹确定为更新后的所述参考跟踪轨迹;所述顺序表征所述每个视频帧在所述时间轴上的顺序;Traverse each video frame in sequence, and when traversing each video frame, perform the following operations: the current fusion feature descriptor corresponding to the current video frame traversed, the reference corresponding to the reference text area The fusion feature descriptor performs similarity matching to obtain the similarity matching result corresponding to the current video frame; based on the similarity matching result, the current fusion feature descriptor and the current text area corresponding to the current video frame, update The reference tracking track obtains the target tracking track corresponding to the current video frame; and determines the target tracking track as the updated reference tracking track; the sequence characterizes each video frame at the time the order on the axes;
    将排序最后的视频帧对应的目标跟踪轨迹,确定为所述视频中的文字的跟踪轨迹。The target tracking track corresponding to the last sorted video frame is determined as the tracking track of the text in the video.
  27. 根据权利要求26所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 26, wherein the processor is configured to execute the instructions to implement the following steps:
    在所述相似度匹配结果小于预设相似度阈值的情况下,基于所述当前文字区域更新所述参考文字区域,并基于所述当前融合特征描述子更新所述参考融合特征描述子,得到所述目标跟踪轨迹。When the similarity matching result is less than a preset similarity threshold, update the reference text region based on the current text region, and update the reference fusion feature descriptor based on the current fusion feature descriptor, to obtain the the target tracking trajectory.
  28. 根据权利要求26所述的电子设备,其中,所述处理器被配置为执行所述指令,以实现如下步骤:The electronic device according to claim 26, wherein the processor is configured to execute the instructions to implement the following steps:
    在所述相似度匹配结果大于或等于预设相似度阈值的情况下,将所述当前文字区域和所述当前融合特征描述子添加至所述参考跟踪轨迹中,得到所述目标跟踪轨迹。When the similarity matching result is greater than or equal to a preset similarity threshold, the current text region and the current fusion feature descriptor are added to the reference tracking track to obtain the target tracking track.
29. The electronic device according to claim 21, wherein a text tracking network comprises a detection branch network, a recognition branch network, and a multi-information fusion descriptor branch network;
    the detection branch network is configured to detect the text region in each video frame;
    the recognition branch network is configured to perform the step of obtaining the text sequence feature of the text in each video frame, the position feature of the text region where the text is located, and the image feature corresponding to the text region;
    the multi-information fusion descriptor branch network is configured to perform the step of determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, the fusion feature descriptor corresponding to each video frame.
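To make the division of labor in claim 29 concrete, here is a small skeleton showing only the data flow among the three branches. Every branch body is a placeholder supplied by the caller, and the class and method names are invented for illustration; only the wiring is taken from the claim.

```python
class TextTrackingNetwork:
    """Skeleton of the three-branch layout described in claim 29."""

    def __init__(self, detection_branch, recognition_branch, fusion_branch):
        self.detect = detection_branch        # frame -> list of text regions
        self.recognize = recognition_branch   # (frame, region) -> 3 features
        self.fuse = fusion_branch             # 3 features -> fusion descriptor

    def describe_frame(self, frame):
        # Detection branch: locate the text regions in the frame.
        regions = self.detect(frame)
        # Recognition branch: per-region text sequence, position, image feats.
        features = [self.recognize(frame, r) for r in regions]
        # Fusion branch: one fusion feature descriptor per region.
        return regions, [self.fuse(*f) for f in features]
```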
30. The electronic device according to claim 29, wherein the detection branch network comprises a feature extraction sub-network, a first feature fusion sub-network, and a feature detection sub-network;
    the feature extraction sub-network is configured to, for each video frame, perform basic feature extraction on the video frame to obtain basic features of the video frame;
    the first feature fusion sub-network is configured to perform multi-scale feature fusion on the extracted basic features to obtain the feature map corresponding to the video frame;
    the feature detection sub-network is configured to process the feature map of the video frame into a text region heat map, the text region heat map representing text regions and non-text regions in the feature map, and to perform connected domain analysis on the text region heat map to obtain the text region corresponding to the video frame.
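A sketch of the heat-map-to-regions step at the end of claim 30, using SciPy's connected-component labeling as one plausible form of "connected domain analysis"; the probability threshold and the box format are assumptions.

```python
import numpy as np
from scipy import ndimage

def regions_from_heatmap(heatmap: np.ndarray, prob_threshold: float = 0.5):
    """Turn a text-region heat map into boxes via connected components.

    heatmap: (H, W) array of per-pixel text probabilities at feature-map
             scale. The 0.5 threshold is an assumed value.
    """
    mask = heatmap > prob_threshold              # text vs. non-text pixels
    labeled, num = ndimage.label(mask)           # connected-domain analysis
    boxes = []
    for obj in ndimage.find_objects(labeled):    # one slice pair per component
        ys, xs = obj
        boxes.append((xs.start, ys.start, xs.stop, ys.stop))  # x1, y1, x2, y2
    return boxes
```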
31. A computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is caused to perform the following steps:
    extracting a plurality of video frames from a video;
    obtaining a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region;
    determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, a fusion feature descriptor corresponding to each video frame;
    updating a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame, to obtain a tracking track of the text in the video, wherein the tracking track is used to represent position information of the text in the video.
32. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    extracting a plurality of video frames from a video;
    obtaining a text sequence feature of text in each video frame, a position feature of a text region where the text is located, and an image feature corresponding to the text region;
    determining, according to the text sequence feature, the position feature, and the image feature of each video frame respectively, a fusion feature descriptor corresponding to each video frame;
    updating a stored reference tracking track based on the fusion feature descriptor corresponding to each video frame, to obtain a tracking track of the text in the video, wherein the tracking track is used to represent position information of the text in the video.
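Read end to end, claims 31 and 32 recite the same four-step pipeline as the method. The sketch below only wires those steps together; detector, recognizer, fuser, and matcher are placeholder callables (for instance, the helpers sketched under the earlier claims), not components defined by the patent.

```python
def track_video_text(frames, detector, recognizer, fuser, matcher):
    """Wire together the four claimed steps; all callables are placeholders.

    frames:     video frames already extracted from the video (step 1).
    detector:   frame -> list of text regions.
    recognizer: (frame, region) -> (seq_feat, pos_feat, img_feat) (step 2).
    fuser:      three features -> fusion feature descriptor (step 3).
    matcher:    (descriptor, region, reference) -> updated reference (step 4).
    """
    reference = None
    for frame in frames:
        for region in detector(frame):
            seq_f, pos_f, img_f = recognizer(frame, region)
            descriptor = fuser(seq_f, pos_f, img_f)
            reference = matcher(descriptor, region, reference)
    return reference  # the tracking track of the text in the video
```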
PCT/CN2022/097877 2021-12-24 2022-06-09 Video text tracking method and electronic device WO2023115838A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111601711.0 2021-12-24
CN202111601711.0A CN114463376B (en) 2021-12-24 2021-12-24 Video text tracking method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023115838A1 true WO2023115838A1 (en) 2023-06-29

Family

ID=81408054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097877 WO2023115838A1 (en) 2021-12-24 2022-06-09 Video text tracking method and electronic device

Country Status (2)

Country Link
CN (1) CN114463376B (en)
WO (1) WO2023115838A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463376B (en) * 2021-12-24 2023-04-25 北京达佳互联信息技术有限公司 Video text tracking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533474A (en) * 2008-03-12 2009-09-16 三星电子株式会社 Character and image recognition system based on video image and method thereof
CN107545210A (en) * 2016-06-27 2018-01-05 北京新岸线网络技术有限公司 A kind of method of video text extraction
US20200320308A1 (en) * 2019-04-08 2020-10-08 Nedelco, Incorporated Identifying and tracking words in a video recording of captioning session
CN112954455A (en) * 2021-02-22 2021-06-11 北京奇艺世纪科技有限公司 Subtitle tracking method and device and electronic equipment
CN113392689A (en) * 2020-12-25 2021-09-14 腾讯科技(深圳)有限公司 Video character tracking method, video processing method, device, equipment and medium
CN114463376A (en) * 2021-12-24 2022-05-10 北京达佳互联信息技术有限公司 Video character tracking method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136479A1 (en) * 2018-01-08 2019-07-11 The Regents Of The University Of California Surround vehicle tracking and motion prediction
CN110874553A (en) * 2018-09-03 2020-03-10 杭州海康威视数字技术股份有限公司 Recognition model training method and device
CN109800757B (en) * 2019-01-04 2022-04-19 西北工业大学 Video character tracking method based on layout constraint
US11211053B2 (en) * 2019-05-23 2021-12-28 International Business Machines Corporation Systems and methods for automated generation of subtitles
CN111931571B (en) * 2020-07-07 2022-05-17 华中科技大学 Video character target tracking method based on online enhanced detection and electronic equipment
CN112101344B (en) * 2020-08-25 2022-09-06 腾讯科技(深圳)有限公司 Video text tracking method and device
CN112364863B (en) * 2020-10-20 2022-10-28 苏宁金融科技(南京)有限公司 Character positioning method and system for license document
CN112418216B (en) * 2020-11-18 2024-01-05 湖南师范大学 Text detection method in complex natural scene image
CN112818985A (en) * 2021-01-28 2021-05-18 深圳点猫科技有限公司 Text detection method, device, system and medium based on segmentation

Also Published As

Publication number Publication date
CN114463376A (en) 2022-05-10
CN114463376B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US10997443B2 (en) User identity verification method, apparatus and system
US20210233319A1 (en) Context-aware tagging for augmented reality environments
WO2018141232A1 (en) Image processing method, computer storage medium, and computer device
US11003896B2 (en) Entity recognition from an image
US11450027B2 (en) Method and electronic device for processing videos
US10552471B1 (en) Determining identities of multiple people in a digital image
US11335127B2 (en) Media processing method, related apparatus, and storage medium
US10839006B2 (en) Mobile visual search using deep variant coding
Wei et al. Wide area localization and tracking on camera phones for mobile augmented reality systems
CN112270686A (en) Image segmentation model training method, image segmentation device and electronic equipment
CN111639968A (en) Trajectory data processing method and device, computer equipment and storage medium
WO2023115838A1 (en) Video text tracking method and electronic device
CN113221983B (en) Training method and device for transfer learning model, image processing method and device
EP2776981A2 (en) Methods and apparatuses for mobile visual search
WO2023123923A1 (en) Human body weight identification method, human body weight identification device, computer device, and medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN111985531B (en) Method, device, equipment and storage medium for determining abnormal resource demand cluster
WO2021068524A1 (en) Image matching method and apparatus, computer device, and storage medium
US10834385B1 (en) Systems and methods for encoding videos using reference objects
Chen et al. Context-aware discriminative vocabulary learning for mobile landmark recognition
Zhang et al. Self-supervised pre-training on the target domain for cross-domain person re-identification
CN115205202A (en) Video detection method, device, equipment and storage medium
CN113421182A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
CN113591656A (en) Image processing method, system, device, equipment and computer storage medium
CN114611565A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22909171

Country of ref document: EP

Kind code of ref document: A1