CN109800757B

CN109800757B - Video character tracking method based on layout constraint

Info

Publication number: CN109800757B
Application number: CN201910006843.5A
Authority: CN
Inventors: 冯晓毅; 宋真东; 王西汉; 蒋晓悦; 夏召强; 彭进业; 谢红梅; 李会方; 何贵青
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2019-01-04
Filing date: 2019-01-04
Publication date: 2022-04-19
Anticipated expiration: 2039-01-04
Also published as: CN109800757A

Abstract

In order to solve the problem of multi-character tracking under the condition of large-amplitude camera movement, the invention provides a video character tracking method based on layout constraint. The input of the method is the character detection result of the video and the video frame, and the output is the track information after character tracking. Firstly, the character track is initialized according to the detection result of the initial video frame, and then the character track of the previous frame and the detection result of the current frame are sent to the tracking method of the invention to update the character track. The core of the text track updating is to correspond the text area detected by the current frame to the existing text track, and the process can be regarded as a data matching problem. Aiming at the problem, the invention designs a new data matching cost function and obtains the optimal matching result by solving the cost function. And repeating the track updating process until the video processing is finished, and finally obtaining a character tracking result. According to the method, the layout constraint is introduced into the data matching cost function, and the characters are tracked through the integral appearance structure among the character areas, so that the error tracking result caused by large-amplitude movement of the camera can be effectively avoided, and a better tracking effect is achieved.

Description

Video character tracking method based on layout constraint

Technical Field

The invention relates to the field of video processing, in particular to a character tracking method in a natural scene shooting video.

Background

The text in the video contains high-level semantic information and is usually closely related to the video content. Therefore, the extraction of video text plays an important role in many media analysis-based applications, such as blind assistance systems, driving assistance systems, autonomous mobile robots, and the like. The extraction of video characters generally comprises character detection and character tracking, wherein the character detection is completed to position character targets in video frame images, and the character tracking is completed to correspond the same character areas in a continuous image sequence. The text in the video usually has a temporal redundancy characteristic, i.e. the text will disappear after a certain time in the video. By utilizing the characteristic, the stability and the precision of video character detection can be improved through a character tracking technology. In addition, text tracking can also provide other relevant information for video analysis, such as: the time points of appearance and disappearance of characters in the video time sequence, the motion track of the characters in a certain period of time and the like. Some real-time processing systems can also take advantage of the temporal redundancy characteristics of text in video to increase system processing speed. It follows that text tracking techniques play an important role in video-based analytics applications.

The existing video character tracking method cannot well solve the problem of tracking multiple characters when a camera moves greatly. Because in natural scenes, text usually does not appear singly, but in a dense form. These characters often have the same size, aspect ratio and color characteristics, and most of the features extracted by the tracking algorithm cannot distinguish the characters well, which may cause a wrong match and may not track the characters correctly. This situation is exacerbated with large movements of the camera.

Based on the above problems, the invention provides a video character tracking method based on layout constraint, so as to solve multi-character tracking under large-amplitude camera movement.

Disclosure of Invention

In order to solve the problem of multi-character tracking under the condition of large-amplitude camera movement, the invention provides a video character tracking method based on layout constraint. The process flow of the invention is shown in figure 1. The input of the method is the character detection result of the video and the single-frame image of the video, and the output is the track of each character area in the video, namely the space information (position coordinate and width and height) in each frame. Firstly, initializing a character track through a character area detection result of an initial video frame, then sending the character track of a previous frame and the character area detection result of a current frame into the tracking method of the invention to update the character track, repeating the process until the video processing is finished, and finally obtaining a character tracking result. The core of the text track updating is to correspond the text area detected by the current frame to the existing text track, and the process can be regarded as a data matching problem. Aiming at the problem, the invention designs a new data matching cost function and obtains the optimal matching result by solving the cost function. According to the method, the layout constraint is introduced into the data matching cost function, and the characters are tracked through the integral appearance structure among the character areas, so that the error tracking result caused by large-amplitude movement of the camera can be effectively avoided, and a better tracking effect is achieved. The specific details of the present invention are as follows.

1. Designing data matching cost function

Firstly, a text area contained in a text track in a current frame is defined. Let the status of the ith character area in the t-th frame character track of the video be

Wherein

Is the abscissa and ordinate of the central point of the character area,

is the areaThe speed of movement in the image in the lateral and longitudinal directions,

the width and height of the text area,

extracting RGB color histograms for the color features of the character region, wherein each channel has 16 feature regions, and the number of the three channels is 48; the status of the character area in all the character tracks in the t-th frame is set as

Where i ∈ N_t，N_tIndicating the number of text regions in the t-th frame.

For every two character areas, the correlation between the position and the speed needs to be established, and the correlation is regarded as a structural constraint which is shown by formula (1):

wherein

Structural constraints representing the text regions i and j, all constraints for i being represented as

Then all text regions in the t-th frame are constrained to

The tracking task is to match the character region detection result to the existing character track, and set the p character region detection result information in the t frame as

Is the central coordinate of the detection result of the character area,

the width and height of the text region detection result are obtained. The set of the detection results of all the character areas in the t-th frame is

Wherein p ∈ M_t，M_tThe number of the character areas is detected.

Using binary symbols a in the invention^i,pShowing the matching condition of the character track and the character area detection result, when the character track i is matched with the character area detection result p, a^i,p1, otherwise a^i,p0. For the detection result of the text track in the t-1 th frame and the text area in the t-1 th frame, the data matching condition is described by formula (2):

A＝argminC(S_t-1,R_t-1,D_t) (2)

wherein A ═ { a ═ a^i,p|i∈N_t-1,p∈M_tIn the method, one character track is matched with at most one character area detection result, C (S)_t-1,R_t-1,D_t) And representing all possible pairing sets of the text tracks and the text area detection results. And the best match result is the set minimum argminC (S)_t-1,R_t-1,D_t)。

In successive frames, the mutual distance between the words with the same background does not change much. When the camera is in motion, the text should maintain a similar appearance to other text. In the character tracking, the method simultaneously considers the similarity of characters in adjacent frames and the similarity of appearance of other characters related to the characters, and the similarity of appearance of layout around the characters in the adjacent frames is layout constraint. Cost function C (S) based on layout constraint_t-1,R_t-1,D_t) As shown in equation (3):

wherein

And

the difference cost value of the text region detected by the t-1 frame text track and the t-1 frame text track is shown, and the cost is calculated by using the region size ratio and the overlapping rate, as shown in formulas (4) and (5):

in the formula

And

respectively showing the width and height of the ith character area of the t-1 th frame character track and the detection result p of the t-th frame character area,

the area of the overlap between the ith character area of the t-1 th frame character track and the minimum external surrounding frame of the p area of the detection result of the t frame character area is shown,

indicating the combined area.

In the formula (3)

The structural constraint of the t-1 frame representing the detection result p of the t-th frame character area

Prediction region

The similarity between the appearance feature of (2) and the appearance feature of the jth character area in the corresponding t-1 frame character track, and the calculation formulas are shown as (6) and (7):

in the formula H_b(s) represents the RGB color space normalized histogram feature, F_bIs the total number of features, b is the index,

including the center point coordinates of the predicted region location and the width and height of the predicted region.

2. Cost function optimization and solution

In order to simplify the calculation, the invention uses the formula (8) and the formula (9) to restrict the matching of the track and the detection result, and when the condition is not satisfied, the matching is regarded as a^i,p0. Equations (8) and (9) are as follows:

in the formula D(s)^a,s^b) Denotes s^aAnd s^bDistance between s^aAnd s^bIndicates the state between two character regions (w)^a,h^a) Is the width and height of the text region a, (Δ u)^a,b,Δv^a,b) For the text areas a and b transversely sum in the imageThe relative moving speed in the longitudinal direction is considered to be not matched when the inter-area distance and the relative speed are too large. In the present invention, τ is 10.

Finally, all the paired character region cost values can be calculated according to the formula (2), and an N is obtained_t-1×M_tThe similarity matrix of (2). The method for the alignment scheme [ J ] was used in reference 1 "Kuhn H W]Naval Research logics, 1955, (1-2):83-97. "the proposed method can calculate the best match result. And obtaining a 2 multiplied by Q matrix by comparing the similarity, wherein the matrix is a matching matrix of the index number of the character track and the index number of the character area detection result, and Q is the matching number. By using the matching matrix, new space information (position coordinates and width and height) of the existing character track in the current frame can be updated, namely character area tracking of the current frame is completed. For example: the t-1 th frame has 3 character tracks being tracked, the t-1 th frame has 3 detected character areas, and a matching matrix obtained by calculation through the algorithm of the invention is shown as (10):

the first column of the matrix represents the index number of the character track, and the second column represents the index number of the character area detection result. It indicates that the 1 st text track corresponds to the 2 nd detected text area, the 2 nd text track corresponds to the 1 st detected text area, and the 3 rd text track corresponds to the 3 rd detected text area. And replacing the corresponding coordinates and width and height information of the 3 detected character areas of the t-th frame with the space information in the corresponding character track according to the matching matrix to finish the updating of the character track of the t-th frame.

3. Advantageous effects

The method can accurately track the character track in the video when the large-scale camera moves. The invention uses the known database Minetto in the field of text tracking for testing. The Minetto database comprises 5 segments of scene text videos, and the resolution of the video frames is 640 multiplied by 480. In the testing stage, the text detection results of the video and each frame of video image are input into the tracking algorithm, and the algorithm outputs the track of each text area in the video, namely the space information (position coordinates and width and height) of the text area in each frame. The effectiveness of the algorithm is measured by calculating three known evaluation indexes of multi-target tracking accuracy (MOTP), multi-target tracking accuracy (MOTA) and track index change times (IDS). Compared With the method in the document 2 "Pei W Y, Yang C, Meng L Y, et al. scene Video Text Tracking With Graph Matching [ J ]. IEEE Access,2018,6: 19419-: the MOTP index is improved by 6 percent, the MOTA is improved by 19 percent, and the IDS is doubled.

Drawings

FIG. 1 is a flow chart of a method for video text tracking based on layout constraints.

Detailed Description

Referring to fig. 1, the specific steps of the video text tracking method based on layout constraint provided by the present invention are as follows:

step 1: input video and text detection results

The invention is based on the video character detection result. Text detection can be divided into on-line detection and off-line detection. For on-line detection, firstly inputting a video, then detecting characters frame by frame or frame skipping, inputting a detection result into the invention for character tracking, then detecting characters of the next frame, and repeating the process until the video processing is finished. For offline detection, firstly, a video is input, then character detection is carried out until the video processing is finished, and finally, the video and the detection structure of each frame are input into the invention for character tracking. The tracking method provided by the invention can be simultaneously applied to on-line detection and off-line detection.

Step 2: text track initialization

Track initialization is carried out on the detection result of the first frame of the video, each detected character area is regarded as a new character track and index numbering is carried out, and then the states S of all character areas are calculated_tSpeed (u) in its state_t,v_t) First stageThe starting point is (0,0), and t is 1. And calculating the structural constraint R between every two character areas according to the formula (1)_tAnd t is 1. Meanwhile, removing the structure constraint which does not meet the requirement according to constraint formulas (8) and (9), and recording the residual structure constraint R₁And the character track state S₁。

And step 3: text track update

And in the text track updating stage, matching the text area detection result of the t-th frame with the existing text track in the t-1 th frame, and replacing the corresponding spatial information (position coordinates and width and height) of the matched text area detection result with the spatial information of the text area in the corresponding text track. The input of the stage is the character track state S of the t-1 th frame_t-1Structural constraint R_t-1And the t-th frame character region detection result D_tAnd outputting the updated character track space information.

Step 3.1: data matching

Matching the character track of the t-1 th frame with the character area detection result of the t-1 th frame to form N_t-1×M_tAnd (4) combining the pairs. Then, the cost values of all pairs are calculated by formula (3) to obtain N_t-1×M_tThe similarity matrix of (2). Before using the formula (3), the constraint judgment is first performed by the formula (8), and when the condition is not satisfied, the calculation of the formula (3) is skipped, and the pairing cost value is set to 999. The method for the alignment scheme [ J ] was used in reference 1 "Kuhn H W]Naval Research logics, 1955, (1-2):83-97. "the proposed method can calculate the best match result. The result is a 2 × Q matrix, which is a matching matrix of the index number of the text track and the index number of the text area detection result, wherein Q is the matching number.

Step 3.1: updating matches to tracks

If the text trajectory matches the current text region detection result, then use document 3 "Links I K F. an Introduction to the Kalman Filter [ J ]]1995. "kalman filtering algorithm uses text region detection results

For in-track state

Updating and updating the normalized color histogram of the text region

Obtain a new state S_t。

Step 3.2: updating unmatched tracks

The existing character detection algorithm often has a missing detection phenomenon, so that a character track cannot be matched with a character area detection result. At this time, the updated character track state S is utilized_tAnd structural constraint R of t-1 frame_t-1The unmatched text tracks are predicted using equation (11), which is shown below:

wherein N is_rFor the number of matching to the text tracks, (x, y) is the text region center coordinate, and (Δ x, Δ y) is the center coordinate distance difference in the structural constraint. And replacing the old coordinates which are not matched with the character track with the predicted area center coordinates, recording the replacement times, and when the replacement times are more than 3, determining that the character track disappears, and deleting the track information from the character track.

Step 3.3: initializing new tracks

If the t frame character area detection result p cannot be matched with any character track, a new character track is considered to appear, and a new track state is established

And adds the new track status to the existing track.

Step 3.4: updating character track structure constraints

And calculating the structural constraint between every two character tracks according to the formula (1). Meanwhile, removing the structure constraint which does not meet the requirement according to constraint formulas (8) and (9), and recording the residual structureConstraint R_t。

Step 3.5: outputting an update track

And outputting the updated character track space information (position coordinates and width and height) of the t-th frame, and recording and updating the survival times of all tracks.

And 4, step 4: outputting character track information

And repeating the step 3 until the video processing is completed, wherein the text usually disappears after a period of time exists in the video according to the time redundancy characteristic of the text in the video. The invention filters the non-character area by using the time redundancy characteristic, judges the non-character area of the track when the survival frequency of the track is less than or equal to 15 frames, and deletes the character track information. And finally outputting the residual character track information after filtering.

Claims

1. The video character tracking method based on layout constraint is characterized in that: the method comprises the steps of data matching cost function and cost function optimization and solving, and specifically comprises the following steps:

(1) data matching cost function:

firstly, defining a character area contained in a character track in a current frame; let the status of the ith character area in the t-th frame character track of the video be

Wherein

Is the abscissa and ordinate of the central point of the character area,

the speed of movement of the region in the image in the lateral and longitudinal directions,

the width and height of the text area,

Where i ∈ N_t，N_tRepresenting the number of character areas in the t-th frame;

wherein

Then all text regions in the t-th frame are constrained to

Is the central coordinate of the detection result of the character area,

the width and height of the text region detection result are obtained, and the set of all text region detection results in the t-th frame is

Wherein p ∈ M_t，M_tDetecting the number of the character areas;

binary symbol a^i,pShowing the matching condition of the character track and the character area detection result, when the ith character area in the character track is matched with the character area detection result p, a^i,p1, otherwise a^i,p0; for the detection result of the text track in the t-1 th frame and the text area in the t-1 th frame, the data matching condition is described by formula (2):

A＝argminC(S_t-1,R_t-1,D_t) (2)

wherein A ═ { a ═ a^i,p|i∈N_t-1,p∈M_tAt most, a text track matches a text area detection result, C (S)_t-1,R_t-1,D_t) Representing all possible pairing sets of the character track and the character area detection result; and the best match result is the set minimum argminC (S)_t-1,R_t-1,D_t)；

In the continuous frames, the mutual distance between the characters with the same background does not change too much; when the camera is in motion, the text should maintain a similar appearance to other text; in the character tracking, the similarity of characters in adjacent frames and the similarity of other character appearances related to the characters are simultaneously considered, and the similarity of the layout appearances around the characters in the adjacent frames is the layout constraint; cost function C (S) based on layout constraint_t-1,R_t-1,D_t) As shown in equation (3):

wherein

And

representing the difference cost value of the text track of the t-1 th frame and the text area detected by the t-1 th frame, and calculating the cost by using the area size ratio and the overlapping rate, as shown in formulas (4) and (5):

in the formula

And

the area of the minimum external surrounding frame overlapping of the ith character area of the t-1 th frame character track and the p area of the detection result of the t frame character area is shown,

represents the combined area;

in the formula (3)

Prediction region

the center point coordinate containing the position of the prediction area and the width and the height of the prediction area are included;

(2) optimizing and solving a cost function:

in order to simplify the calculation, matching of the trajectory and the detection result is constrained using formula (8) and formula (9), and when the condition is not satisfied, it is regarded as a^i,p0; equations (8) and (9) are as follows:

in the formula s^aAnd s^bThe state between two character areas is shown, when the distance between the areas and the relative speed are too large, the two areas are considered not to be matched, and the value of tau is 10;

finally, calculating the cost values of all the matched character areas according to the formula (2) to obtainTo a N_t-1×M_tThe similarity matrix of (2 x Q) is obtained by comparing the similarity, the matrix is a matching matrix of the index number of the character track and the index number of the character area detection result, wherein Q is the matching number, and the matching matrix can be used for updating the new position coordinate and width and height of the existing character track in the current frame, namely completing character area tracking of the current frame.