CN116012410A - Target tracking method and device, target selection method, medium and electronic equipment


Info

Publication number
CN116012410A
Authority
CN
China
Prior art keywords
detection
target
coordinate information
ith
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111237706.6A
Other languages
Chinese (zh)
Inventor
马雪浩 (Ma Xuehao)
佘翰笙 (She Hansheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shirui Electronics Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shirui Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd, Guangzhou Shirui Electronics Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202111237706.6A priority Critical patent/CN116012410A/en
Priority to PCT/CN2022/120098 priority patent/WO2023065938A1/en
Publication of CN116012410A publication Critical patent/CN116012410A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a target tracking method and apparatus, a target selection method, a storage medium, and an electronic device, and relate to the field of computer technology. The method includes: obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, where N is a positive integer; determining a local image in the video frame according to the first coordinate information to obtain an i-th image corresponding to an i-th detection target; performing target detection processing on the i-th image through a second detection module to acquire second coordinate information of the i-th detection target, where the i-th detection target is any one of the N detection targets and i is a positive integer not greater than N; and tracking the i-th detection target based on its second coordinate information. This technical solution helps improve the accuracy and recall of target detection while preserving its real-time performance.

Description

Target tracking method and device, target selection method, medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a target tracking method and apparatus, a target selection method, a computer-readable storage medium, and an electronic device.
Background
Target detection is widely applied in robot navigation, intelligent video surveillance, industrial inspection, aerospace, and other fields. It is an important branch of image processing and computer vision, and a core component of intelligent monitoring systems. Target detection is also a basic algorithm in the field of general identity recognition and plays a vital role in downstream tasks such as face recognition, gait recognition, crowd counting, and instance segmentation. Deep-learning target detection algorithms, which use neural networks as the main model, have developed especially rapidly.
It should be noted that the information disclosed in the background section above is intended only to enhance understanding of the background of the disclosed embodiments, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a target tracking method and apparatus, a target selection method, a computer-readable storage medium, and an electronic device, which overcome, at least to some extent, the drawback in the related art that target detection cannot be optimized for both detection accuracy and detection time.
Other features and advantages of the embodiments of the present disclosure will be apparent from the following detailed description, or may be learned in part by practicing the embodiments of the disclosure.
According to a first aspect of embodiments of the present disclosure, there is provided a target tracking method, the method including: obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer; obtaining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target; performing object detection processing on the ith image through a second detection module to obtain second coordinate information of the ith detection object, wherein the ith detection object is any one of the N detection objects, and the value of i is a positive integer not greater than N; tracking of the ith detection target is achieved through the second coordinate information of the ith detection target.
In an embodiment of the present disclosure, after performing, by the first detection module, object detection processing on the video frame to obtain N detection objects in the video frame and first coordinate information of each detection object, the method further includes: and displaying the N detection targets through a screen.
In an embodiment of the present disclosure, after performing, by the first detection module, object detection processing on the video frame to obtain N detection objects in the video frame and first coordinate information of each detection object, the method further includes: and determining a coordinate list according to the first coordinate information of each detection target.
In an embodiment of the disclosure, the obtaining the local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target includes: acquiring first coordinate information corresponding to the ith detection target from the coordinate list; acquiring the video frame; and positioning in the video frame according to the first coordinate information corresponding to the ith detection target to obtain a local image of the video frame, and determining the local image as the ith image.
In an embodiment of the present disclosure, after the acquiring the ith detection target and the first coordinate information corresponding to the ith detection target in the coordinate list, the method further includes: expanding the first coordinate information to obtain expanded first coordinate information corresponding to the ith detection target; the positioning in the video frame according to the first coordinate information corresponding to the ith detection target includes: and positioning in the video frame according to the expanded first coordinate information corresponding to the ith detection target.
In an embodiment of the disclosure, the expanding the first coordinate information includes: and carrying out equal-ratio expansion on the width and the height of the first coordinate information.
In one embodiment of the present disclosure, the performing, by a second detection module, object detection processing on the ith image to obtain second coordinate information of the ith detection object includes: and performing object detection processing on the ith image through the second detection module, and if the ith image comprises the ith detection object, acquiring position information of the ith detection object in the ith image to obtain the second coordinate information.
In one embodiment of the present disclosure, the method further includes: if the ith image does not include the ith detection target, acquiring a jth image corresponding to the jth detection target and first coordinate information corresponding to the jth detection target, and performing target detection processing on the jth image through a second detection module to determine second coordinate information of the jth detection target, wherein the value of j is a positive integer not greater than N and is not equal to i.
In one embodiment of the disclosure, the performing, by the first detection module, the object detection processing on the video frame includes: in a first thread, performing target detection processing on the video frame through a first detection module; the target detection processing for the ith image by the second detection module includes: in a second thread running asynchronously with the first thread, performing object detection processing on the ith image through a second detection module; the first thread has only write permission to the coordinate list; the second thread has only read permission to the coordinate list; and the coordinate list is protected by a mutex lock.
According to a second aspect of embodiments of the present disclosure, there is provided a target selection method, the method including: responding to the received target selection instruction, and acquiring a video of the current scene; performing target detection processing on a video frame of the video through the first detection module to obtain a plurality of faces in the video frame and first coordinate information of each face; randomly acquiring a target face from the plurality of faces, and intercepting a local image of the video frame according to the first coordinate information; performing target detection processing on the local image through the second detection module so as to acquire second coordinate information of the target face; and displaying the target face according to the second coordinate information.
In an embodiment of the present disclosure, after performing, by the first detection module, object detection processing on a video frame of the video to obtain a plurality of faces in the video frame and first coordinate information of each face, the method further includes: and displaying the faces in the video frames through a screen.
In an embodiment of the present disclosure, after performing, by the first detection module, object detection processing on a video frame of the video to obtain a plurality of faces in the video frame and first coordinate information of each face, the method further includes: and determining a coordinate list according to the first coordinate information of each face.
In an embodiment of the present disclosure, the randomly acquiring a target face from the plurality of faces, and intercepting a local image of the video frame according to the first coordinate information includes: acquiring first coordinate information corresponding to the target face from the coordinate list; acquiring the video frame; and positioning in the video frame according to the first coordinate information corresponding to the target face to obtain a local image of the video frame.
In an embodiment of the present disclosure, after the obtaining the target face and the first coordinate information corresponding to the target face in the coordinate list, the method further includes: expanding the first coordinate information to obtain expanded first coordinate information corresponding to the target face; the positioning in the video frame according to the first coordinate information corresponding to the target face includes: and positioning in the video frame according to the expanded first coordinate information corresponding to the target face.
In an embodiment of the disclosure, the expanding the first coordinate information includes: and carrying out equal-ratio expansion on the width and the height of the first coordinate information.
In an embodiment of the disclosure, the performing, by a second detection module, target detection processing on the local image to obtain second coordinate information of the target face includes: and performing target detection processing on the target face through the second detection module, and if the local image comprises the target face, acquiring the position information of the target face in the local image to obtain the second coordinate information.
In one embodiment of the present disclosure, the method further includes: and if the local image does not comprise the target face, acquiring a local image corresponding to another target face and first coordinate information corresponding to the other target face, and performing target detection processing on the local image through a second detection module so as to determine second coordinate information of the other target face.
According to a third aspect of embodiments of the present disclosure, there is provided an object tracking apparatus, the apparatus comprising: a first detection module for: obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer; a local image acquisition module for: obtaining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to the ith detection target; a second detection module for: performing object detection processing on the ith image through a second detection module to determine second coordinate information of the ith detection object, wherein the ith detection object is any one of the N detection objects, and the value of i is a positive integer not greater than N; a tracking module for: tracking of the ith detection target is achieved through the second coordinate information of the ith detection target.
According to a fourth aspect of embodiments of the present disclosure, there is provided a target selection device, the device comprising: the video acquisition module is used for: responding to the received target selection instruction, and acquiring a video of the current scene; a first detection module for: performing target detection processing on a video frame of the video through the first detection module to obtain a plurality of faces in the video frame and first coordinate information of each face; a local image acquisition module for: randomly acquiring a target face from the plurality of faces, and intercepting a local image of the video frame according to the first coordinate information; a second detection module for: performing target detection processing on the local image through the second detection module so as to acquire second coordinate information of the target face; a display module for: and displaying the target face according to the second coordinate information.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the target tracking method described in the first aspect or the target selection method described in the second aspect.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the target tracking method described in the first aspect, or implements the target selection method described in the second aspect.
The target tracking method and device, the target selection method, the computer-readable storage medium and the electronic device provided by the embodiment of the disclosure have the following technical effects:
in the target tracking process provided by the embodiments of the present disclosure, a video frame is acquired; target detection processing is performed on the video frame through a first detection module to obtain a plurality of detection targets in the video frame and first coordinate information of each detection target, thereby realizing global detection of the video frame. To reduce detection time, any one of the plurality of detection targets and the image corresponding to that detection target are acquired, and target detection processing is then performed on that image through a second detection module, which can acquire second coordinate information that locates the detection target more accurately than the first coordinate information, thereby realizing local detection of the video frame. This technical solution combines global and local detection of video frames, which helps improve the accuracy and recall of target detection and helps ensure its real-time performance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates a flow chart of a target tracking method provided by an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a target tracking algorithm provided by another exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of an FSAM module;
FIG. 4 schematically shows a block diagram of a target detection model;
FIG. 5 schematically illustrates a flow chart of a partial image determination process provided by an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flowchart for acquiring second coordinate information provided by an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a method of selecting a target provided by another embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a target tracking apparatus provided by an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a target selection device provided by an embodiment of the present disclosure;
fig. 10 schematically illustrates a block diagram of an electronic device provided by an embodiment of the disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure, as detailed in the accompanying claims.
In describing the embodiments of the present disclosure, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of such terms in the embodiments of the present disclosure will be understood by those of ordinary skill in the art according to the specific context. Furthermore, in the description of the embodiments of the present disclosure, unless otherwise indicated, "a plurality" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The steps of the object tracking method in the present exemplary embodiment will be described in more detail with reference to the accompanying drawings and examples.
Fig. 1 schematically illustrates a flowchart of a target tracking method in an exemplary embodiment according to the present disclosure. Referring to fig. 1, the method includes the following steps:
in S101, a video frame is acquired, and a first detection module performs object detection processing on the video frame to obtain N detection objects in the video frame and first coordinate information of each detection object, where N is a positive integer;
in S102, determining a local image in a video frame according to first coordinate information to obtain an ith image corresponding to an ith detection target;
in S103, performing object detection processing on the ith image by using a second detection module to obtain second coordinate information of the ith detection object, where the ith detection object is any one of N detection objects, and the value of i is a positive integer not greater than N;
in S104, tracking of the i-th detection target is achieved by the second coordinate information of the i-th detection target.
In the target tracking process provided by the embodiment shown in fig. 1, a video frame is acquired; target detection processing is performed on the video frame through a first detection module to obtain a plurality of detection targets in the video frame and first coordinate information of each detection target, thereby realizing global detection of the video frame. To reduce detection time, any one of the plurality of detection targets and the image corresponding to that detection target are acquired, and target detection processing is then performed on that image through a second detection module, which can acquire second coordinate information that locates the detection target more accurately than the first coordinate information, thereby realizing local detection of the video frame. This technical solution combines global and local detection of video frames, which helps improve the accuracy and recall of target detection and helps ensure its real-time performance.
In an exemplary embodiment, fig. 2 schematically illustrates a flow chart of a target tracking method in another exemplary embodiment according to the present disclosure. The following describes in detail the implementation of the steps involved in the embodiment shown in fig. 1 with reference to fig. 2:
referring to the flowchart of the target tracking algorithm shown in fig. 2, S101 is illustratively implemented by a first thread, and S102, S103, and S104 are implemented by a second thread running asynchronously with the first thread. The position information of targets is determined through a dual-thread algorithm: the first thread is mainly responsible for maintaining and updating a coordinate list, which stores the first coordinate information of each detection target; the second thread is mainly responsible for acquiring coordinate information of higher accuracy (i.e., the second coordinate information). The main flow of the algorithm is as follows:
in the first thread, a video frame is first acquired; the first detection module is called on the acquired video frame to detect targets, obtaining the coordinate information of all targets in the video frame, which is stored in a coordinate list. To keep the coordinate information accurate, the first thread may be called in a loop to maintain an updated coordinate list.
Because the resolution of the image input to the first detection module is high (1024×1024), detection takes a long time (> 500 ms), so the position coordinates stored in the coordinate list have a certain lag; that is, the position coordinates of a target are not accurate enough and give only its approximate position. Further processing by the second thread is required to obtain the exact location of the target.
In the second thread, a target and its coordinate information are randomly selected from the coordinate list. The video frame is then acquired, and the width and height of the target's coordinates are expanded in equal proportion, for example to 1.5 times their original values. A partial image is cropped from the video frame based on the expanded position coordinates, and the second detection module is called on the obtained partial image to detect the target. If the target is detected, its position coordinate information is output to the upper-layer application; if no target is detected, the second thread is called again, randomly selecting another target and its coordinate information from the coordinate list until a target is detected.
Because the resolution of the image input to the second detection module is low (256×256), detection takes little time (< 30 ms). The position coordinates of the target can therefore be further corrected and refined on the basis of the coordinate list, yielding accurate position coordinates of the target.
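To make this dual-thread structure concrete, the following is a minimal Python sketch of the flow described above. It is an illustration under stated assumptions, not the patent's implementation: the helper names grab_frame, detect_global, detect_local, report, and crop_expanded (sketched further below) are hypothetical stand-ins for the camera interface, the first and second detection modules, the upper-layer application, and the expand-and-crop step.

```python
import random
import threading
import time

coord_list = []                # first coordinate information of all targets
coord_lock = threading.Lock()  # mutex protecting the coordinate list

def first_thread(grab_frame, detect_global):
    """Global detection loop: maintains and updates the coordinate list."""
    while True:
        frame = grab_frame()
        boxes = detect_global(frame)   # slow high-resolution detection (>500 ms)
        with coord_lock:               # only this thread writes the list
            coord_list[:] = boxes

def second_thread(grab_frame, detect_local, report):
    """Local detection loop: refines one randomly drawn target at a time."""
    while True:
        with coord_lock:               # only this thread reads the list
            boxes = list(coord_list)   # copy a snapshot, then release the lock
        if not boxes:
            time.sleep(0.01)
            continue
        box = random.choice(boxes)                # the i-th detection target
        patch = crop_expanded(grab_frame(), box)  # expand 1.5x and crop
        refined = detect_local(patch)             # fast low-res detection (<30 ms)
        if refined is not None:
            report(refined)            # second coordinate info to upper layer
```

Copying a snapshot inside the lock and working outside it keeps the critical section short, so the writer is never blocked for the duration of a detection call.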
The following describes the specific embodiment of S101 in fig. 1 in detail with reference to fig. 3 and 4:
in S101, a video frame is acquired, and a target detection process is performed on the video frame by a first detection module, so as to obtain N detection targets in the video frame, and first coordinate information of each detection target, where N is a positive integer.
The first detection module can realize target detection through a neural network and a YOLO target detection algorithm. In the embodiments of the present disclosure, the first detection module is implemented based on the YOLO (You Only Look Once) target detection algorithm and the FSAM (Fully Connected Spatial Attention Module) provided by the embodiments of the present disclosure. The specific implementation is as follows:
in the disclosed embodiments, the FSAM is obtained by combining the spatial perception of a fully connected neural network with a convolutional neural network for target detection. The FSAM overcomes the problem that a single convolution layer, limited by the size of its convolution kernel, has limited global spatial perception; it thereby increases the context and global awareness of the whole network and improves the model's performance.
Refer to the FSAM module architecture diagram shown in fig. 3. The FSAM is composed of convolution layers and fully connected layers, forming two branches in total. The detailed workflow is as follows:
Assume a three-dimensional tensor feature map A with data arrangement [C, H, W] (C is channel, H is height, W is width). In the right branch shown in fig. 3, it first passes through a convolution layer (Convolution Layer), then through an extrusion convolution layer (Extrusion Convolution Layer) that reduces the dimension by compressing the channel dimension of feature map A; this dimension-reduction operation mainly reduces the amount of computation. A tensor flattening operation (Flatten) then flattens the [C, H, W] three-dimensional tensor into a [C, (H×W)] two-dimensional tensor. Feature extraction is then performed through two fully connected layers (Full Connection Layer); since every hidden node of a fully connected layer is connected to every input, more global spatial feature information can be extracted. The output values are then activated to between 0 and 1 by a Sigmoid activation (Sigmoid). Finally, tensor reshaping (Reshape) turns the two-dimensional tensor back into a three-dimensional tensor, which is multiplied (Multiply) with the features extracted from feature map A in the left branch after two convolution layers (Convolution Layer), giving the final output feature.
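As a concrete reading of this workflow, here is a PyTorch sketch of the FSAM block. It is a sketch under assumptions: the text fixes only the structure (conv, extrusion conv, flatten, two fully connected layers, sigmoid, reshape, multiply with the left branch), so the kernel sizes, hidden width, and the choice to squeeze the channel dimension down to 1 are invented for illustration.

```python
import torch
import torch.nn as nn

class FSAM(nn.Module):
    """Fully Connected Spatial Attention Module, per the two-branch description."""
    def __init__(self, channels: int, h: int, w: int, hidden: int = 256):
        super().__init__()
        # Left branch: two convolution layers extract features from map A.
        self.left = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Right branch: convolution, then an extrusion convolution that
        # compresses the channel dimension (here to 1) to cut computation.
        self.right_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.extrusion = nn.Conv2d(channels, 1, 1)
        # Two fully connected layers over the flattened (H*W) spatial axis:
        # every hidden node sees every spatial position, giving global context.
        self.fc = nn.Sequential(
            nn.Linear(h * w, hidden), nn.ReLU(),
            nn.Linear(hidden, h * w),
        )
        self.h, self.w = h, w

    def forward(self, a: torch.Tensor) -> torch.Tensor:  # a: [B, C, H, W]
        feat = self.left(a)                      # left-branch features
        x = self.extrusion(self.right_conv(a))   # [B, 1, H, W] after squeeze
        x = x.flatten(2)                         # flatten to [B, 1, H*W]
        attn = torch.sigmoid(self.fc(x))         # activate to (0, 1)
        attn = attn.view(-1, 1, self.h, self.w)  # reshape to a spatial map
        return feat * attn                       # multiply with left branch
```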
Refer to the target detection model structure diagram shown in fig. 4. The YOLO target detection algorithm is improved based on the FSAM described above to obtain the target detection model. The model takes an image as input (Input), passes it through a backbone network composed of stacked FSAM blocks, and finally regresses the position coordinates of targets through 3 detection branches (YOLO detector heads, each using a YOLOv3 detection head).
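Continuing the sketch, a toy skeleton of the overall model using the FSAM class above: a backbone of stacked FSAM blocks followed by three detection branches. Real YOLOv3 heads predict at three scales with anchor boxes; here each head is reduced to a single 1×1 convolution, and all sizes are assumptions.

```python
class FSAMDetector(nn.Module):
    """Toy detector: stacked-FSAM backbone plus three YOLO-style branches."""
    def __init__(self, channels=32, h=64, w=64, blocks=3, anchors=3, classes=1):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.backbone = nn.Sequential(*[FSAM(channels, h, w) for _ in range(blocks)])
        out = anchors * (5 + classes)  # 4 box coords + objectness + class scores
        self.heads = nn.ModuleList([nn.Conv2d(channels, out, 1) for _ in range(3)])

    def forward(self, img: torch.Tensor):        # img: [B, 3, h, w]
        x = self.backbone(self.stem(img))
        return [head(x) for head in self.heads]  # three detection branches
```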
Illustratively, after the video frame in step S101 is input to the first detection module, all targets in the video frame can be obtained; their number is denoted N. A border (bounding box) identifies the coordinate information of each target, i.e., the first coordinate information. The first coordinate information comprises the frame coordinate code [x_center, y_center, width, height] of the target, where x_center and y_center represent the center position of the border, and width and height represent its width and height, respectively.
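For reference, this center encoding converts to corner coordinates as follows (a trivial sketch; the function name is ours):

```python
def box_to_corners(box):
    """[x_center, y_center, width, height] -> (left, top, right, bottom)."""
    x_c, y_c, w, h = box
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)
```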
Through the embodiment corresponding to S101, each target in the video frame and the first coordinate information corresponding to each target can be obtained. Illustratively, the information obtained from global detection is saved to a coordinate list for use by the second thread. Illustratively, as shown in fig. 2, the coordinate list is kept up to date by calling the first thread in a loop, thereby enabling global detection of the entire video frame.
Because the resolution of the image detected in the first thread is high (e.g., 1024×1024), detection takes a long time (> 500 ms), giving the position coordinates stored in the coordinate list a certain lag; that is, the position coordinates are not accurate enough and give only the approximate position of each target in the video frame, i.e., the first coordinate information. To obtain more accurate position information of each target in the video frame, S102-S104 are executed: local detection is implemented on the video frame through the second thread, and target tracking is achieved through the second coordinate information obtained from local detection.
In an exemplary embodiment, the ith detection target may be randomly determined from the N detection targets determined in S101, thereby adding interest to target tracking.
In another exemplary embodiment, the N detection targets detected in the video frame may also be displayed on the screen after S101 is performed, for the user to select a tracking target. Illustratively, each detection target (e.g., a human head image) corresponds to a control, so that the user determines at least one tracking target (i.e., the i-th detection target of S102-S104) from the N detection targets by touching the corresponding control.
The following describes the specific embodiment of S102 in fig. 1 in detail with reference to fig. 5:
in S102, a local image in the video frame is determined according to the first coordinate information, and an ith image corresponding to the ith detection target is obtained.
In an exemplary embodiment, as a specific implementation of S102, fig. 5 schematically illustrates a flowchart of a partial-image determination process provided by an embodiment of the present disclosure, applied to the second thread. Referring to fig. 5, the embodiment shown in this figure includes S501-S503.
The following also describes in detail the implementation of the steps involved in the embodiment shown in fig. 5 with reference to fig. 2:
In S501, the i-th detection target and the first coordinate information corresponding to the i-th detection target are acquired in the coordinate list.
As shown in fig. 2, after the first thread is executed, the second thread randomly extracts a target and the first coordinate information corresponding to the target from the coordinate list obtained by the first thread, and marks the target as the i-th detection target and the first coordinate information corresponding to the i-th detection target.
In S502, a video frame is acquired.
In S503, positioning is performed in the video frame according to the first coordinate information corresponding to the i-th detection target, so as to obtain a local image of the video frame, and the local image is determined as the i-th image.
In an exemplary embodiment, the same video frame input to the first thread is acquired. According to the first coordinate information corresponding to the i-th detection target acquired in S501, the image enclosed by the frame coordinate code is selected from the video frame based on the position of the border in the first coordinate information, obtaining a local image, which is determined to be the i-th image.
Because the i-th image is a local area of the whole video frame and has a low resolution (for example, 256×256), local detection takes relatively little time (< 30 ms). The position coordinates of the target can therefore be further corrected and refined on the basis of the coordinate list, yielding accurate position coordinates of the target.
In an exemplary embodiment, before the obtained i-th image is input into the second detection module, the frame coordinates of the i-th image may be expanded to increase the probability and accuracy of detecting the target. For example, x_center and y_center in the frame coordinate code [x_center, y_center, width, height] remain unchanged while width and height are each expanded by 50%, i.e., to 1.5 times their original values, giving expanded first coordinate information. The image enclosed by the expanded frame coordinate code is selected from the video frame according to this coordinate information, obtaining an expanded local image. Target detection performed on this image gives a more accurate result than on the unexpanded partial image.
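A sketch of this expand-and-crop step, using the box_to_corners helper above. Keeping the center fixed, width and height are scaled by 1.5; clamping to the frame bounds is our assumption, since a border near the image edge would otherwise expand out of range. The frame is assumed to be a NumPy-style array indexed [row, column].

```python
def crop_expanded(frame, box, scale=1.5):
    """Expand a border 1.5x about its center, clamp it, and crop the i-th image."""
    x_c, y_c, w, h = box
    left, top, right, bottom = box_to_corners([x_c, y_c, w * scale, h * scale])
    fh, fw = frame.shape[:2]
    left, top = max(0, int(left)), max(0, int(top))            # clamp to frame
    right, bottom = min(fw, int(right)), min(fh, int(bottom))
    return frame[top:bottom, left:right]                       # the local image
```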
The following describes the specific embodiment of S103 in fig. 1 in detail with reference to fig. 6:
in S103, performing object detection processing on the ith image by using the second detection module to obtain second coordinate information of the ith detection object, where the ith detection object is any one of the N detection objects, and the value of i is a positive integer not greater than N.
In an exemplary embodiment, as a specific implementation of S103, fig. 6 schematically illustrates a flowchart for acquiring the second coordinate information, provided by an embodiment of the present disclosure and applied to the second thread. Referring to fig. 6, the embodiment shown in this figure includes S601-S604.
In S601, object detection processing is performed on the i-th image by the second detection module.
In S602, it is determined whether the i-th image includes the i-th detection target.
In an exemplary embodiment, the second detection module may also implement target detection through a neural network and YOLO target detection algorithm. The second detection module may also employ the above-described object detection model shown in fig. 4 of the present disclosure. Therefore, the specific embodiment of the second detection module may refer to the corresponding embodiment of fig. 3 and fig. 4, and will not be described herein.
In an exemplary embodiment, the i-th detection target is any one of the N detection targets described above. If the obtained i-th image includes the corresponding i-th detection target acquired in the first thread, S603 is executed: the position information of the i-th detection target in the i-th image is acquired to obtain the second coordinate information. The frame coordinate code of the i-th detection target is the second coordinate information, which is output to the upper-layer application.
In an exemplary embodiment, if the obtained i-th image does not include the corresponding i-th detection target acquired in the first thread, S604 is performed: and acquiring a j-th image corresponding to the j-th detection target and first coordinate information corresponding to the j-th detection target, and performing target detection processing on the j-th image through a second detection module to determine second coordinate information of the j-th detection target.
That is, another detection target in the coordinate list besides the i-th detection target is selected, and target detection is performed on the next partial image. For example: the j-th detection target and its corresponding first coordinate information are selected, target detection processing is performed on the j-th image through the second detection module to determine second coordinate information of the j-th detection target, and that second coordinate information is output to the upper-layer application.
And so on: the second thread is called in a loop until all N detection targets in the coordinate list have been detected.
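This retry loop of S601-S604 can be sketched as follows (again relying on the hypothetical crop_expanded and detect_local helpers introduced above; the max_tries guard is our addition so a persistently undetectable target cannot spin the loop forever):

```python
import random

def refine_all(frame, boxes, detect_local, max_tries=100):
    """Draw targets from the coordinate list until all N are refined."""
    remaining, results = list(boxes), []
    tries = 0
    while remaining and tries < max_tries:
        tries += 1
        box = random.choice(remaining)          # the i-th, then j-th, target
        refined = detect_local(crop_expanded(frame, box))
        if refined is not None:                 # target found in local image
            results.append(refined)             # its second coordinate info
            remaining.remove(box)
        # if not found, another target is drawn on the next iteration
    return results
```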
With continued reference to fig. 1, in S104, tracking of the i-th detection target is achieved by the second coordinate information of the i-th detection target.
In an exemplary embodiment, more accurate position information of the ith detection target can be obtained according to the second coordinate information of the ith detection target, and further, detection and tracking of the target are realized according to the more accurate second coordinate information.
It should be noted that the first thread and the second thread run asynchronously and independently, without blocking each other. The two threads communicate data through the coordinate list, where the first thread has only write permission to the coordinate list and the second thread has only read permission. The coordinate list is protected by a mutex lock, which prevents the second thread from reading the coordinate list while the first thread is writing to it, ensuring that the data is always read completely.
In an exemplary embodiment, fig. 7 schematically illustrates a flow chart of a method of object selection in another exemplary embodiment according to the present disclosure. The following describes in detail the implementation of the steps involved in the embodiment shown in fig. 7:
in S701, in response to a received target selection instruction, a video of a current scene is acquired.
In S702, a target detection process is performed on a video frame of a video by a first detection module, so as to obtain a plurality of faces in the video frame and first coordinate information of each face.
In S703, a target face is randomly acquired from the plurality of faces, and a partial image of the video frame is captured according to the first coordinate information.
In S704, the local image is subjected to target detection processing by the second detection module to obtain second coordinate information of the target face.
In S705, a target face is displayed according to the second coordinate information.
In an exemplary embodiment, in a classroom scene, the video frame may be obtained by capturing a real-time picture of the classroom through the camera of an all-in-one lottery machine in the classroom. The processor of the all-in-one lottery machine then executes the embodiment corresponding to S701-S702, and the positions of all student faces in the classroom can be identified through the first detection module (that is, the plurality of faces and the first coordinate information of each face are obtained). Because the resolution of the image input to the first detection module is high (1024×1024), detection takes a long time (> 500 ms), so the first coordinate information has a certain lag; that is, the position coordinates of a student's face are not accurate enough and give only the approximate position of the target. Further processing by the second detection module is therefore required.
In an exemplary embodiment, S703 is performed to randomly determine the target face among the plurality of student faces; determining the target face by random selection and choosing students for classroom interaction can liven up the classroom atmosphere. In another exemplary embodiment, the detected faces may also be displayed on a screen after S702 is performed, for the user (e.g., the lecturing teacher) to select any one of the students. Illustratively, a control is associated with each student face, so that the user determines at least one student face (i.e., the target face) by touching the associated control.
Further, the processor of the all-in-one lottery machine executes the embodiment corresponding to S704 and can identify the target face and its second coordinate information. By executing the embodiment corresponding to S705, the target face is displayed, thereby realizing a lottery among the students in the classroom.
It should be noted that, the working principle of the first detection module in S702 is the same as that of the first detection module in S101, and will not be described herein. The working principle of the second detection module in S704 is the same as that of the second detection module in S103, and will not be described here again. The embodiment for acquiring the local image in S703 is the same as the embodiment for acquiring the i-th image in S102, and will not be described here again.
In an exemplary embodiment, S701 and S702 are implemented by a first thread, and S703 and S704 are implemented by a second thread running asynchronously with the first thread. The position information of the target face is determined through a dual-thread algorithm: the first thread is mainly responsible for maintaining and updating a coordinate list, which stores the first coordinate information of each face; the second thread is mainly responsible for acquiring coordinate information of higher precision (i.e., the second coordinate information). The main flow of the algorithm is as follows:
in the first thread, in response to the received target selection instruction, a video of the current scene is first acquired; the first detection module is called on the acquired video frame to detect targets, obtaining the coordinate information of all faces in the video frame and storing it in a coordinate list. That is, in an exemplary embodiment, after the plurality of student faces in the video frame and the first coordinate information of each student face are obtained by the first detection module, the first coordinate information of each student face may be stored in a coordinate list.
Because the resolution of the image input to the first detection module is high (1024×1024), detection takes a long time (> 500 ms), so the position coordinates stored in the coordinate list have a certain lag; that is, the position coordinates of a face are not accurate enough and give only the approximate position of the target. Further processing by the second thread is required to obtain the exact location of the target.
In the second thread, a target face and its coordinate information are first selected from the coordinate list. The video frame is then acquired, and the width and height of the target face's coordinates are expanded in equal proportion, for example to 1.5 times their original values. That is, in an exemplary embodiment, the first coordinate information may be expanded (for example, by expanding the width and height of the coordinates in equal proportion) to obtain expanded first coordinate information corresponding to the randomly selected student face, and positioning is performed in the video frame according to that expanded first coordinate information. The local image obtained after expansion is more likely to contain the student's face, making the detection result more accurate.
Further, a local image of the video frame is cropped from the video frame based on the expanded position coordinates of the target face, and the second detection module is called on the obtained local image to detect the target. If the target face is detected, its second coordinate information is output to the upper-layer application (S705: displaying the target face according to the second coordinate information); if the target face is not detected, the second thread is called again, selecting another target face and its coordinate information from the coordinate list until a target face is detected.
Because the resolution of the image input to the second detection module is low (256×256), detection takes little time (< 30 ms). The position coordinates of the target can therefore be further corrected and refined on the basis of the coordinate list, yielding accurate position coordinates of the target.
It should be noted that the first thread and the second thread run asynchronously and independently, without blocking each other. The two threads communicate data through the coordinate list, where the first thread has only write permission to the coordinate list and the second thread has only read permission. The coordinate list is protected by a mutex lock, which prevents the second thread from reading the coordinate list while the first thread is writing to it, ensuring that the data is always read completely.
The target detection model provided by the present disclosure can receive a continuous stream of video frames and output the position information of targets in the picture in real time. The target detection model can perform target detection on each video frame and output the position information and confidence of the targets detected in each frame. Compared with a traditional target detection model built from a pure convolutional neural network, this target detection model has stronger global spatial perception and a better detection effect.
A target lottery algorithm can be designed in combination with the target detection model. For example, when the lottery starts, a timer drives the draw. The timer interval is gradually shortened, so the highlighted frame changes more and more quickly, and each random draw is guaranteed to differ from the previous result, avoiding the appearance that the frame has not changed because two consecutive draws gave the same result. When the lottery ends, the timer interval is gradually lengthened, the frame changes more and more slowly, and it finally stays on the selected target, displaying the final result.
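A sketch of such an easing timer follows. The interval values, step counts, and the show callback (which would move the highlighted frame on screen) are all invented for illustration; the two properties taken from the text are that the interval first shrinks and then grows, and that consecutive draws never repeat.

```python
import random
import time

def run_lottery(candidates, show, spin_steps=20, slow_steps=10):
    """Animate a draw: speed up, slow down, and stop on the selected target."""
    pick = None
    for step in range(spin_steps + slow_steps):
        if step < spin_steps:                    # spin-up: interval shortens
            interval = max(0.05, 0.30 - 0.0125 * step)
        else:                                    # wind-down: interval grows
            interval = 0.05 + 0.05 * (step - spin_steps)
        pool = [c for c in candidates if c != pick] or candidates
        pick = random.choice(pool)               # never equal to the last draw
        show(pick)                               # move the highlighted frame
        time.sleep(interval)
    return pick                                  # the frame stays on this one
```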
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Wherein fig. 8 illustrates a block diagram of a target tracking apparatus in an exemplary embodiment according to the present disclosure. Referring to fig. 8, an object tracking device 800 in an embodiment of the present disclosure includes: a first detection module 801, a local image acquisition module 802, a second detection module 803, and a tracking module 804, wherein:
the first detection module 801 is configured to: obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer;
The local image acquisition module 802 is configured to: determining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target;
the second detection module 803 is configured to: performing target detection processing on the ith image through a second detection module to determine second coordinate information of the ith detection target, wherein the ith detection target is any one of N detection targets, and the value of i is a positive integer not greater than N;
the tracking module 804 is configured to: tracking of the ith detection target is achieved through the second coordinate information of the ith detection target.
It should be noted that, when the target tracking apparatus provided in the above embodiment performs the target tracking method, the division into the above functional modules is only an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the target tracking apparatus and the target tracking method provided in the foregoing embodiments belong to the same concept, so for details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the target tracking method described in the present disclosure; details are not repeated here.
Wherein fig. 9 illustrates a block diagram of a target selecting device in an exemplary embodiment according to the present disclosure. Referring to fig. 9, a target selecting apparatus 900 in an embodiment of the present disclosure includes: a video acquisition module 901, a first detection module 902, a partial image acquisition module 903, a second detection module 904, and a display module 905, wherein:
the video acquisition module 901 is configured to: responding to the received target selection instruction, and acquiring a video of the current scene;
the first detection module 902 is configured to: performing target detection processing on a video frame of the video through the first detection module to obtain a plurality of faces in the video frame and first coordinate information of each face;
the local image acquisition module 903 is configured to: randomly acquiring a target face from the plurality of faces, and intercepting a local image of the video frame according to the first coordinate information;
the second detection module 904 is configured to: performing target detection processing on the local image through the second detection module so as to acquire second coordinate information of the target face;
the display module 905 is configured to: and displaying the target face according to the second coordinate information.
It should be noted that, when the target selection device provided in the above embodiment performs the target selection method, the division into the above functional modules is only an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the target selection device and the target selection method provided in the foregoing embodiments belong to the same concept, so for details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the target selection method described in the present disclosure; details are not repeated here.
The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods of the previous embodiments. The computer readable storage medium may include, among other things, any type of disk including floppy disks, optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
The disclosed embodiments also provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods of the embodiments described above when the program is executed by the processor.
Fig. 10 schematically illustrates a block diagram of an electronic device in an exemplary embodiment according to the present disclosure. Referring to fig. 10, an electronic device 1000 includes: a processor 1001 and a memory 1002.
In the embodiment of the disclosure, the processor 1001 is the control center of the computer system and may be the processor of a physical machine or the processor of a virtual machine. The processor 1001 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor; the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state.
In the embodiment of the present disclosure, the processor 1001 is specifically configured to:
obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer; obtaining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target; performing target detection processing on the ith image through a second detection module to obtain second coordinate information of the ith detection target, wherein the ith detection target is any one of the N detection targets, and the value of i is a positive integer not greater than N; and achieving tracking of the ith detection target through the second coordinate information of the ith detection target.
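As a concrete illustration of this coarse-to-fine flow, the following is a minimal Python sketch; it is not part of the disclosure. The names coarse_detector and fine_detector are hypothetical stand-ins for the first and second detection modules, each assumed to take an image and return a list of (x, y, w, h) boxes, and video_frame is assumed to be a NumPy-style array indexed as [row, column].

```python
def track_targets(video_frame, coarse_detector, fine_detector):
    # First stage: detect all N targets on the full frame
    # (the first coordinate information).
    first_coords = coarse_detector(video_frame)

    second_coords = []
    for x, y, w, h in first_coords:
        # Cut the ith local image out of the frame.
        local_image = video_frame[y:y + h, x:x + w]
        # Second stage: re-detect inside the much smaller local image.
        boxes = fine_detector(local_image)
        if boxes:
            bx, by, bw, bh = boxes[0]
            # Map the local box back to frame coordinates
            # (the second coordinate information).
            second_coords.append((x + bx, y + by, bw, bh))
    return second_coords
```

Running the heavier second-stage detector only on small crops rather than on the full frame is what allows the scheme to improve accuracy and recall while keeping the per-frame cost low enough for real-time tracking.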
Further, the processor 1001 is further configured to: after the target detection processing is performed on the video frame through the first detection module to obtain the N detection targets in the video frame and the first coordinate information of each detection target, display the N detection targets through a screen.
Further, the processor 1001 is further configured to: after the target detection processing is performed on the video frame through the first detection module to obtain the N detection targets in the video frame and the first coordinate information of each detection target, determine a coordinate list according to the first coordinate information of each detection target.
Further, the obtaining the local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target includes: acquiring first coordinate information corresponding to the ith detection target from the coordinate list; acquiring the video frame; and positioning in the video frame according to the first coordinate information corresponding to the ith detection target to obtain a local image of the video frame, and determining the local image as the ith image.
Further, the processor 1001 is further configured to: after the ith detection target and the first coordinate information corresponding to the ith detection target are acquired from the coordinate list, expand the first coordinate information to obtain expanded first coordinate information corresponding to the ith detection target; the positioning in the video frame according to the first coordinate information corresponding to the ith detection target then includes: positioning in the video frame according to the expanded first coordinate information corresponding to the ith detection target.
Further, the expanding the first coordinate information includes: performing equal-ratio expansion on the width and the height of the first coordinate information.
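The following is a minimal sketch of one way to realize the equal-ratio expansion, assuming the box is given as (x, y, w, h) in pixels; the expansion ratio of 1.5 is an assumed illustrative value, not one specified by the disclosure.

```python
def expand_box(x, y, w, h, frame_w, frame_h, ratio=1.5):
    # Scale width and height by the same ratio about the box centre
    # (equal-ratio expansion), then clamp to the frame bounds so the
    # subsequent crop stays valid.
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * ratio, h * ratio
    x0 = max(0, int(cx - new_w / 2))
    y0 = max(0, int(cy - new_h / 2))
    x1 = min(frame_w, int(cx + new_w / 2))
    y1 = min(frame_h, int(cy + new_h / 2))
    return x0, y0, x1 - x0, y1 - y0
```

Expanding before cropping gives the second detection module some surrounding context, so a target that has moved slightly since the first-stage detection is still likely to fall inside the local image.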
Further, the performing, by the second detection module, target detection processing on the ith image to obtain second coordinate information of the ith detection target includes: performing target detection processing on the ith image through the second detection module, and if the ith image includes the ith detection target, acquiring position information of the ith detection target in the ith image to obtain the second coordinate information.
Further, the processor 1001 is further configured to: if the ith image does not include the ith detection target, acquire a jth image corresponding to a jth detection target and first coordinate information corresponding to the jth detection target, and perform target detection processing on the jth image through the second detection module to determine second coordinate information of the jth detection target, wherein j is a positive integer not greater than N and not equal to i.
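One possible reading of this fallback, sketched below under the same hypothetical detector interface as the earlier snippets: if the ith local image no longer contains its target, the loop simply moves on to another cached target rather than stalling.

```python
def detect_with_fallback(local_images, first_coords, fine_detector):
    # Try each cached target in turn; skip any whose local image
    # no longer contains its target.
    for j, (image, (x, y, _, _)) in enumerate(zip(local_images, first_coords)):
        boxes = fine_detector(image)
        if boxes:
            bx, by, bw, bh = boxes[0]
            # Second coordinate information of the jth target.
            return j, (x + bx, y + by, bw, bh)
    return None  # no target could be re-detected in this round
```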
Further, the performing, by the first detection module, the target detection processing on the video frame includes: performing, in a first thread, target detection processing on the video frame through the first detection module; the performing, by the second detection module, the target detection processing on the ith image includes: performing, in a second thread running asynchronously with the first thread, target detection processing on the ith image through the second detection module; wherein the first thread only has write permission to the coordinate list, the second thread only has read permission to the coordinate list, and the coordinate list is protected by a mutual exclusion lock.
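This single-writer/single-reader discipline maps naturally onto two threads sharing a lock-protected list. Below is an illustrative Python sketch using threading.Lock; all names are hypothetical and error handling is omitted.

```python
import threading

coord_list = []                  # shared coordinate list
coord_lock = threading.Lock()    # mutual exclusion lock protecting it
stop_event = threading.Event()

def first_thread(frames, coarse_detector):
    # Writer: the only thread allowed to update the coordinate list.
    for frame in frames:
        boxes = coarse_detector(frame)
        with coord_lock:
            coord_list[:] = boxes
    stop_event.set()

def second_thread(get_frame, fine_detector):
    # Reader: takes a snapshot under the lock, then works lock-free.
    while not stop_event.is_set():
        with coord_lock:
            boxes = list(coord_list)
        frame = get_frame()
        for x, y, w, h in boxes:
            fine_detector(frame[y:y + h, x:x + w])
```

Because the second thread copies the list under the lock and the first thread replaces it atomically, the slower full-frame detector never blocks the per-target refinement for longer than one list update.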
In the embodiment of the present disclosure, the processor 1001 is further specifically configured to:
responding to a received target selection instruction, and acquiring a video of the current scene; performing target detection processing on a video frame of the video through the first detection module to obtain a plurality of faces in the video frame and first coordinate information of each face; randomly selecting a target face from the plurality of faces, and cropping a local image from the video frame according to the first coordinate information; performing target detection processing on the local image through the second detection module to acquire second coordinate information of the target face; and displaying the target face according to the second coordinate information.
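The selection flow reuses the same two detection stages; the sketch below shows one minimal way to wire them together, again under the hypothetical detector interface assumed above.

```python
import random

def select_target_face(video_frame, coarse_detector, fine_detector):
    faces = coarse_detector(video_frame)   # all faces + first coordinates
    if not faces:
        return None
    x, y, w, h = random.choice(faces)      # randomly select a target face
    local_image = video_frame[y:y + h, x:x + w]
    boxes = fine_detector(local_image)     # refine on the local image
    if not boxes:
        return None
    bx, by, bw, bh = boxes[0]
    # Second coordinate information used to display the target face.
    return x + bx, y + by, bw, bh
```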
Further, after the target detection processing is performed on the video frame of the video by the first detection module to obtain the plurality of faces in the video frame and the first coordinate information of each face, the method further includes: displaying the faces in the video frame through a screen.
Further, after the target detection processing is performed on the video frame of the video by the first detection module to obtain the plurality of faces in the video frame and the first coordinate information of each face, the processor 1001 is further configured to: determine a coordinate list according to the first coordinate information of each face.
Further, the step of randomly selecting the target face from the plurality of faces and cropping the local image from the video frame according to the first coordinate information includes: acquiring the first coordinate information corresponding to the target face from the coordinate list; acquiring the video frame; and positioning in the video frame according to the first coordinate information corresponding to the target face to obtain the local image of the video frame.
Further, after the target face and the first coordinate information corresponding to the target face are acquired from the coordinate list, the processor 1001 is further configured to: expand the first coordinate information to obtain expanded first coordinate information corresponding to the target face; the positioning in the video frame according to the first coordinate information corresponding to the target face then includes: positioning in the video frame according to the expanded first coordinate information corresponding to the target face.
Further, the expanding the first coordinate information includes: performing equal-ratio expansion on the width and the height of the first coordinate information.
Further, the performing, by the second detection module, the target detection processing on the local image to obtain the second coordinate information of the target face includes: performing target detection processing on the local image through the second detection module, and if the local image includes the target face, acquiring the position information of the target face in the local image to obtain the second coordinate information.
Further, the processor 1001 is further configured to: if the local image does not include the target face, acquire a local image corresponding to another target face and first coordinate information corresponding to the other target face, and perform target detection processing on that local image through the second detection module to determine second coordinate information of the other target face.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. Memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments of the present disclosure, a non-transitory computer readable storage medium in memory 1002 is used to store at least one instruction for execution by processor 1001 to implement the methods in embodiments of the present disclosure.
In some embodiments, the electronic device 1000 further includes: a peripheral interface 1003, and at least one peripheral. The processor 1001, the memory 1002, and the peripheral interface 1003 may be connected by a bus or signal line. The various peripheral devices may be connected to the peripheral device interface 1003 via a bus, signal wire, or circuit board. Specifically, the peripheral device includes: at least one of a display 1004, a camera 1005, and an audio circuit 1006.
The peripheral interface 1003 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 1001 and the memory 1002. In some embodiments of the present disclosure, the processor 1001, the memory 1002, and the peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments of the present disclosure, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on a separate chip or circuit board. The embodiments of the present disclosure are not particularly limited thereto.
The display 1004 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1004 is a touch display, the display 1004 also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 1001 as a control signal for processing. At this point, the display 1004 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments of the present disclosure, there may be one display 1004, disposed on the front panel of the electronic device 1000; in other embodiments of the present disclosure, there may be at least two displays 1004, respectively disposed on different surfaces of the electronic device 1000 or in a folded design; in still other embodiments of the present disclosure, the display 1004 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 1000. The display 1004 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 1004 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera 1005 is used to capture images or video. Optionally, the camera 1005 includes a front camera and a rear camera. In general, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera may be fused to realize a background blurring function, or the main camera and the wide-angle camera may be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments of the present disclosure, the camera 1005 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash refers to a combination of a warm light flash and a cold light flash, and can be used for light compensation under different color temperatures.
The audio circuitry 1006 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and to convert the sound waves into electrical signals to be input to the processor 1001 for processing. For stereo acquisition or noise reduction, there may be multiple microphones, disposed at different locations of the electronic device 1000. The microphone may also be an array microphone or an omnidirectional pickup microphone.
The power supply 1007 is used to power the various components in the electronic device 1000. The power supply 1007 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1007 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast-charge technology.
The block diagram of the electronic device structure shown in the embodiments of the present disclosure does not constitute a limitation of the electronic device 1000, and the electronic device 1000 may include more or fewer components than illustrated, may combine some components, or may employ a different arrangement of components.
In this disclosure, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or order; the term "plurality" means two or more, unless expressly defined otherwise. The terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; "coupled" may be directly coupled or indirectly coupled through intermediaries. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
In the description of the present disclosure, it should be understood that the orientation or positional relationship indicated by the terms "upper", "lower", etc. is based on the orientation or positional relationship shown in the drawings, is used merely for convenience of describing the present disclosure and simplifying the description, and does not indicate or imply that the apparatus or unit referred to must have a specific orientation or be constructed and operated in a specific orientation; it therefore should not be construed as limiting the present disclosure.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any changes or substitutions readily conceivable by a person skilled in the art within the technical scope of the disclosure are intended to fall within the protection scope of the disclosure. Accordingly, equivalent variations made according to the claims of the present disclosure remain covered by this disclosure.

Claims (13)

1. A target tracking method, comprising:
obtaining a video frame, and performing target detection processing on the video frame through a first detection module to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer;
obtaining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target;
performing target detection processing on the ith image through a second detection module to acquire second coordinate information of the ith detection target, wherein the ith detection target is any one of the N detection targets, and the value of i is a positive integer not greater than N;
and achieving tracking of the ith detection target through the second coordinate information of the ith detection target.
2. The target tracking method according to claim 1, wherein after the target detection processing is performed on the video frame by the first detection module to obtain the N detection targets in the video frame and the first coordinate information of each detection target, the method further comprises:
displaying the N detection targets through a screen.
3. The target tracking method according to claim 1, wherein after the target detection processing is performed on the video frame by the first detection module to obtain the N detection targets in the video frame and the first coordinate information of each detection target, the method further comprises:
determining a coordinate list according to the first coordinate information of each detection target.
4. The target tracking method according to claim 3, wherein the obtaining the local image in the video frame according to the first coordinate information to obtain the ith image corresponding to the ith detection target includes:
acquiring the ith detection target and first coordinate information corresponding to the ith detection target from the coordinate list;
acquiring the video frame;
and positioning in the video frame according to the first coordinate information corresponding to the ith detection target to obtain a local image of the video frame, and determining the local image as the ith image.
5. The target tracking method according to claim 4, wherein after the ith detection target and the first coordinate information corresponding to the ith detection target are acquired from the coordinate list, the method further comprises:
expanding the first coordinate information to obtain expanded first coordinate information corresponding to the ith detection target;
the positioning in the video frame according to the first coordinate information corresponding to the ith detection target includes:
positioning in the video frame according to the expanded first coordinate information corresponding to the ith detection target.
6. The target tracking method according to claim 5, wherein the expanding the first coordinate information includes:
performing equal-ratio expansion on the width and the height of the first coordinate information.
7. The target tracking method according to any one of claims 1 to 6, wherein the performing target detection processing on the ith image through the second detection module to acquire the second coordinate information of the ith detection target comprises:
performing target detection processing on the ith image through the second detection module, and if the ith image includes the ith detection target, acquiring the position information of the ith detection target in the ith image to obtain the second coordinate information.
8. The target tracking method according to claim 7, wherein the method further comprises:
if the ith image does not comprise the ith detection target, acquiring a jth image corresponding to the jth detection target and first coordinate information corresponding to the jth detection target, and performing target detection processing on the jth image through a second detection module to determine second coordinate information of the jth detection target, wherein the value of j is a positive integer not greater than N and not equal to i.
9. The target tracking method according to claim 3, wherein
the performing target detection processing on the video frame through the first detection module comprises: performing, in a first thread, target detection processing on the video frame through the first detection module;
the performing target detection processing on the ith image through the second detection module comprises: performing, in a second thread running asynchronously with the first thread, target detection processing on the ith image through the second detection module;
the first thread only has write permission to the coordinate list, and the second thread only has read permission to the coordinate list; and the coordinate list is protected by a mutual exclusion lock.
10. A method of selecting a target, the method comprising:
responding to the received target selection instruction, and acquiring a video of the current scene;
performing target detection processing on a video frame of the video through the first detection module to obtain a plurality of faces in the video frame and first coordinate information of each face;
randomly selecting a target face from the plurality of faces, and cropping a local image from the video frame according to the first coordinate information;
performing target detection processing on the local image through the second detection module to acquire second coordinate information of the target face;
and displaying the target face according to the second coordinate information.
11. An object tracking device, comprising:
a first detection module for: obtaining a video frame, and performing target detection processing on the video frame to obtain N detection targets in the video frame and first coordinate information of each detection target, wherein N is a positive integer;
a local image acquisition module for: obtaining a local image in the video frame according to the first coordinate information to obtain an ith image corresponding to an ith detection target;
a second detection module for: performing target detection processing on the ith image to determine second coordinate information of the ith detection target, wherein the ith detection target is any one of the N detection targets, and the value of i is a positive integer not greater than N; and
a tracking module for: achieving tracking of the ith detection target through the second coordinate information of the ith detection target.
12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the object tracking method according to any one of claims 1 to 9 or the object selection method according to claim 10 when executing the computer program.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the object tracking method according to any one of claims 1 to 9, or the object selection method according to claim 10.
CN202111237706.6A 2021-10-22 2021-10-22 Target tracking method and device, target selection method, medium and electronic equipment Pending CN116012410A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111237706.6A CN116012410A (en) 2021-10-22 2021-10-22 Target tracking method and device, target selection method, medium and electronic equipment
PCT/CN2022/120098 WO2023065938A1 (en) 2021-10-22 2022-09-21 Target tracking method and apparatus, target selection method, and medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111237706.6A CN116012410A (en) 2021-10-22 2021-10-22 Target tracking method and device, target selection method, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116012410A true CN116012410A (en) 2023-04-25

Family

ID=86028688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111237706.6A Pending CN116012410A (en) 2021-10-22 2021-10-22 Target tracking method and device, target selection method, medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN116012410A (en)
WO (1) WO2023065938A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517292A (en) * 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN111027370A (en) * 2019-10-16 2020-04-17 合肥湛达智能科技有限公司 Multi-target tracking and behavior analysis detection method
CN112488057A (en) * 2020-12-17 2021-03-12 北京航空航天大学 Single-camera multi-target tracking method utilizing human head point positioning and joint point information
CN112926410B (en) * 2021-02-03 2024-05-14 深圳市维海德技术股份有限公司 Target tracking method, device, storage medium and intelligent video system

Also Published As

Publication number Publication date
WO2023065938A1 (en) 2023-04-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination