CN111222509B - Target detection method and device and electronic equipment - Google Patents


Info

Publication number
CN111222509B
CN111222509B (application CN202010052654.4A)
Authority
CN
China
Prior art keywords
video frame
target
detection
video
adjacent
Prior art date
Legal status
Active
Application number
CN202010052654.4A
Other languages
Chinese (zh)
Other versions
CN111222509A (en)
Inventor
Wang Xu (王旭)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010052654.4A
Publication of CN111222509A
Application granted
Publication of CN111222509B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The embodiments of the disclosure provide a target detection method, a target detection device and electronic equipment, belonging to the technical field of target detection. The method comprises the following steps: performing target detection on a first video frame in a target video to obtain one or more target detection results; setting a first target representation area matched with the target detection result, wherein the area of the first target representation area is larger than the actual area of the target detection result; performing key point detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame by using a preset key point model and the first target representation area; and determining a target object existing in the first video frame and the plurality of adjacent video frames based on the key point detection result. With this processing scheme, the target object can be detected and tracked rapidly.

Description

Target detection method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of target detection, and in particular to a target detection method, a target detection device and electronic equipment.
Background
Object detection, also called object extraction, is a form of image segmentation based on the geometric and statistical characteristics of an object; it combines segmentation and recognition of the object into one step, and its accuracy and real-time performance are an important capability of the whole system. Especially in complex scenes where multiple targets need to be processed in real time, automatic extraction and recognition of targets is particularly important.
With the development of computer technology and the wide application of computer vision principles, real-time target tracking using computer image processing technology is receiving growing attention. Dynamic real-time tracking and positioning of targets has wide application value in intelligent traffic systems, intelligent monitoring systems, military target detection, surgical instrument positioning in medically navigated surgery, and the like.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a target detection method, device and electronic equipment, so as to at least partially solve the problems in the prior art.
In a first aspect, an embodiment of the present disclosure provides a target detection method, including:
performing target detection on a first video frame in a target video to obtain one or more target detection results;
setting a first target representation area matched with the target detection result, wherein the area of the first target representation area is larger than the actual area of the target detection result;
performing key point detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame by using a preset key point model and the first target representation area;
and determining a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame based on the key point detection result.
According to a specific implementation manner of the embodiment of the present disclosure, after determining, based on the result of the keypoint detection, a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame, the method further includes:
acquiring a field of view in a second video frame subsequent to the plurality of adjacent video frames adjacent to the first video frame;
judging whether the field of view in the second video frame is the same as the field of view in the first video frame;
and if not, re-executing target detection on the second video frame.
According to a specific implementation manner of the embodiment of the present disclosure, after target detection is re-executed on the second video frame, the method further includes:
setting a second target representation area matched with a target detection result of the second video frame, wherein the area of the second target representation area is larger than the actual area of the target detection result of the second video frame;
performing key point detection on the second video frame and a plurality of adjacent video frames adjacent to the second video frame by using a preset key point model and the second target representation area;
and determining a target object existing in the second video frame and a plurality of adjacent video frames adjacent to the second video frame based on the result of the key point detection.
According to a specific implementation manner of an embodiment of the present disclosure, the performing target detection on a first video frame in a target video to obtain one or more target detection results includes:
and detecting the object existing in the first video frame by utilizing a sliding window detector contained in a preset target detection model, so as to form one or more target detection results.
According to a specific implementation manner of the embodiment of the present disclosure, the detecting, by using a sliding window detector included in a preset target detection model, an object existing in the first video frame includes:
acquiring a neural network model obtained by training the target detection model on training samples;
scanning the first video frame with a window of fixed size and fixed step length, and sending the image inside the window into the trained convolutional network for detection;
and detecting the presence or absence of an object, and the position of the object, by changing the size of the scanning window.
According to a specific implementation manner of the embodiment of the present disclosure, the setting a first target representation area matched with the target detection result includes:
acquiring an actual area of the target detection result on the first video frame;
expanding the area by a preset multiple in the horizontal direction and the vertical direction respectively, with the center of the actual area as the center;
and taking the expanded area as the first target representation area.
According to a specific implementation manner of the embodiment of the present disclosure, the performing, by using a preset keypoint model and the first target representation area, keypoint detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame includes:
acquiring an image in the target representation area;
and executing the key point detection on the image objects in the target representation area based on the key point model to obtain a key point detection result.
According to a specific implementation manner of the embodiment of the present disclosure, the determining, based on the result of the keypoint detection, a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame includes:
judging whether the key point detection result is matched with a preset target model or not;
if yes, the key point detection result is used as a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame.
In a second aspect, an embodiment of the present disclosure provides an object detection apparatus, including:
a first detection module, configured to perform target detection on a first video frame in a target video to obtain one or more target detection results;
the setting module is used for setting a first target representation area matched with the target detection result, and the area of the first target representation area is larger than the actual area of the target detection result;
the second detection module is used for detecting key points of the first video frame and a plurality of adjacent video frames adjacent to the first video frame by utilizing a preset key point model and the first target representation area;
and the determining module is used for determining target objects existing in the first video frame and in a plurality of adjacent video frames adjacent to the first video frame based on the key point detection result.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of the first aspect or any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of object detection in the foregoing first aspect or any implementation manner of the first aspect.
In a fifth aspect, embodiments of the present disclosure also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the object detection method of the first aspect or any implementation manner of the first aspect.
The target detection scheme in the embodiment of the disclosure comprises: performing target detection on a first video frame in a target video to obtain one or more target detection results; setting a first target representation area matched with the target detection result, wherein the area of the first target representation area is larger than the actual area of the target detection result; performing key point detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame by using a preset key point model and the first target representation area; and determining a target object existing in the first video frame and the plurality of adjacent video frames based on the key point detection result. With this processing scheme, the target object can be detected rapidly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings may be obtained from these drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a flowchart of a target detection method according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of another target detection method according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of another target detection method according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of another target detection method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an object detection device according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiment of the disclosure provides a target detection method. The object detection method provided in this embodiment may be performed by a computing device, which may be implemented as software, or as a combination of software and hardware, and the computing device may be integrally provided in a server, a client, or the like.
Referring to fig. 1, the target detection method in the embodiment of the present disclosure may include the following steps:
s101, performing target detection on a first video frame in the target video to obtain one or more target detection results.
The target video may be a video acquired in various ways, for example, the target video may be a video captured in real time by a camera or the like, or may be a pre-recorded video. The target video comprises a plurality of video frames, and the video frames of the target video comprise a plurality of objects.
As an application scene, the target video contains a plurality of automobiles, each automobile carries a license plate, and the license plates of the automobiles can be taken as the targets of target detection.
A conventional target detection method needs to segment the objects in the video, find all objects in the video by means of object segmentation, and then identify all objects in every video frame in order to recognize the object that actually needs to be identified. Because of the large number of steps involved, this approach wastes a large amount of system resources and reduces the speed and efficiency of target detection.
As a specific implementation manner of the scheme of the application, one video frame can be selected from the target video as a first video frame, and the first video frame can be a starting frame in the target video or any video frame in the target video. All objects in the first video frame can be detected in an object segmentation mode, and one or more target detection results are obtained. As an example, the first video frame may be an image including a plurality of objects such as a building, an automobile, and a person, and the object detection results including a building, a building window, an automobile license plate, a person, and a person head portrait are obtained by performing object detection on the first video frame.
The target detection results include all objects in the first video frame, while the target detection of the present application aims to find a specific target object (for example, a license plate).
S102, setting a first target representation area matched with the target detection result, wherein the area of the first target representation area is larger than the actual area of the target detection result.
Different target detection results have different sizes and, correspondingly, occupy different areas in the first video frame. To represent them, each target detection result can be assigned a target representation area (the first target representation area). The first target representation area may be a rectangular box, or it may have another type of shape; the shape of the first target representation area is not limited here.
In the process of setting the first target representation area, in order to make the first target representation area robust, the area of the first target representation area may be made larger than the actual area of the target detection result, so that even if the target detection result undergoes slight jitter in the next video frame, the first target representation area can still cover the object represented by the target detection result.
And S103, detecting key points of the first video frame and a plurality of adjacent video frames adjacent to the first video frame by utilizing a preset key point model and the first target representation area.
Through the obtained first target representation area, the object contained in the area can be roughly determined, and the object in the first target representation area then needs to be identified by means of key point detection. Key point detection can be performed by constructing a key point model; the key point model and its detection manner are not limited here, and any key point model commonly used in target detection may be employed.
In order to improve the efficiency of target detection and recognition, key point detection is performed not only on the first video frame: for the plurality of video frames adjacent to the first video frame, the first target representation area obtained from the detection on the first video frame is directly reused to perform key point detection on the objects in those adjacent video frames, without re-executing the target detection of step S101, which greatly improves the efficiency of target detection.
And S104, determining target objects existing in the first video frame and in a plurality of adjacent video frames adjacent to the first video frame based on the key point detection result.
By obtaining the results of the key point detection, the shapes of the objects existing in the first video frame and the plurality of adjacent video frames adjacent to the first video frame can be obtained, and by comparing these shapes with the shape of the target object, the target object can be obtained by screening. For example, if all license plates in the target video are detected, the result of the detection of the key points may be compared with the shape of the license plates, so as to screen out all license plates contained in the target video.
By the scheme in the embodiment, only the first video frame can be subjected to target detection, and the adjacent video frames are not subjected to target detection, so that the target detection efficiency is greatly improved.
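By way of illustration only, the following minimal sketch shows how steps S101 to S104 could fit together. The helpers detect_objects, expand_box and detect_keypoints are hypothetical stand-ins for the target detection model, the area expansion of step S102 and the preset key point model; the frame count and expansion multiple are likewise illustrative assumptions, not values taken from the disclosure.

```python
# Hypothetical pipeline for steps S101-S104: run full target detection only
# on the first video frame, then reuse the enlarged representation areas on
# the adjacent frames. Frames are assumed to be numpy-style image arrays.
def detect_and_track(frames, detect_objects, expand_box, detect_keypoints,
                     num_adjacent=5, scale=1.5):
    first = frames[0]
    # S101: target detection only on the first video frame.
    detections = detect_objects(first)
    # S102: enlarge each detected area into a target representation area.
    areas = [expand_box(box, scale, first.shape) for box in detections]
    results = []
    # S103: key point detection on the first frame and its adjacent frames,
    # reusing the same representation areas instead of re-detecting.
    for frame in frames[:1 + num_adjacent]:
        results.append([detect_keypoints(frame, area) for area in areas])
    # S104: the caller matches these key points against the target model.
    return results
```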
Referring to fig. 2, according to a specific implementation manner of the embodiment of the present disclosure, after determining the target objects existing in the first video frame and in the plurality of adjacent video frames adjacent to the first video frame based on the result of the key point detection, the method further includes:
s201, obtaining the field of view in a second video frame after a plurality of adjacent video frames adjacent to the first video frame.
In the process of capturing video, the field of view may shift; therefore, after the plurality of adjacent video frames it is necessary to detect in real time whether the field of view in the target video has changed.
The field of view in the video may be represented in a variety of ways, for example by calculating the average pixel value of a video frame. The representation of the field of view is not limited here.
S202, judging whether the field of view in the second video frame is the same as the field of view in the first video frame.
By checking whether the difference between the field of view in the second video frame and the field of view in the first video frame exceeds a preset value, it can be determined whether the two fields of view are the same.
S203, if not, re-executing target detection on the second video frame.
When the comparison shows that the field of view in the second video frame differs from that in the first video frame, the first target representation area obtained from the first video frame can no longer be used for the second video frame. At this point, target detection needs to be re-executed on the second video frame.
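As one concrete reading of steps S201 to S203, the field of view can be summarized by the average pixel value of a frame, as suggested above. The sketch below rests on that assumption; the threshold stands in for the preset value and is purely illustrative.

```python
import numpy as np

def view_changed(first_frame, second_frame, threshold=10.0):
    """Compare fields of view via mean pixel value (one possible measure).

    Frames are numpy arrays; `threshold` plays the role of the preset
    value of S202. Returns True when the view is judged to have shifted,
    in which case target detection is restarted on the second frame (S203).
    """
    diff = abs(float(np.mean(first_frame)) - float(np.mean(second_frame)))
    return diff > threshold
```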
According to a specific implementation manner of the embodiment of the present disclosure, after target detection is re-executed on the second video frame, the method further includes: setting a second target representation area matched with a target detection result of the second video frame, wherein the area of the second target representation area is larger than the actual area of the target detection result of the second video frame; performing key point detection on the second video frame and a plurality of adjacent video frames adjacent to the second video frame by using a preset key point model and the second target representation area; and determining a target object existing in the second video frame and the plurality of adjacent video frames adjacent to the second video frame based on the result of the key point detection. This manner is similar to steps S102 to S104 and will not be described here again.
According to a specific implementation manner of an embodiment of the present disclosure, the performing target detection on a first video frame in a target video to obtain one or more target detection results includes: detecting the objects existing in the first video frame by using a sliding window detector contained in a preset target detection model, so as to form one or more target detection results.
Referring to fig. 3, according to a specific implementation manner of the embodiment of the present disclosure, the detecting, by using a sliding window detector included in a preset target detection model, an object existing in the first video frame includes:
s301, acquiring a neural network model after training the target detection model through a training sample;
s302, scanning the first video frame through a window with a fixed size and a fixed step length, and sending an image in the window in the first video frame to a trained convolution network for detection;
s303, detecting whether an object exists or not and positioning of the object by changing the size of the scanning window.
Referring to fig. 4, according to a specific implementation manner of the embodiment of the present disclosure, the setting a first target representation area matched with the target detection result includes:
s401, acquiring an actual area of the target detection result on the first video frame;
s402, expanding preset multiples in the horizontal direction and the vertical direction by taking the center of the actual area as the center;
s403, taking the expanded area as the first target representation area.
According to a specific implementation manner of the embodiment of the present disclosure, the performing, by using a preset keypoint model and the first target representation area, keypoint detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame includes: acquiring an image in the target representation area; and executing the key point detection on the image objects in the target representation area based on the key point model to obtain a key point detection result.
According to a specific implementation manner of the embodiment of the present disclosure, the determining, based on the result of the keypoint detection, a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame includes: judging whether the key point detection result is matched with a preset target model or not; if yes, the key point detection result is used as a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame.
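One plausible realization of this matching step is to normalize both point sets and measure their average deviation, so that only shape, not scale or position, is compared. The normalization, distance measure and tolerance below are illustrative assumptions rather than the disclosure's prescribed method.

```python
import numpy as np

def matches_target_model(keypoints, model_keypoints, tolerance=0.1):
    """Decide whether detected key points fit the preset target model.

    Both inputs are (N, 2) point arrays. Each set is rescaled into a unit
    bounding box before comparison; `tolerance` is an illustrative bound
    on the mean point-to-point distance.
    """
    def normalize(points):
        points = np.asarray(points, dtype=float)
        mins = points.min(axis=0)
        span = np.maximum(points.max(axis=0) - mins, 1e-9)
        return (points - mins) / span

    a = normalize(keypoints)
    b = normalize(model_keypoints)
    if a.shape != b.shape:
        return False  # different numbers of key points cannot match
    return float(np.mean(np.linalg.norm(a - b, axis=1))) < tolerance
```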
Corresponding to the above method embodiment, referring to fig. 5, the disclosed embodiment further provides an object detection device 50, including:
a first detection module 501, configured to perform target detection on a first video frame in a target video to obtain one or more target detection results;
a setting module 502, configured to set a first target representation area that is matched with the target detection result, where an area of the first target representation area is greater than an actual area of the target detection result;
a second detection module 503, configured to detect key points of the first video frame and a plurality of adjacent video frames adjacent to the first video frame by using a preset key point model and the first target representation area;
a determining module 504, configured to determine, based on a result of the keypoint detection, a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame.
For the parts of this embodiment that are not described in detail, refer to the content of the above method embodiment; they are not repeated here.
Referring to fig. 6, an embodiment of the present disclosure also provides an electronic device 60, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of the foregoing method embodiments.
The disclosed embodiments also provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the target detection method in the foregoing method embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the object detection method in the foregoing method embodiments.
Referring now to fig. 6, a schematic diagram of an electronic device 60 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, the electronic device 60 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 60 are also stored. The processing device 601, the ROM602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 60 to communicate with other devices wirelessly or by wire to exchange data. While an electronic device 60 having various means is shown, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two internet protocol addresses; sending a node evaluation request comprising the at least two internet protocol addresses to node evaluation equipment, wherein the node evaluation equipment selects an internet protocol address from the at least two internet protocol addresses and returns the internet protocol address; receiving an Internet protocol address returned by the node evaluation equipment; wherein the acquired internet protocol address indicates an edge node in the content distribution network.
Alternatively, the computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: receiving a node evaluation request comprising at least two internet protocol addresses; selecting an internet protocol address from the at least two internet protocol addresses; returning the selected internet protocol address; wherein the received internet protocol address indicates an edge node in the content distribution network.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of detecting an object, comprising:
performing target detection on a first video frame in a target video to obtain one or more target detection results;
setting a first target representation area matched with the target detection result, wherein the area of the first target representation area is larger than the actual area of the target detection result;
performing key point detection on the first video frame and a plurality of adjacent video frames adjacent to the first video frame by using a preset key point model and the first target representation area;
determining a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame based on the key point detection result, wherein the key point detection result is the shape of each object existing in a given video frame and is used for screening, from the existing objects, an object having the same shape as the target object;
after determining the target object existing in the first video frame and the plurality of adjacent video frames adjacent to the first video frame based on the result of the key point detection, the method further includes:
acquiring a field of view in a second video frame subsequent to the plurality of adjacent video frames adjacent to the first video frame;
judging whether the field of view in the second video frame is the same as the field of view in the first video frame;
and if not, re-executing target detection on the second video frame.
2. The method of claim 1, wherein after restarting object detection for the second video frame, the method further comprises:
setting a second target representation area matched with a target detection result of the second video frame, wherein the area of the second target representation area is larger than the actual area of the target detection result of the second video frame;
performing key point detection on the second video frame and a plurality of adjacent video frames adjacent to the second video frame by using a preset key point model and the second target representation area;
and determining a target object existing in the second video frame and a plurality of adjacent video frames adjacent to the second video frame based on the result of the key point detection.
3. The method of claim 1, wherein performing target detection on a first video frame in the target video to obtain one or more target detection results comprises:
and detecting the object existing in the first video frame by utilizing a sliding window detector contained in a preset target detection model, so as to form one or more target detection results.
4. A method according to claim 3, wherein detecting the object present in the first video frame using a sliding window detector included in a predetermined object detection model comprises:
acquiring a neural network model obtained by training the target detection model on training samples;
scanning the first video frame with a window of fixed size and fixed step length, and sending the image inside the window into the trained convolutional network for detection;
and detecting the presence or absence of an object, and the position of the object, by changing the size of the scanning window.
5. The method of claim 1, wherein the setting a first target representation area that matches the target detection result comprises:
acquiring an actual area of the target detection result on the first video frame;
expanding the area by a preset multiple in the horizontal direction and the vertical direction respectively, with the center of the actual area as the center;
and taking the expanded area as the first target representation area.
6. The method of claim 1, wherein performing keypoint detection on the first video frame and a plurality of neighboring video frames neighboring the first video frame using a preset keypoint model and the first target representation area comprises:
acquiring an image in the target representation area;
and executing the key point detection on the image objects in the target representation area based on the key point model to obtain a key point detection result.
7. The method of claim 1, wherein the determining, based on the results of the keypoint detection, a target object present in the first video frame and a plurality of neighboring video frames neighboring the first video frame comprises:
judging whether the key point detection result is matched with a preset target model or not;
if yes, the key point detection result is used as a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame.
8. An object detection apparatus, comprising:
a first detection module, configured to perform target detection on a first video frame in a target video to obtain one or more target detection results;
the setting module is used for setting a first target representation area matched with the target detection result, and the area of the first target representation area is larger than the actual area of the target detection result;
the second detection module is used for detecting key points of the first video frame and a plurality of adjacent video frames adjacent to the first video frame by utilizing a preset key point model and the first target representation area;
a determining module, configured to determine, based on a result of the key point detection, a target object existing in the first video frame and a plurality of adjacent video frames adjacent to the first video frame, wherein the result of the key point detection is the shape of each object existing in a given video frame and is used for screening, from the existing objects, an object having the same shape as the target object; wherein, after the target object existing in the first video frame and the plurality of adjacent video frames adjacent to the first video frame is determined based on the result of the key point detection, the apparatus is further configured to: acquire a field of view in a second video frame subsequent to the plurality of adjacent video frames adjacent to the first video frame; judge whether the field of view in the second video frame is the same as the field of view in the first video frame; and if not, re-execute target detection on the second video frame.
9. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of any one of the preceding claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the target detection method of any one of the preceding claims 1-7.
CN202010052654.4A 2020-01-17 2020-01-17 Target detection method and device and electronic equipment Active CN111222509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010052654.4A CN111222509B (en) 2020-01-17 2020-01-17 Target detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010052654.4A CN111222509B (en) 2020-01-17 2020-01-17 Target detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111222509A CN111222509A (en) 2020-06-02
CN111222509B 2023-08-18

Family

ID=70829553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010052654.4A Active CN111222509B (en) 2020-01-17 2020-01-17 Target detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111222509B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347875A (en) * 2020-10-26 2021-02-09 Tsinghua University Edge cooperative target detection method and device based on region division
CN112464898A (en) * 2020-12-15 2021-03-09 Beijing SenseTime Technology Development Co Ltd Event detection method and device, electronic equipment and storage medium
CN113345000A (en) * 2021-06-28 2021-09-03 Beijing SenseTime Technology Development Co Ltd Depth detection method and device, electronic equipment and storage medium
CN113505760B (en) * 2021-09-08 2021-12-21 iFlytek (Suzhou) Technology Co Ltd Target detection method, device, related equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104012072A (en) * 2011-12-30 2014-08-27 Intel Corporation Object detection using motion estimation
US9390506B1 (en) * 2015-05-07 2016-07-12 Aricent Holdings Luxembourg S.A.R.L. Selective object filtering and tracking
CN107798328A (en) * 2016-08-30 2018-03-13 Hefei Junzheng Technology Co Ltd A kind of destination object searching method and device
WO2018202089A1 (en) * 2017-05-05 2018-11-08 SenseTime Group Ltd Key point detection method and device, storage medium and electronic device
CN109584276A (en) * 2018-12-04 2019-04-05 Beijing ByteDance Network Technology Co Ltd Critical point detection method, apparatus, equipment and readable medium
CN110032978A (en) * 2019-04-18 2019-07-19 Beijing ByteDance Network Technology Co Ltd Method and apparatus for handling video
CN110177256A (en) * 2019-06-17 2019-08-27 Beijing Moviebook Technology Co Ltd A kind of tracking video data acquisition methods and device
CN110414514A (en) * 2019-07-31 2019-11-05 Beijing ByteDance Network Technology Co Ltd Image processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135514B2 (en) * 2010-05-21 2015-09-15 Qualcomm Incorporated Real time tracking/detection of multiple targets


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Key frame extraction method for vehicle surveillance video based on visual saliency; Zhong Mengjie et al.; Computer Technology and Development; Vol. 29, No. 6; pp. 164-169 *

Also Published As

Publication number Publication date
CN111222509A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
CN111222509B (en) Target detection method and device and electronic equipment
CN112101305B (en) Multi-path image processing method and device and electronic equipment
CN110796664B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN110335313B (en) Audio acquisition equipment positioning method and device and speaker identification method and system
CN110287810B (en) Vehicle door motion detection method, device and computer readable storage medium
EP4246956A1 (en) Photographing mode determination method and apparatus, and electronic device and storage medium
WO2022105622A1 (en) Image segmentation method and apparatus, readable medium, and electronic device
CN111191556A (en) Face recognition method and device and electronic equipment
CN111626990B (en) Target detection frame processing method and device and electronic equipment
CN111914784B (en) Method and device for detecting intrusion of trackside obstacle in real time and electronic equipment
CN110555861B (en) Optical flow calculation method and device and electronic equipment
CN111401229B (en) Automatic labeling method and device for small visual targets and electronic equipment
CN110740315B (en) Camera correction method and device, electronic equipment and storage medium
CN110378936B (en) Optical flow calculation method and device and electronic equipment
CN111832354A (en) Target object age identification method and device and electronic equipment
CN110751120A (en) Detection method and device and electronic equipment
CN110619597A (en) Semitransparent watermark removing method and device, electronic equipment and storage medium
CN115086541B (en) Shooting position determining method, device, equipment and medium
CN115086538B (en) Shooting position determining method, device, equipment and medium
CN111681267A (en) Track anti-intrusion method based on image recognition
CN111738311A (en) Multitask-oriented feature extraction method and device and electronic equipment
CN114359673B (en) Small sample smoke detection method, device and equipment based on metric learning
CN110991312A (en) Method, apparatus, electronic device, and medium for generating detection information
CN111401224B (en) Target detection method and device and electronic equipment
CN113256659B (en) Picture processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant