CN109598743B - Pedestrian target tracking method, device and equipment


Info

Publication number
CN109598743B
Authority
CN
China
Prior art keywords
pedestrian
target
neural network
determining
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811386432.5A
Other languages
Chinese (zh)
Other versions
CN109598743A (en
Inventor
Che Guangfu (车广富)
Dong Yuxin (董玉新)
An Shan (安山)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811386432.5A
Publication of CN109598743A
Application granted
Publication of CN109598743B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/292 - Multi-camera tracking
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a pedestrian target tracking method, a device and equipment. The method comprises the following steps: acquiring continuous video images shot by a plurality of cameras, wherein each camera is preset with an optimal shooting view field; determining a pedestrian target in the continuous video images collected by each camera by adopting a convolutional neural network model for target pedestrian detection; and matching and tracking the pedestrian target in the continuous video images acquired by the plurality of cameras. The method of the embodiments of the invention realizes accurate tracking of a pedestrian target across cameras.

Description

Pedestrian target tracking method, device and equipment
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a pedestrian target tracking method, device and equipment.
Background
With the rapid development of the technical fields of computer vision, machine learning, big data analysis, artificial intelligence and the like, various intelligent applications, products, services and the like based on computer vision are rapidly developed, such as unmanned stores, unmanned restaurants, intelligent security and the like, and convenience is brought to the life of people.
Identification and tracking of targets is particularly important in various computer vision-based applications. Taking the unmanned store as an example, in order to accurately recommend products to a user and accurately analyze the user's purchasing behavior and consumption habits, the target user in the store needs to be tracked accurately in real time. To avoid blind spots, an unmanned store generally uses a plurality of cameras to acquire video data, and overlapping fields of view inevitably exist between adjacent cameras. When a target user enters an overlapping area, the user must be re-identified and re-tracked; because the cameras have different view coordinate systems and the user's appearance and pose change easily, existing methods cannot meet the requirements of cross-camera target tracking.
Disclosure of Invention
The embodiment of the invention provides a pedestrian target tracking method, a device and equipment, which are used for realizing cross-camera pedestrian target tracking.
In a first aspect, an embodiment of the present invention provides a pedestrian target tracking method, including:
acquiring continuous video images shot by a plurality of cameras, wherein each camera is preset with an optimal shooting view field;
determining a pedestrian target in the continuous video images collected by each camera by adopting a convolutional neural network model for target pedestrian detection;
and matching and tracking the pedestrian target in the continuous video images acquired by the plurality of cameras.
In one possible implementation, before acquiring consecutive video images captured by a plurality of cameras, the method further includes:
and performing field division on cameras with overlapped fields of view in the plurality of cameras to determine the optimal shooting field of view of each camera, so that the same pedestrian target is only positioned in the optimal shooting field of view of one camera.
In one possible implementation, before determining the pedestrian target in the continuous video images acquired by each camera using the convolutional neural network model for target pedestrian detection, the method further includes:
acquiring a plurality of video image samples, wherein the video image samples comprise position information of a pedestrian target in the video image samples;
and training the pedestrian target head detection model according to the plurality of video image samples to obtain a neural network model.
In one possible implementation, training a pedestrian target head detection model according to a plurality of video image samples includes:
and training the pedestrian target head detection model by using a transfer learning algorithm.
In a possible implementation manner, determining a pedestrian target in a continuous video image acquired by each camera by using a convolutional neural network model for target pedestrian detection specifically includes:
determining the position of the head of the pedestrian in each video image by adopting a neural network model;
and determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image.
In one possible implementation, the neural network model includes a global neural network model and a local neural network model;
the global neural network model is used for determining the head position of the pedestrian in the global range of the video image;
the local neural network model is used for determining the head position of the pedestrian within a preset range of the determined position of the previous frame.
In one possible implementation, determining the position of the head of the pedestrian in each video image using a neural network model includes:
determining a first position set of the head of the pedestrian in each video image by adopting a global neural network model;
determining a second set of positions of the pedestrian's head in each video image using a local neural network model;
determining the position of a pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image, wherein the method comprises the following steps:
determining a first target position set of each video image according to the first position set of the video image and the optimal shooting visual field corresponding to the video image;
and determining a second target position set of each video image according to the second position set of the video image and the optimal shooting visual field corresponding to the video image.
In one possible implementation, matching and tracking a pedestrian target in consecutive video images acquired by a plurality of cameras includes:
and matching and tracking the pedestrian target in the continuous video images acquired by each camera by using a Hungarian algorithm.
In a possible implementation manner, the pedestrian target in the continuous video images acquired by each camera is matched and tracked by using the hungarian algorithm, and the method specifically comprises the following steps:
matching and tracking the pedestrian target by using a Hungarian algorithm according to the similarity of position frames in a first target position set and a second target position set of the video images acquired by each camera;
the similarity of the position frames is determined according to the following formula:
$$\mathrm{IoU}(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}, \quad i = 1, 2, \dots, N,\ j = 1, 2, \dots, M$$
where $O_i$ represents the ith position box in the first set of target positions, $N$ represents the total number of position boxes in the first set of target positions, $O_j$ represents the jth position box in the second set of target positions, and $M$ represents the total number of position boxes in the second set of target positions.
In a possible implementation manner, matching and tracking a pedestrian target in continuous video images acquired by a plurality of cameras further includes:
if the unmatched position frames in the first target position set meet a first preset condition, determining that the pedestrian corresponding to the position frame just enters the best shooting view field corresponding to the first target position set;
if the position frame in the second target position set meets a second preset condition, determining that the pedestrian corresponding to the position frame leaves the best shooting view corresponding to the second target position set;
the first preset condition and the second preset condition are determined according to the times at which the pedestrian leaves and enters the optimal shooting fields of view and the trajectory distance between them.
In a second aspect, an embodiment of the present invention provides a pedestrian target tracking device, including:
the acquisition module is used for acquiring continuous video images shot by a plurality of cameras, and each camera is preset with an optimal shooting view;
the determining module is used for determining the pedestrian target in the continuous video images acquired by each camera by adopting a convolutional neural network model for target pedestrian detection;
and the tracking module is used for matching and tracking the pedestrian target in the continuous video images acquired by the cameras.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the pedestrian target tracking method of any one of the first aspects.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-readable storage medium is configured to implement the pedestrian target tracking method according to any one of the first aspect.
According to the pedestrian target tracking method, device and equipment provided by the embodiments of the invention, continuous video images shot by a plurality of cameras are obtained, with an optimal shooting field of view preset for each camera; a convolutional neural network model for target pedestrian detection determines the pedestrian target in the continuous video images collected by each camera; and the pedestrian targets in the continuous video images collected by the plurality of cameras are matched and tracked, so that accurate tracking of the pedestrian target across cameras is realized. Because the optimal shooting field of view is preset for each camera, the fields of view of the cameras do not overlap, re-identification and re-tracking of pedestrian targets in an overlapping area are avoided, the computational cost is reduced, and the tracking speed is increased.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart of an embodiment of a pedestrian target tracking method provided by the present invention;
FIG. 2 is a schematic diagram of an optimal shooting view according to an embodiment of the present invention;
FIG. 3 is a diagram of a video image sample according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a sample fabrication of a local neural network model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a pedestrian target tracking device according to the present invention;
fig. 6 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
The above drawings illustrate certain embodiments of the invention, which are described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate it for those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The terms "first" and "second" in the present application are used for identification purposes only and are not to be construed as indicating or implying a sequential relationship, relative importance, or implicitly indicating the number of technical features indicated. "plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
Fig. 1 is a flowchart of an embodiment of a pedestrian target tracking method according to the present invention. The method provided by this embodiment can be executed by a terminal device or by a network-side device such as a server. As shown in fig. 1, the method for tracking a pedestrian target provided by this embodiment may include:
s101, acquiring continuous video images shot by a plurality of cameras, wherein each camera is preset with an optimal shooting view field.
The continuous video images in this embodiment can be acquired by a plurality of cameras installed in a scene in which a pedestrian target needs to be tracked, and together the cameras can capture video of the scene without blind spots. For example, to achieve blind-spot-free acquisition, cameras can be installed at several different positions in an unmanned store for video acquisition, or at several different positions in a monitored area for video surveillance in intelligent security.
In order to avoid overlapping fields of view between adjacent cameras, an optimal shooting field of view can be preset for each camera. The optimal shooting field of view can be determined according to the installation positions of the cameras and the network topology those positions form, and the optimal shooting fields of view of the cameras do not overlap.
S102, determining the pedestrian target in the continuous video image collected by each camera by adopting a convolutional neural network model for detecting the target pedestrian.
In this embodiment, for each frame of the continuous video images acquired by each camera, a convolutional neural network model for detecting pedestrians is adopted to determine a pedestrian target. The convolutional neural network model needs to be trained in advance according to the application scenario. For example, if the camera is installed at a high position in the scene and captures a top view, a pedestrian target can be determined from head information, so the convolutional neural network model can be trained on the head information in top-view video; if the camera is installed at a medium height and captures a head-on view, a pedestrian target can be determined from body shape and face information, so the convolutional neural network model can be trained on the body shape and face information in head-on video.
In this embodiment, the pedestrian target in the continuous video image acquired by each camera may be the pedestrian target in the optimal shooting view field of the camera, that is, the pedestrian target may be determined only for the optimal shooting view field of the camera. The number of pedestrian targets is not limited in the embodiment, and the pedestrian targets may be a single pedestrian target or a plurality of pedestrian targets.
S103, matching and tracking pedestrian targets in the continuous video images collected by the multiple cameras.
In this embodiment, the pedestrian targets in the continuous video images collected by the multiple cameras are matched and tracked, and the pedestrian targets crossing the cameras can be determined and tracked by matching the pedestrian targets in the optimal shooting fields of the cameras.
According to the pedestrian target tracking method provided by this embodiment, continuous video images shot by a plurality of cameras are obtained, with an optimal shooting field of view preset for each camera; a convolutional neural network model for target pedestrian detection determines the pedestrian target in the continuous video images collected by each camera; and the pedestrian targets in the continuous video images collected by the plurality of cameras are matched and tracked, realizing accurate tracking of the pedestrian target across cameras. Because the optimal shooting field of view is preset for each camera, the fields of view of the cameras do not overlap, re-identification and re-tracking of pedestrian targets in an overlapping area are avoided, the computational cost is reduced, and the tracking speed is increased.
In some embodiments, on the basis of the above embodiments, before acquiring the continuous video images captured by the multiple cameras, the method provided by this embodiment may further include: and performing field division on cameras with overlapped fields of view in the plurality of cameras to determine the optimal shooting field of view of each camera, so that the same pedestrian target is only positioned in the optimal shooting field of view of one camera.
In this embodiment, the optimal shooting view of each camera can be determined according to a network topology structure formed by the installation positions of the cameras and the installation positions of the multiple cameras. Taking an application in an unmanned shop as an example, each camera may be set to a top view angle. Fig. 2 is a schematic diagram of an optimal shooting view according to an embodiment of the present invention. As shown in fig. 2, a solid-line rectangular frame in the figure may represent an indoor sectional view of an unmanned store in which four cameras numbered 1, 2, 3, and 4 are installed, and the specific positions thereof are as shown in fig. 2, and a dotted line in the figure is a boundary line dividing the best shooting fields of the four cameras. The closed region formed by the broken line and the solid line is the best shooting view field of each camera. As shown in fig. 2, the optimal photographing fields of view of the four cameras are independent of each other, and there is no overlapping area, and therefore, the same pedestrian target can be located in the optimal photographing field of view of only one camera.
It should be noted that fig. 2 only shows one optimal shooting view division manner, and in practical applications, the optimal shooting view may be divided differently according to specific situations such as the shape of the field, the number of cameras, and the network topology of the cameras.
In some embodiments, on the basis of the above embodiments, the method provided by this embodiment may further include, before determining the pedestrian target in the continuous video images acquired by each camera by using a convolutional neural network model for target pedestrian detection: acquiring a plurality of video image samples, wherein the video image samples comprise position information of a pedestrian target in the video image samples; and training the pedestrian target head detection model according to the plurality of video image samples to obtain a neural network model.
The video image samples in the present embodiment are video image samples in which the position information of a pedestrian target has been marked. For a top view scene, the position information of the pedestrian object may be identified by the head position of the pedestrian.
Optionally, the position information of the pedestrian target in the video image sample may be represented by a position frame, and the position frame may be in a shape of rectangle, circle, ellipse, or the like.
Fig. 3 is a schematic diagram of a video image sample according to an embodiment of the invention. As shown in fig. 3, in the top view scene, the position information of the pedestrian is represented by a circumscribed rectangle at the position of the head of the pedestrian. The size of the circumscribed rectangle that identifies different pedestrian location information may be different.
The specific process of training the pedestrian target head detection model from the video image samples in this embodiment may be as follows: each video image frame is used as the input of the model, the labeled position information is used as the expected output, and a loss function is designed over the labeled position information and the positions output by the model; the model is then trained iteratively, for example by back-propagating the error through the convolutional neural network with the loss value as the guide. The pedestrian target head detection model in this embodiment may be, for example, SSD, YOLO or Faster R-CNN.
In this embodiment, the pedestrian target head detection model is trained with head position information; because the head is rarely occluded and its shape changes little, the trained model can determine the pedestrian target more accurately.
Optionally, an implementation manner of training the pedestrian target head detection model according to the plurality of video image samples may be: and training the pedestrian target head detection model by using a transfer learning algorithm.
In this embodiment, pedestrian head samples can be collected and labeled for a specific application scenario, and the pedestrian target head detection model is fine-tuned by using the transfer learning algorithm, which greatly improves the model's adaptability to that scenario.
Training the pedestrian target head detection model with a transfer learning algorithm in this embodiment not only speeds up model training and reduces the number of training samples required, but also yields a model that better fits the scene and can provide more accurate position information.
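As a hedged illustration of the training and transfer-learning steps above, the following Python sketch fine-tunes a COCO-pre-trained detector for head detection; torchvision's Faster R-CNN is used as a stand-in for the pedestrian target head detection model, and the dataset interface, epoch count and learning rate are assumptions for illustration, not values from this embodiment.

```python
# Hedged sketch: fine-tuning a pre-trained detector for pedestrian-head
# detection via transfer learning. torchvision's Faster R-CNN stands in for
# the patent's head detection model; epochs and learning rate are assumed.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_head_detector(num_classes: int = 2):  # background + "head"
    # Start from COCO-pre-trained weights (the transfer-learning step).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box predictor so the network outputs the new classes.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def fine_tune(model, data_loader, epochs: int = 10, lr: float = 5e-3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).train()
    # Only optimize trainable parameters; the backbone could also be frozen
    # to speed up training on a small head-labelled sample set.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for images, targets in data_loader:  # targets: labeled head boxes
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)  # detector returns its losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```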
In some embodiments, one implementation of using a convolutional neural network model of target pedestrian detection to determine pedestrian targets in successive video images acquired by each camera may be:
determining the position of the head of the pedestrian in each video image by adopting a neural network model;
and determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image.
In this embodiment, one frame of the video image is used as an input of the neural network model, and then the positions of all the heads of the pedestrians included in the video image are output.
Taking the application scenario and the optimal shooting fields of view shown in fig. 2 as an example, if the neural network model determines that three pedestrians A, B and C exist in the video image acquired by camera 1, and the three pedestrians are located in the optimal shooting fields of view of camera 1, camera 2 and camera 4 respectively, then filtering with the optimal shooting field of view of camera 1 determines the position of the pedestrian target in the video image acquired by camera 1 to be the position of pedestrian target A.
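The field filtering described above can be sketched as a point-in-polygon test. In the following hedged Python example, the shapely library stands in for whatever geometry routine an implementation would use; the polygon coordinates, helper names and detections are invented for illustration.

```python
# Hedged sketch: keep only the head detections whose centers fall inside a
# camera's pre-divided optimal shooting field of view. The coordinates and
# names below are illustrative assumptions, not data from the patent.
from shapely.geometry import Point, Polygon

def filter_by_optimal_field(boxes, field: Polygon):
    """boxes: list of (x1, y1, x2, y2) head position boxes."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        center = Point((x1 + x2) / 2.0, (y1 + y2) / 2.0)
        if field.contains(center):  # target lies in this camera's field
            kept.append((x1, y1, x2, y2))
    return kept

# Example: camera 1's optimal field as one closed region (assumed shape).
camera1_field = Polygon([(0, 0), (500, 0), (500, 400), (0, 400)])
detections = [(80, 60, 120, 100), (620, 300, 660, 340)]  # A inside, B outside
print(filter_by_optimal_field(detections, camera1_field))  # -> only A's box
```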
Alternatively, the neural network model may include a global neural network model and a local neural network model. The global neural network model is used for determining the head position of the pedestrian in the global range of the video image; the local neural network model is used for determining the head position of the pedestrian within a preset range of the determined position of the previous frame.
Optionally, both the global neural network model and the local neural network model need to be trained by using pre-labeled samples. For the global neural network model, the complete video image marked with the pedestrian position can be adopted for training, and for the local neural network model, the local video image marked with the pedestrian position can be adopted for training.
In order to avoid problems such as missed detections and false detections caused by background pixels occupying a large proportion of the training samples, this embodiment uses the detection position of the previous frame: the most probable head area is searched within a local range around that position, and this area replaces the detection position; similarly, in the next frame, the area obtained in the previous frame serves as the detection position, and the optimal position is searched within its local range, completing the position update. Fig. 4 is a schematic diagram illustrating sample production for the local neural network model according to an embodiment of the present invention. As shown in fig. 4, the search range can be obtained by expanding the circumscribed rectangle of the pedestrian target determined in the previous frame outward by a factor of 2, and a preset number of position frames that still cover the circumscribed rectangle of the pedestrian target can be randomly selected as training samples for the local neural network model.
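The sample-production step of fig. 4 can be sketched as follows; the expansion factor of 2 matches the description above, while the number of samples and the exact sampling scheme are illustrative assumptions.

```python
# Hedged sketch of the local-search-region construction: expand the previous
# frame's circumscribed rectangle outward by a factor of 2 (clipped to the
# image), then randomly sample candidate boxes that still cover the target.
import random

def local_search_region(box, img_w, img_h, scale=2.0):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, cx - w / 2), max(0, cy - h / 2),
            min(img_w, cx + w / 2), min(img_h, cy + h / 2))

def sample_local_boxes(box, region, num_samples=16, seed=None):
    rng = random.Random(seed)
    bx1, by1, bx2, by2 = box
    rx1, ry1, rx2, ry2 = region
    samples = []
    for _ in range(num_samples):
        # Each corner is drawn between the search region's edge and the
        # target box's edge, so every sample still covers the target box.
        samples.append((rng.uniform(rx1, bx1), rng.uniform(ry1, by1),
                        rng.uniform(bx2, rx2), rng.uniform(by2, ry2)))
    return samples
```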
In some embodiments, one implementation of using a neural network model to determine the position of the pedestrian's head in each video image may be:
determining a first position set of the head of the pedestrian in each video image by adopting a global neural network model;
determining a second set of positions of the pedestrian's head in each video image using a local neural network model;
one implementation of determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting view corresponding to the video image may be:
determining a first target position set of each video image according to the first position set of the video image and the optimal shooting visual field corresponding to the video image;
and determining a second target position set of each video image according to the second position set of the video image and the optimal shooting visual field corresponding to the video image.
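Putting the two models together, one hedged per-frame sketch (reusing the local_search_region and filter_by_optimal_field helpers sketched earlier, and assuming each model exposes a detect interface, which this embodiment does not specify) could look like:

```python
# Hedged sketch of producing the first and second target position sets for
# one frame: the global model scans the whole image, the local model only
# searches around each position determined in the previous frame, and both
# raw position sets are filtered by the camera's optimal shooting field.
def detect_frame(frame, prev_positions, global_model, local_model, field):
    # First position set: global detection over the full video image.
    first_set = global_model.detect(frame)
    # Second position set: local detection near each previous-frame position
    # (frame is assumed to be an H x W x C array).
    second_set = []
    for box in prev_positions:
        region = local_search_region(box, frame.shape[1], frame.shape[0])
        second_set.extend(local_model.detect(frame, region))
    # First/second *target* position sets: keep boxes in the optimal field.
    return (filter_by_optimal_field(first_set, field),
            filter_by_optimal_field(second_set, field))
```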
In some embodiments, one implementation of matching and tracking pedestrian targets in consecutive video images acquired by multiple cameras may be:
and matching and tracking the pedestrian target in the continuous video images acquired by each camera by using a Hungarian algorithm.
The Hungarian algorithm is based on Hall's theorem and, by searching for augmenting paths, can realize rapid and accurate matching and tracking of pedestrian targets in the continuous video images acquired by each camera.
Optionally, the hungarian algorithm is used for matching and tracking the pedestrian target in the continuous video images acquired by each camera, and specifically, the method may include:
matching and tracking the pedestrian target by using a Hungarian algorithm according to the similarity of position frames in a first target position set and a second target position set of the video images acquired by each camera;
the similarity of the position frames is determined according to the following formula:
$$\mathrm{IoU}(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}, \quad i = 1, 2, \dots, N,\ j = 1, 2, \dots, M$$
where $O_i$ represents the ith position box in the first set of target positions, $N$ represents the total number of position boxes in the first set of target positions, $O_j$ represents the jth position box in the second set of target positions, and $M$ represents the total number of position boxes in the second set of target positions. $|O_i \cap O_j|$ denotes the area of the intersection of the ith and jth position boxes, and $|O_i \cup O_j|$ denotes the area of their union.
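A hedged sketch of this matching step in Python: the position-frame similarity is the IoU defined by the formula above, and scipy's implementation of the Hungarian algorithm performs the assignment. The minimum-similarity gate is an assumed illustrative value, not part of this embodiment.

```python
# Hedged sketch: build an IoU similarity matrix between the first and second
# target position sets and solve the assignment with the Hungarian algorithm
# (linear_sum_assignment minimizes cost, so the cost is 1 - IoU).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) position boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_targets(first_set, second_set, min_iou=0.3):
    if not first_set or not second_set:
        return [], set(range(len(first_set))), set(range(len(second_set)))
    sim = np.array([[iou(oi, oj) for oj in second_set] for oi in first_set])
    rows, cols = linear_sum_assignment(1.0 - sim)  # Hungarian algorithm
    matches = [(i, j) for i, j in zip(rows, cols) if sim[i, j] >= min_iou]
    unmatched_first = set(range(len(first_set))) - {i for i, _ in matches}
    unmatched_second = set(range(len(second_set))) - {j for _, j in matches}
    return matches, unmatched_first, unmatched_second
```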
In a possible implementation manner, matching and tracking a pedestrian target in continuous video images acquired by a plurality of cameras further includes:
if the unmatched position frames in the first target position set meet a first preset condition, determining that the pedestrian corresponding to the position frame just enters the best shooting view field corresponding to the first target position set;
if the position frame in the second target position set meets a second preset condition, determining that the pedestrian corresponding to the position frame leaves the best shooting view corresponding to the second target position set;
the first preset condition and the second preset condition are determined according to the moment when the pedestrian leaves and enters the best shooting visual field and the track interval. The pedestrian moving speed can be determined according to the moment and the track interval of the occurrence of the event that the pedestrian leaves and enters the optimal shooting visual field, and the first preset condition and the second preset condition are required to ensure that the speed is within a reasonable range.
Taking the application scenario and the optimal shooting fields of view shown in fig. 2 as an example, if pedestrian target A moves from the optimal shooting field of view of camera 1 into that of camera 2, an unmatched position frame will appear in the first target position set of camera 2. If the moving speed of A, determined from the time A leaves the optimal shooting field of view of camera 1, the time A enters the optimal shooting field of view of camera 2, and A's trajectory distance, is reasonable, it can be determined that the pedestrian A corresponding to that position frame has just entered the optimal shooting field of view of camera 2.
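The speed-plausibility check implied by the first and second preset conditions can be sketched as follows; the speed bounds are assumed illustrative values, since this embodiment only requires the speed to lie within a reasonable range.

```python
# Hedged sketch of the preset conditions: an unmatched "enter" event is
# accepted only if the walking speed implied by the leave time/position in
# one optimal field and the enter time/position in another is plausible.
import math

MAX_WALK_SPEED = 2.5   # meters per second, assumed upper bound
MIN_WALK_SPEED = 0.05  # assumed lower bound, rules out spurious matches

def plausible_transition(leave_xy, leave_t, enter_xy, enter_t):
    dt = enter_t - leave_t
    if dt <= 0:
        return False
    dist = math.hypot(enter_xy[0] - leave_xy[0], enter_xy[1] - leave_xy[1])
    speed = dist / dt  # trajectory distance over elapsed time
    return MIN_WALK_SPEED <= speed <= MAX_WALK_SPEED

# Example: pedestrian A leaves camera 1's field and appears in camera 2's
# field 1.2 s later, 2.0 m away -> speed of about 1.7 m/s, plausible.
print(plausible_transition((0.0, 0.0), 10.0, (1.6, 1.2), 11.2))  # True
```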
Fig. 5 is a schematic structural diagram of an embodiment of a pedestrian target tracking device provided in the present invention. As shown in fig. 5, the pedestrian target tracking apparatus 50 provided by the present embodiment may include: an acquisition module 501, a determination module 502 and a tracking module 503.
The acquiring module 501 is configured to acquire continuous video images captured by multiple cameras, where each camera has a preset optimal capturing view.
A determining module 502, configured to determine a pedestrian target in the continuous video images acquired by each camera by using a convolutional neural network model for target pedestrian detection.
And the tracking module 503 is configured to perform matching tracking on pedestrian targets in the continuous video images acquired by the multiple cameras.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
Optionally, the pedestrian target tracking apparatus 50 further includes a dividing module, configured to perform field division on cameras with overlapped fields of view in the multiple cameras to determine an optimal shooting field of view of each camera before acquiring the continuous video images shot by the multiple cameras, so that the same pedestrian target is located in the optimal shooting field of view of only one camera.
Optionally, the pedestrian target tracking apparatus 50 further includes a training module, configured to obtain a plurality of video image samples before determining a pedestrian target in the continuous video images acquired by each camera by using a convolutional neural network model for target pedestrian detection, where the video image samples include position information of the pedestrian target in the video image samples; and training the pedestrian target head detection model according to the plurality of video image samples to obtain a neural network model.
Optionally, one implementation manner of training the pedestrian target head detection model according to the plurality of video image samples may be to train the pedestrian target head detection model by using a transfer learning algorithm.
Optionally, the determining module 502 may be specifically configured to determine the position of the head of the pedestrian in each video image by using a neural network model; and determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image.
Optionally, the neural network model may include a global neural network model and a local neural network model; the global neural network model is used for determining the head position of the pedestrian in the global range of the video image; the local neural network model is used for determining the head position of the pedestrian within a preset range of the determined position of the previous frame.
Optionally, the determining module 502 may be specifically configured to,
determining a first position set of the head of the pedestrian in each video image by adopting a global neural network model; determining a second set of positions of the pedestrian's head in each video image using a local neural network model;
determining a first target position set of each video image according to the first position set of the video image and the optimal shooting visual field corresponding to the video image; and determining a second target position set of each video image according to the second position set of the video image and the optimal shooting visual field corresponding to the video image.
Optionally, the tracking module 503 may be specifically configured to perform matching tracking on the pedestrian target in the continuous video images acquired by each camera by using the hungarian algorithm.
Optionally, the tracking module 503 may be specifically configured to perform matching tracking on the pedestrian target according to similarity between position frames in the first target position set and the second target position set of the video image acquired by each camera by using a hungarian algorithm;
the similarity of the position frames is determined according to the following formula:
$$\mathrm{IoU}(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}, \quad i = 1, 2, \dots, N,\ j = 1, 2, \dots, M$$
where $O_i$ represents the ith position box in the first set of target positions, $N$ represents the total number of position boxes in the first set of target positions, $O_j$ represents the jth position box in the second set of target positions, and $M$ represents the total number of position boxes in the second set of target positions. $|O_i \cap O_j|$ denotes the area of the intersection of the ith and jth position boxes, and $|O_i \cup O_j|$ denotes the area of their union.
Optionally, the tracking module 503 may be further specifically configured to, if the unmatched position frame in the first target position set meets a first preset condition, determine that a pedestrian corresponding to the position frame just enters the best shooting view corresponding to the first target position set;
if the position frame in the second target position set meets a second preset condition, determining that the pedestrian corresponding to the position frame leaves the best shooting view corresponding to the second target position set;
the first preset condition and the second preset condition are determined according to the moment when the pedestrian leaves and enters the best shooting visual field and the track interval.
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention. The electronic device provided by this embodiment includes, but is not limited to, a computer, a single server, a server group composed of a plurality of servers, or a cloud composed of a large number of computers or servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers acts as a single super virtual computer. As shown in fig. 6, the electronic device 60 may include:
at least one processor 602 and memory 606;
the memory 606 stores computer-executable instructions;
the at least one processor 602 executes the computer-executable instructions stored by the memory 606, causing the at least one processor 602 to perform the pedestrian target tracking method described above.
For a specific implementation process of the processor 602, reference may be made to the above-mentioned method embodiment of the pedestrian target tracking method, which has similar implementation principle and technical effect, and this embodiment is not described herein again. The processor 602 and the memory 606 may be connected by a bus 603.
The embodiment of the present invention further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-readable storage medium is configured to implement any one of the above pedestrian target tracking methods.
In the above embodiments, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in a terminal or server.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A pedestrian target tracking method, comprising:
acquiring continuous video images shot by a plurality of cameras, wherein each camera is preset with an optimal shooting view field;
determining a pedestrian target in the continuous video images collected by each camera by adopting a convolutional neural network model for target pedestrian detection;
matching and tracking pedestrian targets in continuous video images acquired by a plurality of cameras;
the method for determining the pedestrian target in the continuous video image acquired by each camera by adopting the convolutional neural network model for target pedestrian detection specifically comprises the following steps:
determining the position of the head of the pedestrian in each video image by adopting the neural network model;
determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image;
the neural network model comprises a global neural network model and a local neural network model;
the global neural network model is used for determining the position of the head of the pedestrian in the global range of the video image;
the local neural network model is used for determining the head position of the pedestrian within a preset range of the position determined in the previous frame;
before the continuous video images shot by the plurality of cameras are obtained, the method further comprises:
and performing field division on cameras with overlapped fields of view in the plurality of cameras to determine the optimal shooting field of view of each camera, so that the same pedestrian target is only positioned in the optimal shooting field of view of one camera, and no field of view overlapping exists between the optimal shooting fields of view of all the cameras.
2. The method of claim 1, wherein prior to determining the pedestrian target in the successive video images captured by each camera using the convolutional neural network model for target pedestrian detection, further comprising:
acquiring a plurality of video image samples, wherein the video image samples comprise position information of a pedestrian target in the video image samples;
and training a pedestrian target head detection model according to the plurality of video image samples to obtain the neural network model.
3. The method of claim 2, wherein training a pedestrian target head detection model from the plurality of video image samples comprises:
and training the pedestrian target head detection model by using a transfer learning algorithm.
4. The method of claim 1, wherein said determining the position of the pedestrian's head in each of said video images using said neural network model comprises:
determining a first set of positions of the pedestrian's head in each of the video images using the global neural network model;
determining a second set of positions of the pedestrian's head in each of the video images using the local neural network model;
the determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image comprises the following steps:
determining a first target position set of each video image according to the first position set of the video image and the optimal shooting visual field corresponding to the video image;
and determining a second target position set of each video image according to the second position set of the video image and the optimal shooting visual field corresponding to the video image.
5. The method of claim 4, wherein the matching and tracking of pedestrian targets in successive video images captured by a plurality of cameras comprises:
and matching and tracking the pedestrian target in the continuous video images acquired by each camera by using a Hungarian algorithm.
6. The method according to claim 5, wherein the matching and tracking of the pedestrian target in the continuous video images acquired by each camera by using the Hungarian algorithm specifically comprises:
matching and tracking the pedestrian target by utilizing the Hungarian algorithm according to the similarity of position frames in the first target position set and the second target position set of the video image acquired by each camera;
wherein the similarity of the position frames is determined according to the following formula:
$$\mathrm{IoU}(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}, \quad i = 1, 2, \dots, N,\ j = 1, 2, \dots, M$$
wherein $O_i$ represents the ith position box in the first target position set, $N$ represents the total number of position boxes in the first target position set, $O_j$ represents the jth position box in the second target position set, and $M$ represents the total number of position boxes in the second target position set.
7. The method of claim 6, wherein the matching and tracking of pedestrian targets in successive video images captured by a plurality of cameras further comprises:
if the unmatched position frame in the first target position set meets a first preset condition, determining that the pedestrian corresponding to the position frame just enters the best shooting view corresponding to the first target position set;
if the position frame in the second target position set meets a second preset condition, determining that the pedestrian corresponding to the position frame leaves the best shooting view field corresponding to the second target position set;
the first preset condition and the second preset condition are determined according to the moment when the pedestrian leaves and enters the best shooting visual field and the track distance.
8. A pedestrian target tracking device, comprising:
the device comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring continuous video images shot by a plurality of cameras, and each camera is preset with an optimal shooting view;
the determining module is used for determining the pedestrian target in the continuous video images acquired by each camera by adopting a convolutional neural network model for target pedestrian detection;
the tracking module is used for matching and tracking the pedestrian target in the continuous video images acquired by the cameras;
the determining module is specifically configured to determine the position of the head of the pedestrian in each of the video images by using the neural network model; determining the position of the pedestrian target according to the position of the head of the pedestrian and the optimal shooting visual field corresponding to the video image;
the neural network model comprises a global neural network model and a local neural network model;
the global neural network model is used for determining the position of the head of the pedestrian in the global range of the video image;
the local neural network model is used for determining the head position of the pedestrian within a preset range of the position determined in the previous frame;
the device further comprises:
the dividing module is used for dividing the visual fields of the cameras with overlapped visual fields in the multiple cameras to determine the best shooting visual field of each camera before acquiring the continuous video images shot by the multiple cameras, so that the same pedestrian target is only positioned in the best shooting visual field of one camera, and the visual fields of the cameras are not overlapped.
9. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the pedestrian object tracking method of any one of claims 1-7.
10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the pedestrian target tracking method of any one of claims 1-7.
CN201811386432.5A 2018-11-20 2018-11-20 Pedestrian target tracking method, device and equipment Active CN109598743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811386432.5A CN109598743B (en) 2018-11-20 2018-11-20 Pedestrian target tracking method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811386432.5A CN109598743B (en) 2018-11-20 2018-11-20 Pedestrian target tracking method, device and equipment

Publications (2)

Publication Number Publication Date
CN109598743A CN109598743A (en) 2019-04-09
CN109598743B true CN109598743B (en) 2021-09-03

Family

ID=65960141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811386432.5A Active CN109598743B (en) 2018-11-20 2018-11-20 Pedestrian target tracking method, device and equipment

Country Status (1)

Country Link
CN (1) CN109598743B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443116B (en) * 2019-06-19 2023-06-20 平安科技(深圳)有限公司 Video pedestrian detection method, device, server and storage medium
CN110443228B (en) * 2019-08-20 2022-03-04 图谱未来(南京)人工智能研究院有限公司 Pedestrian matching method and device, electronic equipment and storage medium
CN110852219B (en) * 2019-10-30 2022-07-08 广州海格星航信息科技有限公司 Multi-pedestrian cross-camera online tracking system
CN111091584B (en) * 2019-12-23 2024-03-08 浙江宇视科技有限公司 Target tracking method, device, equipment and storage medium
CN111898435A (en) * 2020-06-29 2020-11-06 北京大学 Pedestrian identification method and device based on video, storage medium and terminal
CN111784741B (en) * 2020-06-29 2024-03-29 杭州海康威视数字技术股份有限公司 Method and system for target cross-mirror distributed tracking
CN111767857A (en) * 2020-06-30 2020-10-13 电子科技大学 Pedestrian detection method based on lightweight two-stage neural network
CN111967370B (en) * 2020-08-12 2021-12-07 广州小鹏自动驾驶科技有限公司 Traffic light identification method and device
CN112861711A (en) * 2021-02-05 2021-05-28 深圳市安软科技股份有限公司 Regional intrusion detection method and device, electronic equipment and storage medium
CN113033353A (en) * 2021-03-11 2021-06-25 北京文安智能技术股份有限公司 Pedestrian trajectory generation method based on overlook image, storage medium and electronic device
CN113033551A (en) * 2021-03-16 2021-06-25 北京嘀嘀无限科技发展有限公司 Object detection method, device, equipment and storage medium
CN113628248B (en) * 2021-08-11 2024-04-09 云从科技集团股份有限公司 Pedestrian residence time length determining method and device and computer readable storage medium
CN116311107B (en) * 2023-05-25 2023-08-04 深圳市三物互联技术有限公司 Cross-camera tracking method and system based on reasoning optimization and neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106803263A (en) * 2016-11-29 2017-06-06 深圳云天励飞技术有限公司 A kind of method for tracking target and device
CN106875428A (en) * 2017-01-19 2017-06-20 博康智能信息技术有限公司 A kind of multi-object tracking method and device
CN107784279A (en) * 2017-10-18 2018-03-09 北京小米移动软件有限公司 Method for tracking target and device
CN108198200A (en) * 2018-01-26 2018-06-22 福州大学 The online tracking of pedestrian is specified under across camera scene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720257B2 (en) * 2005-06-16 2010-05-18 Honeywell International Inc. Object tracking system

Also Published As

Publication number Publication date
CN109598743A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
CN109598743B (en) Pedestrian target tracking method, device and equipment
Yu et al. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline
CN108447091B (en) Target positioning method and device, electronic equipment and storage medium
CN108470332B (en) Multi-target tracking method and device
Tran et al. Video event detection: From subvolume localization to spatiotemporal path search
Olagoke et al. Literature survey on multi-camera system and its application
Jin et al. Ellipse proposal and convolutional neural network discriminant for autonomous landing marker detection
Bashar et al. Multiple object tracking in recent times: A literature review
Gao et al. Tracking video objects with feature points based particle filtering
CN114067428A (en) Multi-view multi-target tracking method and device, computer equipment and storage medium
Xu et al. A real-time, continuous pedestrian tracking and positioning method with multiple coordinated overhead-view cameras
Pais et al. Omnidrl: Robust pedestrian detection using deep reinforcement learning on omnidirectional cameras
CN116912517B (en) Method and device for detecting camera view field boundary
CN111553339A (en) Image unit determination method, small target detection method and computer equipment
CN111382606A (en) Tumble detection method, tumble detection device and electronic equipment
Demirkus et al. People detection in fish-eye top-views
CN111105436A (en) Target tracking method, computer device, and storage medium
CN115620098B (en) Evaluation method and system of cross-camera pedestrian tracking algorithm and electronic equipment
CN112562315A (en) Method, terminal and storage medium for acquiring traffic flow information
Truong et al. Single object tracking using particle filter framework and saliency-based weighted color histogram
Elassal et al. Unsupervised crowd counting
Luo et al. Complete trajectory extraction for moving targets in traffic scenes that considers multi-level semantic features
Lo Presti et al. Depth-aware multi-object tracking in spherical videos
CN112270257A (en) Motion trajectory determination method and device and computer readable storage medium
Xu et al. Accurate registration of multitemporal UAV images based on detection of major changes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant