CN115511886B - Method, device and storage medium for realizing remote target statistics by using GPU - Google Patents


Info

Publication number
CN115511886B
CN115511886B (application CN202211462183.XA)
Authority
CN
China
Prior art keywords
message queue
gpu
key frame
network image
image
Prior art date
Legal status
Active
Application number
CN202211462183.XA
Other languages
Chinese (zh)
Other versions
CN115511886A (en)
Inventor
刘伟
杜文华
李彪
曹伟
Current Assignee
Nanjing Sietium Semiconductor Co ltd
Original Assignee
Yantai Xintong Semiconductor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Yantai Xintong Semiconductor Technology Co., Ltd.
Priority to CN202211462183.XA
Publication of CN115511886A
Application granted
Publication of CN115511886B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering the load
    • G06T5/70
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G06T7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30242 Counting objects in image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention discloses a method, a device and a storage medium for realizing remote target statistics by using a GPU. The method may include the following steps: receiving network image data of a moving object acquired by a front-end camera to generate a network image message queue, and distributing the network image message queue to a corresponding GPU in a load-balancing manner; performing image analysis and recognition on the network image message queue with GPU parallel processing to obtain a key-frame message queue, and marking each key frame in the key-frame message queue with a system timestamp; reading each key frame in the key-frame message queue to obtain a statistical count of the moving object, and calculating the running speed and acceleration of the moving object from the time difference between two key frames; and if the running speed and acceleration of the moving object exceed the set thresholds, raising an alarm or outputting the statistical result.

Description

Method, device and storage medium for realizing remote target statistics by using GPU
Technical Field
The embodiment of the invention relates to the field of signal-processing software, and in particular to a method, a device and a storage medium for realizing remote target statistics by using a GPU.
Background
Some working devices leave the factory without any built-in function for counting moving objects or measuring the rotating speed of a moving part, so an independent external device must be installed for measurement. A common approach is to capture images with a camera and then identify the target to produce statistics. At present there are two common methods. The first is to install a camera at the front end of the measured object, send the captured images to a terminal computer over a network, and perform target identification on the terminal computer using serial image processing, which then raises an alarm or outputs statistics. Its advantage is low cost: only one host and one network camera are needed. Its disadvantage is that image recognition on the host requires a large amount of computation and takes a long time; untimely processing causes valid targets to be lost and results to be inaccurate, especially for high-speed targets. The second method is to install an embedded device at the measured-device end, directly acquire Serial Digital Interface (SDI) or High Definition Multimedia Interface (HDMI) signals with an image acquisition module, perform image recognition with a Field Programmable Gate Array (FPGA), and then transmit the key frames to a terminal computer over the network. Its advantage is high temporal efficiency with almost no target loss. Its disadvantage is that equipment must be added at the front end; often there is not enough space for such equipment, especially in field environments, and the more detection points there are, the more front-end devices are needed.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method, an apparatus, and a storage medium for implementing remote target statistics using a GPU, which can improve the efficiency of network-image-data recognition while reducing hardware cost and the packet loss rate of the network image data.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for implementing remote target statistics using a GPU, including:
receiving network image data of a moving object acquired by a front-end camera to generate a network image message queue, and distributing the network image message queue to a corresponding GPU in a load-balancing manner;
performing image analysis and recognition on the network image message queue with GPU parallel processing to obtain a key-frame message queue, and marking each key frame in the key-frame message queue with a system timestamp;
reading each key frame in the key-frame message queue to obtain a statistical count of the moving object, and calculating the running speed and acceleration of the moving object from the time difference between two key frames;
and if the running speed and acceleration of the moving object exceed the set thresholds, raising an alarm or outputting the statistical result.
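The load-balancing dispatch in the first step above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the class name, the per-GPU queues, and the "least-loaded queue" policy are all assumptions standing in for whatever balancing rule an implementation actually uses.

```python
from collections import deque

class GpuDispatcher:
    """Distribute incoming network-image frames across per-GPU message
    queues, always choosing the queue that currently holds the fewest
    frames (one simple load-balancing policy)."""

    def __init__(self, num_gpus):
        self.queues = [deque() for _ in range(num_gpus)]

    def dispatch(self, frame):
        # Pick the GPU whose message queue is currently shortest;
        # ties go to the lowest-numbered GPU.
        target = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        self.queues[target].append(frame)
        return target

dispatcher = GpuDispatcher(num_gpus=3)
for frame_id in range(7):
    dispatcher.dispatch(frame_id)

print([len(q) for q in dispatcher.queues])  # frames spread evenly: [3, 2, 2]
```

Other policies (round robin, weighted by GPU capability) drop in by replacing the `min(...)` selection line.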
In a second aspect, an embodiment of the present invention provides an apparatus for implementing remote target statistics using a GPU, the apparatus comprising a receiving part, an identifying part, a calculating part and an alarm part, wherein:
the receiving part is configured to receive the network image data of the moving object acquired by the front-end camera to generate a network image message queue, and to distribute the network image message queue to the corresponding GPU in a load-balancing manner;
the identifying part is configured to perform image analysis and recognition on the network image message queue with GPU parallel processing, so as to obtain a key-frame message queue and mark each key frame in the key-frame message queue with a system timestamp;
the calculating part is configured to read each key frame in the key-frame message queue to obtain a statistical count of the moving object, and to calculate the running speed and acceleration of the moving object from the time difference between two key frames;
the alarm part is configured to raise an alarm or output the statistical result if the running speed and acceleration of the moving object exceed the set thresholds.
In a third aspect, embodiments of the present invention provide a computing device comprising a communication interface, a memory and a processor, the components being coupled together by a bus system, wherein:
the communication interface is used for receiving and transmitting signals while exchanging information with other external network elements;
the memory is used for storing a computer program capable of running on the processor;
the processor is configured to execute the steps of the method for implementing remote target statistics using a GPU according to the first aspect when running the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium storing a program for implementing remote target statistics using a GPU, where the program, when executed by at least one processor, implements the steps of the method according to the first aspect.
The embodiment of the invention provides a method, a device and a storage medium for realizing remote target statistics using a GPU. A high-speed network camera is installed at the front end of the moving object, and the captured network image data is transmitted through a 10-gigabit switch and a 10-gigabit network card at the communication layer. The terminal device receives the network image data and generates a network image message queue; image analysis and recognition are performed on the network image message queue with GPU parallel processing to obtain a key-frame message queue; each key frame in the key-frame message queue is read to obtain a statistical count of the moving object; the running speed and acceleration of the moving object are calculated from the time difference between two key frames; and the calculated running speed and acceleration are compared with preset thresholds, raising an alarm or outputting the statistical result if they exceed the thresholds. The method requires no additional front-end equipment and is suitable for monitoring a wider range of environments; the algorithm is easy to maintain and update, and if more monitoring points are to be added, only a graphics card needs to be added or upgraded in the terminal device. Compared with a traditional serial CPU image-processing algorithm, the large number of work items in the GPU can be used for parallel processing to recognize network image data rapidly, reducing the recognition time and packet loss rate on the terminal device, thereby improving recognition efficiency while reducing hardware cost and the packet loss rate of the network image data.
Drawings
FIG. 1 is a schematic diagram of a statistical system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a computer system of a terminal device according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an example implementation of the CPU, GPU, and system memory in FIG. 2;
FIG. 4 is a schematic diagram illustrating a computing unit of a GPU according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for implementing remote target statistics using a GPU according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating an operation of implementing remote target statistics using a GPU according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an apparatus for implementing remote target statistics using a GPU according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of another apparatus for implementing remote target statistics using a GPU according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to FIG. 1, which shows a schematic diagram of a statistical system that can implement embodiments of the present invention, it should be noted that the system shown in FIG. 1 is just one example of a possible system, and embodiments of the present disclosure may be implemented in any of a variety of systems as desired.
As shown in FIG. 1, the statistical system includes a camera 11, a switch 12, and a terminal device 13. The camera 11 may be a high-speed network camera, mainly used for collecting network image data of a moving object or of the rotating speed of a moving device; the switch 12 is used for transmitting the network image data captured by the camera 11; the terminal device 13 may include, but is not limited to, a network card, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) and a memory, and is mainly used for receiving and recognizing the network image data.
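The camera-to-terminal path above ends with the terminal device turning received network packets into a message queue. A minimal sketch of that receiving step, with the network card stubbed out as a thread feeding a thread-safe queue (the byte-string frames are placeholder data, not a real camera protocol):

```python
import queue
import threading

image_queue = queue.Queue()   # the "network image message queue"

def receiver(packets):
    # Stand-in for the network card: each received packet
    # becomes one entry in the message queue.
    for p in packets:
        image_queue.put(p)

t = threading.Thread(target=receiver, args=([b"frame0", b"frame1", b"frame2"],))
t.start()
t.join()
print(image_queue.qsize())  # 3
```

`queue.Queue` is thread-safe, so a real receiver thread could keep filling it while GPU worker threads drain it concurrently.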
Referring to FIG. 2, which shows a computer system composition 100 capable of implementing the embodiments of the present invention, it is noted that the illustrated system is merely one example of a possible system, and embodiments of the present invention may be implemented in any of a variety of systems as desired. The computer system 100 may be any type of computing device, including but not limited to a desktop computer, a server, a workstation, a laptop computer, a computer-based emulator, a wireless device, a mobile or cellular telephone (including so-called smart phones), a Personal Digital Assistant (PDA), a video game console (including a video display, a mobile video game device, or a mobile video conferencing unit), a television set-top box, a tablet computing device, an electronic book reader, or a fixed or mobile media player. As shown in FIG. 2, the computer system 100 may include a CPU 10, a GPU 20, a memory 30, a display processor 40, a display 41, and a communication interface 50. The display processor 40 may be part of the same integrated circuit (IC) as GPU 20, may be external to the one or more ICs comprising GPU 20, or may be formed in an IC external to the IC comprising GPU 20.
In particular, CPU 10 may include a general-purpose or special-purpose processor that controls the operation of computer system 100 and is configured to process instructions of a computer program for execution. A user may communicate, via the communication interface 50, with another input device (not shown) coupled to the computer system 100, such as a trackball, keyboard, mouse, microphone, touch pad, touch screen, or other type of device such as a switch interface, to provide input to CPU 10 and cause CPU 10 to execute instructions of one or more software applications. Applications executing on CPU 10 may include graphical user interface (GUI) applications, operating systems, portable graphics applications, computer-aided design applications for engineering or art, video game applications, word processor applications, email applications, spreadsheet applications, media player applications, or rendering applications using 2D or 3D graphics; embodiments of the present invention take the execution of a graphics rendering application as an example. In addition, the rendering application executing on CPU 10 may include one or more graphics rendering instructions (which may also be understood as including one or more of the graphics in the frame of the picture to be rendered), which may conform to a graphics application programming interface (API), such as the OpenGL API, the OpenGL ES API, the Direct3D API, the X3D API, the RenderMan API, the WebGL API, the Open Computing Language (OpenCL) API, RenderScript, any other heterogeneous computing API, or any other public or proprietary standard graphics or computing API, as will be described in the following.
GPU 20 may be configured to perform graphics operations to render one or more graphics primitives to display 41 for presentation. It will be appreciated that CPU 10, by controlling GPU driver 13, translates rendering instructions into rendering commands readable by GPU 20, and GPU 20 then renders and presents one or more graphics primitives on display 41 based on the received graphics rendering commands, which include but are not limited to graphics commands and graphics data such as rendering commands, state information, primitive information, and texture information, so that GPU 20 executes some or all of the graphics rendering commands. In some cases, GPU 20 may be built with a highly parallel structure that processes complex graphics-related operations more efficiently than CPU 10. For example, GPU 20 may include a plurality of processing elements configured to operate on multiple vertices or pixels in a parallel manner. In some cases, the highly parallel nature of GPU 20 allows it to draw graphical images (e.g., GUIs and two-dimensional (2D) and/or three-dimensional (3D) graphics scenes) onto display 41 more quickly than CPU 10 could. In some cases, GPU 20 may be integrated into the motherboard of the target device; in other cases, GPU 20 may reside on a graphics card installed in a port in the motherboard, or may be incorporated within a peripheral device configured to interoperate with the target device. GPU 20 may include one or more processors, such as one or more microprocessors, Application-Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), or other equivalent integrated or discrete logic circuitry. GPU 20 may also include one or more processor cores, in which case GPU 20 may be referred to as a multi-core processor.
Memory 30 is configured to store application instructions capable of running on CPU 10, graphics data required for execution by GPU 20, and execution result data thereof. For example, GPU 20 may store the fully formed image in memory 30. Memory 30 may include one or more volatile or nonvolatile memory or storage devices such as Random Access Memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), erasable Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), flash memory, magnetic data media, or optical storage media. Display processor 40 may retrieve the image from memory 30 and output values that illuminate pixels of display 41 to display the image. Display 41 may be a display of computer system 100 that displays graphical image content generated by GPU 20. The display 41 may be a Liquid Crystal Display (LCD), an organic light emitting diode display (OLED), a Cathode Ray Tube (CRT) display, a plasma display, or another type of display device.
FIG. 3, in conjunction with FIG. 2, further illustrates a block diagram of an implementation of the key components CPU 10, GPU 20, and memory 30 in computer system 100. As shown in FIG. 3, the block diagram primarily includes, but is not limited to, CPU 10, GPU 20, memory 30, and their corresponding internal components. CPU 10 includes application 11, graphics API 12, and GPU driver 13, where each of graphics API 12 and GPU driver 13 may serve one or more applications; in some examples graphics API 12 and GPU driver 13 may be implemented as hardware units of CPU 10, and GPU driver 13 may compile one or more graphics rendering instructions of CPU 10 into application commands executable by GPU 20. The internal structure of GPU 20 includes, but is not limited to, graphics memory 21 and processor cluster 22. In embodiments of the present invention, graphics memory 21 may be part of GPU 20; thus GPU 20 may read data from and write data to graphics memory 21 without using a bus. In other words, GPU 20 may process data locally using local storage instead of off-chip memory, and such graphics memory 21 may be referred to as on-chip memory. This allows GPU 20 to operate more efficiently by eliminating the need to read and write data over a bus, which may experience heavy traffic. In some examples, GPU 20 may not include separate memory, but instead utilize the external memory 30 via a bus. The type of graphics memory 21 is as described for memory 30 above and is not repeated here. Processor cluster 22 is used to execute the graphics processing pipeline, to decode graphics rendering commands, and to configure the graphics processing pipeline to perform the operations specified in those commands. Memory 30 may include a system memory 31 and a display memory 32.
The display memory 32 may be part of the system memory 31 or may be separate from it. The display memory 32, also referred to as video memory or the frame buffer, may store rendered image data, such as pixel data, typically stored as red, green, blue, alpha (RGBA) components for each pixel, where the "RGB" components correspond to color values and the "A" component corresponds to the destination alpha value (e.g., the opacity value used for image compositing). In some examples, the display memory 32 may also be referred to as a frame buffer or output buffer, which allows previewing the rendering effect of a frame; equivalently, the rendering effect may be achieved through a frame-buffer mechanism, where the frame buffer is a driver interface in kernel space that has no capability of computing or processing data itself and requires the support of a real graphics-card driver.
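The RGBA layout described above can be made concrete with a small sketch: one pixel packed as four 8-bit components into a 32-bit word. This is purely illustrative (real frame buffers are filled by the GPU and may use other component orders such as BGRA):

```python
# One pixel stored as RGBA components, 8 bits each, packed into 32 bits.
def pack_rgba(r, g, b, a):
    return (r << 24) | (g << 16) | (b << 8) | a

def unpack_rgba(p):
    return ((p >> 24) & 0xFF, (p >> 16) & 0xFF, (p >> 8) & 0xFF, p & 0xFF)

opaque_red = pack_rgba(255, 0, 0, 255)   # alpha 255 = fully opaque
print(hex(opaque_red))                   # 0xff0000ff
print(unpack_rgba(opaque_red))           # (255, 0, 0, 255)
```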
Regarding the architecture of a GPU supporting the Open Computing Language (OpenCL): Single Instruction Multiple Thread (SIMT) is the main mode of GPU parallel operation, that is, many threads execute the same operation instruction at the same time; the data of each thread may differ, but the executed operations are identical. The OpenCL platform consists of two parts: a host and one or more OpenCL devices. The host is generally a CPU and plays the role of organizer, which includes defining the Kernel function, assigning a context for the Kernel, and defining the NDRange and command queue. An OpenCL device is the computing device called by an OpenCL program and may be a CPU, a GPU, a digital signal processor (DSP), or any other processor supported by the OpenCL implementation. In the embodiment of the present invention, a GPU is taken as the example of an OpenCL device; the GPUs may be computing devices of the same vendor platform or of different vendor platforms, and there may be one or several such devices, because more GPUs provide more parallel computing units. From a hardware perspective, an OpenCL-capable GPU is divided into one or more computing units, each comprising one or more processing units, where each processing unit has an independent program counter.
In particular, in connection with processor cluster 22 in FIG. 3, reference is made to FIG. 4, which illustrates a schematic diagram of a computing unit 400 of a GPU capable of implementing the teachings of embodiments of the present invention. In some examples, the computing unit 400 can serve as one general-purpose processing cluster in a GPU's processor-cluster array for highly parallel computation, executing a large number of threads in parallel, each thread being an instance of a program. The computing unit 400 may include multiple thread processors, also referred to as processing units, each of which may correspond to a thread; the processing units are organized into thread bundles (warps). In some examples, a processing unit may be implemented as a work item, the most basic unit of computation. The computing unit 400 may contain J warps 404-1 through 404-J, each warp having K processing units 406-1 through 406-K. In some examples, the warps 404-1 through 404-J may be further organized into one or more thread blocks 402. In some examples, a thread block may be implemented as a work group, i.e., one work group made up of N work items.
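The work-group/work-item organization described here gives every work item a unique flat index; OpenCL's `get_global_id` is defined by exactly this arithmetic. A minimal emulation in plain Python (no real OpenCL device involved):

```python
def global_id(group_id, local_size, local_id):
    # OpenCL-style flat index: which work item this is across the whole NDRange.
    return group_id * local_size + local_id

# Work groups of 16 work items each, matching the K=16 example in the text:
LOCAL_SIZE = 16
ids = [global_id(g, LOCAL_SIZE, l) for g in range(2) for l in range(LOCAL_SIZE)]
print(ids[:4], ids[-1])  # [0, 1, 2, 3] 31
```

Two groups of 16 work items thus cover global indices 0 through 31 with no gaps or overlaps, which is what lets each work item process its own pixel or image region independently.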
In some examples, each warp may have 32 processing units; in other examples, each warp may have 4, 8, 16, or many more processing units. As shown in FIG. 4, the embodiment of the present invention is described taking the case where each warp has 16 processing units (i.e., K=16) as an example; it should be understood that this setting is used only for describing the technical solution and does not limit its protection scope, and those skilled in the art can easily adapt the solution to other cases, which is not repeated here. In some alternative examples, the computing unit 400 may organize the processing units only into warps, omitting the thread-block level of organization. Further, the computing unit 400 may also include a pipeline control unit 408, a shared memory 410, and an array of local memories 412-1 through 412-J associated with the warps 404-1 through 404-J. Pipeline control unit 408 distributes tasks to the warps 404-1 through 404-J over data bus 414, and creates, manages, schedules, executes, and provides mechanisms to synchronize the warps. With continued reference to the computing unit 400 shown in FIG. 4, the processing units within a warp execute in parallel with each other. The warps 404-1 through 404-J communicate with the shared memory 410 through the memory bus 416, and with local memories 412-1 through 412-J, respectively, through local buses 418-1 through 418-J; for example, as shown in FIG. 4, warp 404-J communicates over local bus 418-J to utilize local memory 412-J.
Some embodiments of the computing unit 400 allocate a shared portion of the shared memory 410 to each thread block 402 and allow all the warps within the thread block 402 to access that shared portion. Some embodiments include warps that use only local memory; many others include warps that balance the use of local memory and shared memory 410.
It should be noted that OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems, as well as a unified programming environment. It enables software developers to write efficient, portable code for high-performance computing servers, desktop computing systems and handheld devices, and it applies broadly to multi-core CPUs, GPUs, the Cell architecture and other parallel processors such as DSPs, with wide prospects in fields such as games, entertainment, scientific research and medicine.
A kernel is a parallel program executing on a computing device; the Kernel function is the entry function executed on an OpenCL device and is called from the host side. The Context defines the running environment of the whole OpenCL program, including the Kernel functions, devices, program objects and memory objects; the command queue controls how and when Kernel functions execute.
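The host-side flow just described (define a kernel, enqueue it over an NDRange, let every work item run the same code on its own index) can be mocked in plain Python to show the control structure. The names mirror the OpenCL concepts, but this is a sketch, not a real OpenCL binding:

```python
# Plain-Python mock of the host-side OpenCL flow (kernel + command queue).
# "SIMT": every work item runs the same kernel function on its own index.
class CommandQueue:
    def __init__(self):
        self.pending = []

    def enqueue(self, kernel, ndrange, *args):
        # Host side: schedule the kernel over an NDRange of work items.
        self.pending.append((kernel, ndrange, args))

    def finish(self):
        # Device side (emulated serially): run each pending kernel launch.
        results = []
        for kernel, ndrange, args in self.pending:
            results.append([kernel(gid, *args) for gid in range(ndrange)])
        self.pending.clear()
        return results

def square_kernel(gid, data):
    # Each work item squares the element selected by its global id.
    return data[gid] * data[gid]

q = CommandQueue()
q.enqueue(square_kernel, 4, [1, 2, 3, 4])
print(q.finish())  # [[1, 4, 9, 16]]
```

On a real OpenCL device the inner list comprehension would run as thousands of concurrent work items; the host-visible structure (enqueue, then wait for completion) stays the same.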
As described above, when a camera is installed at the front end of the moving object, the captured images are sent to the terminal device over the network, and the host CPU of the terminal device performs target identification using serial image processing and produces an alarm or statistical result. For some high-speed images, because image identification requires a large amount of computation on the CPU, processing takes a long time; untimely processing can cause valid images of the moving object to be lost, making the statistical result inaccurate. Based on this, referring to FIG. 5, an embodiment of the present invention provides a method for implementing remote target statistics using a GPU, the method comprising:
S501: receiving network image data of a mobile object acquired by a front-end camera to generate a network image message queue, and distributing the network image message queue to a corresponding GPU in a load balancing mode;
S502: performing image analysis and identification on the network image message queue by adopting GPU parallel processing to acquire a key frame message queue and marking a system time stamp on each key frame in the key frame message queue;
S503: reading each key frame in the key frame message queue to obtain the statistical count of the mobile object, and calculating the running speed and the acceleration of the mobile object according to the time difference of the two key frames;
S504: if the running speed and the acceleration of the moving object exceed the set thresholds, giving an alarm or giving a statistical result.
According to the above description, the embodiment of the present invention installs a high-speed network camera at the front end of the moving object, and the captured network image data is transmitted through a 10-gigabit switch and a 10-gigabit network card in the communication layer. The terminal device receives the network image data through its network card and generates a network image message queue; GPU parallel processing is used to analyze and identify the network image message queue to obtain a key frame message queue; each key frame in the key frame message queue is read to obtain a statistical count of the moving object, and the running speed and acceleration of the moving object are calculated according to the time difference between two key frames; the calculated running speed and acceleration are compared with preset thresholds, and if they exceed the thresholds an alarm or a statistical result is given. The method requires no additional equipment and is suitable for monitoring a wide range of environments. The algorithm is easy to maintain and update: if more monitoring points are to be added, it is only necessary to add or upgrade a graphics card in the terminal device. Compared with the traditional serial image-processing algorithm on a CPU, the large number of work-items processing in parallel in the GPU can rapidly identify the network image data, reducing the identification time and the packet loss rate at the terminal device, thereby improving the efficiency of network image data identification and reducing the hardware cost and the packet loss rate of the network image data.
For the technical solution shown in fig. 5, before processing the network image data received from the network card, image parameter information that needs to identify the network image is written into the memory of the GPU in advance, and in some examples, the method further includes:
loading image parameter information required by network image identification and creating a kernel function running on the GPU according to the kernel file; wherein the image parameter information at least comprises the aspect ratio of the image and the maximum and minimum area information;
initializing equipment information of the GPU according to the image parameter information.
For the above example, the image parameter information required by network image identification is loaded and a kernel function running on the GPU is created from the kernel file. Specifically, the image parameter information includes the aspect ratio of the image, the maximum and minimum area information, the image correction coordinate points, the three-dimensional convolution kernel information for image denoising, and a binarization threshold. The convolution kernel for image denoising is a 3×3 kernel; a user can input the convolution kernel data manually during debugging, and once determined, the data can be stored into the GPU memory for subsequent use.
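As a sketch of how such loaded parameters might be organized on the host side, the following C struct groups the fields named above. All names and the debug values are illustrative assumptions, not part of the patent:

```c
#include <assert.h>

/* Hypothetical container for the image parameter information described
 * above: aspect-ratio bounds, min/max target area, a 3x3 denoising
 * convolution kernel and a binarization threshold. */
typedef struct {
    float aspect_min, aspect_max;   /* allowed aspect-ratio range        */
    int   area_min, area_max;       /* allowed target area (pixels)      */
    float conv_kernel[9];           /* 3x3 denoising convolution kernel  */
    int   bin_threshold;            /* binarization threshold (0..255)   */
} ImageParams;

/* Fill in debug values, e.g. a 3x3 averaging kernel entered manually
 * during debugging, as the text above describes. */
static ImageParams make_debug_params(void) {
    ImageParams p = { .aspect_min = 0.5f, .aspect_max = 2.0f,
                      .area_min = 100, .area_max = 10000,
                      .bin_threshold = 128 };
    for (int i = 0; i < 9; ++i)
        p.conv_kernel[i] = 1.0f / 9.0f;   /* averaging kernel */
    return p;
}
```

Once determined, a buffer of this shape would be copied into GPU memory (e.g. via clCreateBuffer / clEnqueueWriteBuffer as described later) for subsequent use.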
The kernel functions running on the GPU are created according to the kernel files. Specifically, the programs running on the GPU are called kernel functions; the kernel functions are written in OpenCL in a separate file with the suffix .cl. In some examples, the OpenCL program is preferably built by calling the clBuildProgram function, and the build information of each OpenCL device, i.e. the GPU, is returned by calling the clGetProgramBuildInfo function.
For the above example, the initializing the device information of the GPU according to the image parameter information, specifically, the initializing the device information of the GPU includes: the method comprises the steps of calling an OpenCL API to create an OpenCL device, creating an OpenCL context, creating an OpenCL memory object through a clCreateBuffer according to the OpenCL context, creating a message queue through a clCreateCommandQueue, loading an established OpenCL kernel file according to an algorithm required by image detection, and starting a key frame processing thread, a network image processing thread and a network image receiving thread in sequence.
For the technical solution shown in fig. 5, in some possible implementations, the receiving network image data of a mobile object acquired by a front-end camera to generate a network image message queue, and distributing the network image message queue to a corresponding GPU in a load balancing manner includes:
capturing the network image data through a network card and writing the network image data into a network image message queue;
and distributing the network image message queues to the corresponding GPUs in a load balancing mode according to the number of the network image message queues and the GPU.
For the above implementation manner, in some examples, the network image data is captured through a network card and written into a network image message queue. Specifically, a high-speed network camera is installed at the front end of the mobile object, and the captured network image data, that is, each frame of image, is transmitted to the terminal device through a front-end image acquisition and transmission module; the image acquisition and transmission module comprises a 10-gigabit switch and a 10-gigabit network card and is used for transmitting and receiving the network image data of the measured object. The terminal device receives the network image data from the network card through a terminal image receiving module, where the network image data serves as the original image data source for subsequent network image data processing. The network image data may be pixel data whose specific data information is the red, green, blue, alpha (RGBA) components of each pixel, where the "RGB" components correspond to color values and the "A" component corresponds to a destination alpha value (e.g., an opacity value for image synthesis).
For the above implementation manner, in some examples, the network image message queues are allocated to the corresponding GPUs in a load balancing manner according to the number of the network image message queues and the GPUs, specifically, for example, assuming that there are 100 frames of images in the network image message queues, the terminal device includes 2 OpenCL devices, that is, 2 GPUs, and each GPU processes 50 frames of images respectively.
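The even split in the example above (100 frames over 2 GPUs giving 50 each) can be sketched as a small helper; the function name and the policy of giving remainder frames to the lowest-numbered GPUs are illustrative assumptions:

```c
#include <assert.h>

/* Load-balanced split of the frames in the network image message queue
 * across the available GPUs: every GPU gets total/num frames, and when
 * the count is not evenly divisible, the first (total % num) GPUs each
 * receive one extra frame. */
static int frames_for_gpu(int total_frames, int num_gpus, int gpu_index) {
    int base  = total_frames / num_gpus;   /* minimum share per GPU          */
    int extra = total_frames % num_gpus;   /* first `extra` GPUs get one more */
    return base + (gpu_index < extra ? 1 : 0);
}
```

With 100 frames and 2 GPUs this yields 50 frames per GPU, matching the example in the text.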
For the technical solution shown in fig. 5, in some possible implementations, performing image analysis and recognition on the network image message queue by using GPU parallel processing to obtain a key frame message queue and system time stamping each key frame in the key frame message queue includes:
writing the network image message queue into a GPU memory as a memory object;
the kernel function for processing the network image message queue is issued to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue;
and calling a network image message queue in the GPU memory, analyzing and identifying by adopting GPU parallel processing to obtain a key frame message queue, and marking a system time stamp on each key frame in the key frame message queue.
For the above implementation, in some examples, the network image message queue is written into the GPU memory as a memory object. Specifically, a memory object may be created by the clCreateBuffer function and the network image message queue bound to the corresponding memory object; a write command is implemented by calling the clEnqueueWriteBuffer function, and the memory object is written into the memory of the OpenCL device, namely the GPU memory.
It should be noted that a memory object is a variable required by the computing device to execute the OpenCL program; it may also be understood that a memory object is OpenCL data, generally stored in the OpenCL device memory, which can be written to or read from, and includes cache objects and image objects. Cache objects are stored sequentially in contiguous memory blocks and can be accessed directly by means of pointers, arrays and the like; an image object is a two-dimensional or three-dimensional memory object that can only be accessed through the functions read_image and write_image, and it can be readable or writable, but not both.
For the above implementation manner, in some examples, the kernel function that processes the network image message queue is issued to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue, specifically, the network image message queue is downloaded to a memory of the OpenCL device through an OpenCL download interface, and the composed image processing kernel function is issued to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue for subsequent parallel packet operations.
For the above implementation, in some examples, the network image message queue in the GPU memory is invoked, and GPU parallel processing is adopted to analyze and identify it to obtain a key frame message queue and stamp each key frame in the key frame message queue with a system time stamp. Taking image convolution as an example, the implementation steps of GPU parallel processing for the network image message queue include: 1. acquiring the platform information; 2. acquiring the device information; 3. creating a context; 4. creating a device message command queue for the OpenCL device with device identification ID 0; 5. creating cache objects; 6. loading the kernel file and creating the convolution computing kernel; 7. setting the kernel parameters; 8. executing the convolution kernel to output an image with the background noise removed by convolution. The specific code implementation is as follows:
step 1: platform information is acquired;
cl_int status = clGetPlatformIDs(0, NULL, &num_platform);
if (status != CL_SUCCESS) {
std::cout << "error: Getting platforms failed." << std::endl;
return nullptr;
}
cl_platform_id *platforms = nullptr;
if (num_platform > 0) {
platforms =
(cl_platform_id * )malloc(num_platform * sizeof(cl_platform_id));
clGetPlatformIDs(num_platform, platforms, NULL);
}
return platforms;
It should be noted that OpenCL implementations of different vendors define different OpenCL platforms, through which a computer can interoperate with an OpenCL device; the OpenCL platforms include AMD, Nvidia and Intel. OpenCL uses an installable client driver model, which allows platforms of different vendors to coexist in the same system.
Step 2: acquiring device information;
cl_device_id *devices = NULL;
cl_uint num_devices = 0;
cl_int status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
if (num_devices > 0) {
    devices = (cl_device_id *)malloc(num_devices * sizeof(cl_device_id));
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);
}
return devices;
it should be noted that, the Device refers to a computing Device called by an OpenCL program, and may be a CPU, a GPU, or a DSP, and any other processor supported by an OpenCL developer.
Step 3: creating an OpenCL device context;
cl_context context = clCreateContext(NULL, 1, devices, NULL, NULL, NULL);
the parameter devices may be a plurality of OpenCL devices, or may be understood as a plurality of GPUs.
Step 4: creating a device message command queue for an OpenCL device with a device identification ID of 0;
cl_command_queue commandQueue = clCreateCommandQueue(context, device, 0, NULL);
Step 5: creating a cache object;
input_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, imgsize, NULL, &err);
The input_buffer cache object is used to store each image in the network image message queue, which serves as the original image for the computation.
output_grayguss = clCreateBuffer(context, CL_MEM_WRITE_ONLY, width * height, NULL, &err);
The output_grayguss cache object is used to store the convolution-computed image.
Step 6: loading a kernel file and creating a convolution computing kernel;
cl_program program = CLR2::load_program(context, "smooth.cl", 1);
cl_kernel kernel_guss = clCreateKernel(program, "kernel_guss", &err);
Step 7: setting kernel parameters;
clSetKernelArg(kernel_guss, 0, sizeof(cl_mem), &output_graybuff);
clSetKernelArg(kernel_guss, 1, sizeof(cl_mem), &output_grayguss);
clSetKernelArg(kernel_guss, 2, sizeof(cl_mem), &filter_in);
clSetKernelArg(kernel_guss, 3, sizeof(cl_mem), &imgheight);
clSetKernelArg(kernel_guss, 4, sizeof(cl_mem), &imgwidth);
where output_graybuff is the gray-converted image and filter_in is the convolution kernel.
Step 8: performing convolution kernel to output an image with the background noise removed by convolution;
size_t localThreads[2] = {32, 4};   // arrangement of work-items within a work-group
size_t globalThreads[2] = {((width + localThreads[0] - 1) / localThreads[0]) * localThreads[0],
                           ((height + localThreads[1] - 1) / localThreads[1]) * localThreads[1]};  // overall arrangement
err = clEnqueueNDRangeKernel(commandQueue, kernel_guss,   // launch the kernel
                             2, NULL, globalThreads, localThreads,
                             0, NULL, &evt);  // when the kernel completes, evt is set to CL_SUCCESS/CL_COMPLETE
clWaitForEvents(1, &evt);   // wait for the command event to occur
clReleaseEvent(evt);        // release the event
It should be noted that NDRange is the main interface through which an ordinary host CPU invokes a Kernel function Kernel on an OpenCL device; other implementations actually exist, and NDRange is mainly used for grouped operation. The dimension of the thread index space and the work-group size need to be set first. The Kernel function Kernel is placed in a queue through the function clEnqueueNDRangeKernel, but it is not guaranteed to execute immediately; the OpenCL driver manages the queue and schedules the execution of the Kernel function Kernel. The clEnqueueNDRangeKernel function places the Kernel function Kernel to be executed in the designated command queue, and the clWaitForEvents function waits for command events to occur. The parameter event_wait_list can be used to select certain events, and only after those events have completed can the Kernel function Kernel execute; that is, synchronization among different Kernel functions Kernel can be realized through the event mechanism.
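The global work size passed to clEnqueueNDRangeKernel above is the image size rounded up to a multiple of the work-group size, so that the NDRange fully covers the image. That rounding can be checked in plain host-side C (the helper name is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Round `value` up to the nearest multiple of `multiple`; this is the
 * ((x + m - 1) / m) * m expression used to build globalThreads above. */
static size_t round_up(size_t value, size_t multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}
```

For example, a 100-pixel-wide image with a 32-wide work-group yields a global size of 128, so the kernel's bounds check discards the extra work-items.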
Since most OpenCL resources are pointers, they need to be released when no longer in use. These resources are also released automatically when the program shuts down. The functions that release resources have the form clRelease{Resource}, such as the clReleaseProgram, clReleaseMemObject and clReleaseEvent functions.
The following is a partial code implementation of the convolution processing gpu_kernel:
__kernel void kernel_guss(__global uchar *grayImage,
                          __global uchar *gussImage,
                          __global float *filter_in,
                          __global unsigned *const p_height,
                          __global unsigned *const p_width)
{
    int icol = get_global_id(0);   // column index of this work-item
    int irow = get_global_id(1);   // row index of this work-item
    int height = *p_height;
    int width = *p_width;
    if ((icol < width && icol > 0) && (irow < (height - 1) && irow > 0))
    {
        int lineBytes = width * 1;               // bytes per image row
        int iprevious = (irow - 1) * lineBytes;  // offset of the previous row
        int current = irow * lineBytes;          // offset of the current row
        int next = (irow + 1) * lineBytes;       // offset of the next row
        // ... (the 3x3 weighted sum over these three rows is omitted in this partial listing)
    }
}
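As a host-side illustration of what the omitted convolution body computes, the following plain C sketch applies a 3×3 filter to the interior pixels of a grayscale image (borders are left untouched, mirroring the kernel's bounds check). The function names and the self-check helper are illustrative:

```c
#include <assert.h>

/* Host-side sketch of the 3x3 convolution kernel_guss performs on the
 * GPU: each interior pixel becomes the weighted sum of its 3x3
 * neighborhood; border pixels are skipped. */
static void conv3x3(const unsigned char *in, unsigned char *out,
                    const float *filter, int width, int height) {
    for (int irow = 1; irow < height - 1; ++irow) {
        for (int icol = 1; icol < width - 1; ++icol) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += filter[(dy + 1) * 3 + (dx + 1)] *
                           in[(irow + dy) * width + (icol + dx)];
            out[irow * width + icol] = (unsigned char)(sum + 0.5f); /* round */
        }
    }
}

/* Tiny self-check: an identity filter leaves interior pixels unchanged. */
static unsigned char demo_center(void) {
    unsigned char in[16], out[16] = {0};
    float identity[9] = {0, 0, 0,  0, 1, 0,  0, 0, 0};
    for (int i = 0; i < 16; ++i) in[i] = 10;
    conv3x3(in, out, identity, 4, 4);
    return out[1 * 4 + 1];   /* interior pixel of the 4x4 image */
}
```

On the GPU, each (icol, irow) iteration of the two outer loops corresponds to one work-item of the NDRange.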
for the solution shown in fig. 5, in some possible implementations, the reading each key frame in the key frame message queue to obtain a statistical count of the mobile object, and calculating the running speed and the acceleration of the mobile object according to the time difference between the two key frames includes:
reading each key frame in the key frame message queue to obtain the statistical count of the mobile object and the system time stamp of each key frame;
calculating the time difference of two key frames according to the system time stamp of the key frames;
and calculating the running speed and the acceleration of the moving object according to the time difference of the two key frames.
For the above implementation, in some examples, each key frame in the key frame message queue is read to obtain the statistical count of the moving object and the system time stamp of each key frame. Specifically, a key frame is a frame of image containing target information, obtained by identifying the network image data. After the key frames are acquired, different algorithms can be adopted for analysis according to different scene requirements; each key frame is stamped with a system time stamp during network image processing, with the time accurate to milliseconds. Specifically, assuming the target moves at speed v, the method for counting the number of moving objects is as follows:
Step 1: after the first key frame is obtained, the target count is increased by 1, and the pixel distance ΔL between the pixel coordinates of the current target and the pixel coordinates of the target end point in the image is calculated; the time required for the current target to move out of the camera's field of view can then be calculated as t = (ΔL × M) / v, where M is the assumed conversion coefficient from pixel distance to spatial distance.
Step 2: starting a timer, wherein the timing time is t, and the second key frame received in the timing time t is considered as a current target and is not processed;
Step 3: repeating step 1 after the timing ends.
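The three steps above amount to a timer-gated counter: a key frame starts a dwell window of length t, and further key frames inside the window are treated as the same target. A C sketch under that reading (names and millisecond units are illustrative assumptions):

```c
#include <assert.h>

/* Count moving objects from an ordered list of key-frame timestamps:
 * the first key frame increments the count and opens a window of t_ms;
 * key frames arriving inside the window are the same target and are
 * skipped; the next key frame after the window opens a new count. */
static int count_targets(const long *timestamps_ms, int n, long t_ms) {
    int count = 0;
    long window_end = -1;            /* no window open initially */
    for (int i = 0; i < n; ++i) {
        if (timestamps_ms[i] >= window_end) {   /* window expired: new target */
            ++count;
            window_end = timestamps_ms[i] + t_ms;
        }                                        /* else: same target, skip   */
    }
    return count;
}

/* Self-check: key frames at 0, 100, 300 ms fall in one 500 ms window,
 * the key frame at 900 ms starts a second target. */
static int demo_count(void) {
    long ts[] = {0, 100, 300, 900};
    return count_targets(ts, 4, 500);
}
```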
For the above implementation, in some examples, the running speed and acceleration of the moving object are calculated according to the time difference between the two key frames; specifically, the steps of calculating the running speed and acceleration of the moving object are as follows:
step 1: calibrating the camera, namely establishing a mapping relation between the coordinates of the pixels of the image and the coordinates of corresponding points in the space;
step 2: acquiring the running speed and the acceleration of a moving object;
In detail, the running speed of the moving object is obtained as:

V = (ΔL × M) / Δt

and the acceleration is:

a = ΔV / Δt

where ΔL represents the pixel distance of the moving object between key frames n+1 and n; M represents the conversion coefficient from pixel distance to spatial distance; Δt represents the time difference between the two key frames; and ΔV represents the difference in running speed between the two key frames.

The pixel distance ΔL = P(n+1) − P(n), where P(n+1) represents the pixel coordinates of the moving object in key frame n+1 (n > 0) and P(n) represents the pixel coordinates of the moving object in key frame n (n > 0); the pixel distance of the moving object between key frames n+1 and n is obtained by vector subtraction.
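Assuming consistent units (ΔL in pixels, M in spatial units per pixel, Δt in seconds), the formulas transcribe directly; the function names are illustrative:

```c
#include <assert.h>

/* V = (dL * M) / dt : running speed from pixel distance, the
 * pixel-to-space conversion coefficient and the key-frame time gap. */
static double run_speed(double dl_pixels, double m_coeff, double dt_s) {
    return (dl_pixels * m_coeff) / dt_s;
}

/* a = dV / dt : acceleration from the speed difference between two
 * consecutive key-frame pairs and the time gap. */
static double run_accel(double v_prev, double v_curr, double dt_s) {
    return (v_curr - v_prev) / dt_s;
}
```

For example, a target that moved ΔL = 10 pixels with M = 0.5 over Δt = 0.5 s has speed 10; if its speed rises from 10 to 20 over 0.5 s, its acceleration is 20.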
For the technical solution shown in fig. 5, in some possible implementations, if the running speed and acceleration of the moving object exceed the set thresholds, an alarm is given or a statistical result is given. Specifically, the calculated running speed and acceleration of the moving object are compared with the preset thresholds: if the running speed and acceleration need to be monitored and they exceed the preset thresholds, an alarm is given; if the number of moving objects needs to be counted, a statistical result is given.
According to the description of the technical solution described in fig. 5, referring to fig. 6, an operation flow 600 for implementing remote target statistics by using a GPU according to an embodiment of the present invention is shown, in order to accelerate the processing speed to the maximum extent, an accurate result is obtained, where the operation flow 600 includes four threads, which are a main thread, a network image receiving thread, a network image processing thread, and a key frame processing thread, respectively. The processing sequence among the four threads is also executed according to the sequence.
The processing of the main thread is to load image parameter information required by network image recognition, and initialize GPU equipment information according to the image parameter information. The specific operation steps of the main thread processing are as follows:
S601: creating an OpenCL device;
The OpenCL device is obtained by calling the clGetDeviceIDs function.
S602: whether the OpenCL device is successfully created; if yes, go to step S603;
S603: creating an OpenCL device context;
creating the OpenCL Device context by calling a clCreateContext function, wherein the OpenCL Device context is an operating environment defining the whole OpenCL, and comprises a Kernel function Kernel, a Device, a program object and a memory object.
S604: creating image processing parameters through a clCreateBuffer function;
and taking image parameter information such as the length-width ratio, the maximum area information and the minimum area information of an image, the correction coordinate point of the image, the three-dimensional convolution kernel information of image denoising and a binarization threshold value as input, creating a cache object through a clCreateBuffer function, and storing the image processing parameters into the cache object.
S605: creating a message command queue through clCreateCommandQueue;
an OpenCL message command queue is created by calling the clCreateCommandQueue function.
S606: creating a cl_program according to the kernel file;
Taking creating a program from a source-code buffer as an example, the specific implementation is as follows:
creating a program object: creating the program object by calling a clCreateProgramWithSource function or a clCreateProgramWithBinary function;
constructing the OpenCL program: the OpenCL program is built by calling the clBuildProgram function, and the build information of each OpenCL device is returned by calling the clGetProgramBuildInfo function;
creating a kernel object: a kernel object is created by calling the clCreateKernel function;
creating a cache object: creating the cache object by calling a clCreateBuffer function;
writing the input cache to the device side: a write command is implemented by calling the clEnqueueWriteBuffer function, and the input cache is written into the OpenCL device queue.
S607: starting a key frame processing thread;
S608: starting a network image processing thread;
S609: starting a network image receiving thread.
After the operation based on the steps is finished, the network image data acquired by the front-end camera is received from the network card through a network image receiving thread, and the specific operation steps are as follows:
S6090: monitoring network images;
The network image data is monitored so that the monitored network image data can be received.
S6091: whether a network image is monitored;
S6092: if a network image is monitored, it is written into the network image message queue.
Based on the network image message queue written by the network image receiving thread, the network image message queue is processed by the network image processing thread to obtain a key frame message queue. It may also be understood that the network image processing thread analyzes and identifies the network image of each frame to obtain the target in the image, where the target is a key frame. The specific operation steps of the network image processing thread are as follows:
S6080: storing the network image message queue, as the input of the network image processing thread, into the GPU memory;
S6081: acquiring a network RGB image;
The network RGB image may be understood as the pixel data of each frame in the network image message queue; the specific data information of the pixel data is the red, green and blue components of each pixel, where the "RGB" components correspond to color values.
S6082: the network image message queue is used as input, and is written into an OpenCL equipment queue through a clEnqueueWriteBuffer function;
S6083: creating a network image correction kernel through a clcreateKernel function so as to acquire corrected images;
Firstly, the kernel parameters of the network image correction kernel are set; the kernel parameters of the image correction kernel comprise the image correction coordinate information, the input image memory object, the output image memory object, the image width and the image height. Secondly, the clEnqueueNDRangeKernel function is called to send an instruction to the OpenCL device message command queue so as to acquire the corrected image. Specifically, the clEnqueueNDRangeKernel function places the Kernel function Kernel to be executed in the specified device message command queue; the size of the parameter global thread index space global_work_size must be specified, while the size of the parameter local thread index space local_work_size may be specified or left empty; if empty, the system automatically selects an appropriate size according to the hardware. The parameter settings are referenced as follows:
local_work_size: {32, 4}
global_work_size: {((image width + 32 - 1) / 32) × 32, ((image height + 4 - 1) / 4) × 4}
S6084: converting the corrected image into a gray scale image;
creating an image gray level conversion kernel and setting kernel parameters of the image gray level conversion kernel; the kernel parameters of the image gray level conversion kernel comprise an input image, an output image, an image width and an image height; and outputting a gray level image according to the kernel parameters of the image gray level conversion kernel.
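The per-pixel computation of such a gray-conversion kernel might use the common BT.601 luminance weights; these weights are an assumption for illustration, since the patent does not specify the conversion coefficients:

```c
#include <assert.h>

/* Per-pixel RGB-to-gray conversion. The weights 0.299/0.587/0.114
 * (BT.601) are assumed, not taken from the patent; +0.5f rounds to
 * the nearest integer before the cast. */
static unsigned char rgb_to_gray(unsigned char r, unsigned char g, unsigned char b) {
    return (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b + 0.5f);
}
```

The GPU kernel would apply this formula once per work-item, indexed by get_global_id as in the convolution kernel shown earlier.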
S6085: denoising through OpenCL Gaussian filtering according to the gray level map to obtain a noiseless image; and based on the gray level map, creating a denoising kernel and setting kernel parameters of the denoising kernel to acquire a noise-free image.
S6086: according to the noiseless image, the background noise is removed through OpenCL convolution, so that the image without the background noise is obtained;
A background-noise-removal kernel is created and its kernel parameters are set; the kernel parameters of the background-noise-removal kernel comprise the input image, the output image, the image width, the image height and a 3×3 convolution kernel; an image without background noise is output according to these kernel parameters.
S6087: according to the image without background noise, binarizing according to a threshold value to obtain a binarized image;
according to the image without background noise, a binarization kernel is created and kernel parameters of the binarization kernel are set; the kernel parameters of the binarization kernel comprise an input image, an output image, an image width, an image height and a binarization threshold value; and outputting a binarized image according to the kernel parameters of the binarized kernel.
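The per-pixel rule of the binarization kernel can be sketched as follows; the at-or-above-maps-to-white convention is an assumption, since the patent does not specify the comparison direction:

```c
#include <assert.h>

/* Threshold binarization: pixels at or above the binarization
 * threshold become white (255), all others become black (0). */
static unsigned char binarize_pixel(unsigned char v, unsigned char threshold) {
    return (unsigned char)(v >= threshold ? 255 : 0);
}
```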
S6088: according to the binarized image, the aspect ratio and the maximum and minimum area information of the image are obtained through OpenCL connected segmentation and feature extraction;
A connected segmentation kernel is created according to the binarized image and its kernel parameters are set; the kernel parameters of the connected segmentation kernel comprise the input image, the output image, the image width, the image height, the input target length and width, and the maximum and minimum area information; the target image is grabbed according to the kernel parameters, and the target aspect ratio and the maximum and minimum area information are output.
S6089: determining a key frame according to the target aspect ratio and the maximum and minimum area information;
The output target aspect ratio and maximum and minimum area information are compared with the aspect ratio and maximum and minimum area information in the image parameter information loaded for network image identification, and whether a key frame exists is judged.
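The key-frame decision in S6089 can be sketched as a range check against the loaded parameters; the function name and the inclusive-bounds reading are illustrative assumptions:

```c
#include <assert.h>

/* A detected target qualifies as a key frame when its aspect ratio and
 * area fall inside the ranges taken from the loaded image parameter
 * information (inclusive bounds assumed). */
static int is_key_frame(float aspect, int area,
                        float aspect_min, float aspect_max,
                        int area_min, int area_max) {
    return aspect >= aspect_min && aspect <= aspect_max &&
           area   >= area_min   && area   <= area_max;
}
```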
S608A: whether or not a key frame is grabbed; if yes, go to step S608B; if not, jumping to step S6081, and repeatedly executing the operation;
S608B: if the key frame exists, the key frame is written into a key frame message queue by stamping the current system time stamp.
Specifically, if the key frame exists, the key frame message queue is written in combination with the current system time. The key frames refer to pictures with target information, and each key frame is marked with a system time stamp by a network image processing thread, so that the time is accurate to milliseconds.
Based on the key frame message queue output by the network image processing thread, the key frame processing thread reads the key frames in the key frame message queue and analyzes them with different algorithms according to different scene requirements. The key frame processing thread calculates the time difference between two key frames according to the time stamp of each key frame, and calculates the running speed and acceleration of the moving object according to the time difference; the running speed and acceleration of the moving object are then compared with the set thresholds, and an alarm or a statistical result is given. The specific operation steps are as follows:
S6070: taking the key frame message queue as the input of the key frame processing thread;
S6071: reading each key frame in the key frame message queue;
S6072: counting the moving objects according to the key frames;
S6073: calculating the target running speed and acceleration according to the time difference of the two key frames;
the calculation method of the target running speed and the target acceleration is described in the foregoing, and will not be described herein.
S6074: and comparing the target running speed and the acceleration with a set threshold value, and informing whether to alarm or not.
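S6074 can be sketched as a predicate; the text's "running speed and acceleration exceed the set thresholds" is read conjunctively here (both must exceed), which is an interpretive assumption, and the names are illustrative:

```c
#include <assert.h>

/* Compare the computed speed and acceleration with the configured
 * thresholds and report whether an alarm should be raised; both
 * quantities must exceed their thresholds under this reading. */
static int should_alarm(double speed, double accel,
                        double speed_thresh, double accel_thresh) {
    return speed > speed_thresh && accel > accel_thresh;
}
```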
Based on the same inventive concept as the foregoing technical solution, and referring to fig. 7, an apparatus 700 for implementing remote target statistics using a GPU is shown. The apparatus 700 includes: a receiving portion 701, an identification portion 702, a calculating portion 703, and an alarm portion 704; wherein,
the receiving portion 701 is configured to receive the network image data of the moving object collected by the front-end camera to generate a network image message queue, and to distribute the network image message queue to the corresponding GPU in a load balancing manner;
the identification portion 702 is configured to perform image analysis and identification on the network image message queue using GPU parallel processing, so as to obtain a key frame message queue and mark each key frame in the key frame message queue with a system timestamp;
the calculating portion 703 is configured to read each key frame in the key frame message queue to obtain a statistical count of the moving object, and to calculate the running speed and acceleration of the moving object according to the time difference between two key frames;
the alarm portion 704 is configured to give an alarm or a statistical result if the running speed and acceleration of the moving object exceed the set thresholds.
In some examples, the receiving portion 701 is configured to:
capturing the network image data through a network card and writing it into a network image message queue;
and distributing the network image message queues to the corresponding GPUs in a load balancing manner according to the number of network image message queues and the number of GPUs.
In some examples, the identification portion 702 is configured to:
write the network image message queue into a GPU memory as a memory object;
issue the kernel function for processing the network image message queue to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue;
and call the network image message queue in the GPU memory, analyze and identify it using GPU parallel processing to obtain a key frame message queue, and mark each key frame in the key frame message queue with a system timestamp.
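The identification pipeline above can be sketched as follows. This is an illustrative Python sketch only: worker threads stand in for the per-GPU OpenCL command queues, and `analyze_frame` stands in for the analysis kernel. A real implementation would use the OpenCL API (e.g. `clCreateBuffer` to write the memory object and `clEnqueueNDRangeKernel` to dispatch the kernel), which is not reproduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def analyze_frame(frame):
    # Stand-in for the OpenCL kernel: keep only frames that contain target info.
    return frame if frame.get("has_target") else None

def run_identification(network_queue, num_gpus=2):
    """Dispatch frame analysis in parallel and timestamp the resulting key frames.

    network_queue: iterable of frame dicts (stand-in for the network image
    message queue already written to GPU memory).
    """
    key_frames = []
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        for result in pool.map(analyze_frame, network_queue):
            if result is not None:
                # Mark each key frame with a system timestamp in milliseconds.
                result["timestamp_ms"] = int(time.time() * 1000)
                key_frames.append(result)
    return key_frames
```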
In some examples, the calculating portion 703 is configured to:
read each key frame in the key frame message queue to obtain the statistical count of the moving object and the system timestamp of each key frame;
calculate the time difference between two key frames from their system timestamps;
and calculate the running speed and acceleration of the moving object from the time difference between the two key frames.
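The computation in this part reduces to the standard kinematic relations. Here \(\Delta t\) is the time difference obtained from the two timestamps, and \(\Delta s\) is the object's displacement between the two key frames; the text does not specify how the displacement is measured, so it is left symbolic:

```latex
v = \frac{\Delta s}{\Delta t}, \qquad a = \frac{v_2 - v_1}{\Delta t}
```

where \(v_1\) and \(v_2\) are the speeds obtained from two consecutive key-frame pairs.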
Referring to fig. 8, another apparatus 700 for implementing remote target statistics using a GPU is shown, in some examples, the apparatus 700 further comprising a creation portion 705 and an initialization portion 706, wherein,
The creation portion 705 is configured to load the image parameter information required for network image recognition and to create, from the kernel file, a kernel function that runs on the GPU; the image parameter information includes at least the aspect ratio of the image and the maximum and minimum area information;
the initialization portion 706 is configured to initialize the device information of the GPU according to the image parameter information.
It should be understood that the exemplary technical solution of the apparatus 700 for implementing remote target statistics using a GPU belongs to the same concept as the technical solution of the method for implementing remote target statistics using a GPU described above. Therefore, for any details of the apparatus 700 not described here, reference may be made to the description of the method above; they are not repeated in this embodiment.
It will be appreciated that the technical solution shown in fig. 5 and its examples may be implemented in the form of hardware or in the form of software functional modules; the embodiments of the present invention are implemented in the form of software functional modules. If implemented as software functional parts and not sold or used as a separate product, they may be stored on a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present embodiment, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored on a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in the present embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code. Accordingly, the present embodiment provides a computer storage medium storing a program for implementing remote target statistics using a GPU; when executed by at least one processor, the program implements the steps of the method for implementing remote target statistics using a GPU in the above technical solution.
The foregoing is merely illustrative of the embodiments of the present invention and does not limit its protection scope; any variation or substitution readily conceived by a person skilled in the art within the technical scope disclosed herein shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for implementing remote target statistics using a GPU, the method comprising:
receiving network image data of a moving object acquired by a front-end camera to generate a network image message queue, and distributing the network image message queue to a corresponding GPU in a load balancing manner;
performing image analysis and identification on the network image message queue by adopting GPU parallel processing through an OpenCL API so as to acquire a key frame message queue and marking a system time stamp on each key frame in the key frame message queue;
reading each key frame in the key frame message queue to obtain a statistical count of the moving object, and calculating the running speed and the acceleration of the moving object according to the time difference between the two key frames;
and if the running speed and the acceleration of the moving object exceed the set threshold values, giving an alarm or a statistical result.
2. The method according to claim 1, characterized in that the method further comprises:
loading image parameter information required by network image identification and creating a kernel function running on the GPU according to the kernel file; wherein, the image parameter information at least comprises the length-width ratio of the image and the maximum and minimum area information;
initializing equipment information of the GPU according to the image parameter information.
3. The method according to claim 1, wherein the receiving network image data of the moving object collected by the front-end camera to generate a network image message queue, and distributing the network image message queue to the corresponding GPU by means of load balancing, includes:
capturing the network image data through a network card and writing the network image data into a network image message queue;
and distributing the network image message queues to the corresponding GPUs in a load balancing manner according to the number of the network image message queues and the number of GPUs.
4. The method of claim 1, wherein performing image analysis and recognition on the network image message queue through the OpenCL API using GPU parallel processing to obtain a key frame message queue and system time stamping each key frame in the key frame message queue comprises:
writing the network image message queue into a GPU memory as a memory object;
issuing the kernel function for processing the network image message queue to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue;
and calling the network image message queue in the GPU memory, analyzing and identifying it by adopting GPU parallel processing to obtain a key frame message queue, and marking a system time stamp on each key frame in the key frame message queue.
5. The method of claim 1, wherein said reading each key frame in said key frame message queue to obtain a statistical count of said moving object and calculating an operating speed and acceleration of said moving object based on a time difference of two said key frames comprises:
reading each key frame in the key frame message queue to obtain the statistical count of the moving object and the system time stamp of each key frame;
calculating the time difference of two key frames according to the system time stamp of the key frames;
and calculating the running speed and the acceleration of the moving object according to the time difference of the two key frames.
6. An apparatus for implementing remote target statistics using a GPU, the apparatus comprising: a receiving part, an identifying part, a calculating part and an alarming part; wherein,
The receiving part is configured to receive the network image data of the moving object acquired by the front-end camera to generate a network image message queue, and distribute the network image message queue to the corresponding GPU in a load balancing manner;
the identification part is configured to perform image analysis and identification on the network image message queue through an OpenCL API by adopting GPU parallel processing so as to acquire a key frame message queue and mark each key frame in the key frame message queue with a system time stamp;
the calculating part is configured to read each key frame in the key frame message queue to acquire the statistical count of the moving object, and calculate the running speed and the acceleration of the moving object according to the time difference of the two key frames;
the alarm part is configured to alarm or give a statistical result if the running speed and the acceleration of the moving object exceed set thresholds.
7. The apparatus of claim 6, further comprising a creation portion and an initialization portion, wherein,
the creation part is configured to load image parameter information required by network image identification and create a kernel function running on the GPU according to the kernel file; wherein, the image parameter information at least comprises the length-width ratio of the image and the maximum and minimum area information;
The initialization section is configured to initialize device information of the GPU according to the image parameter information.
8. The apparatus of claim 6, wherein the identification portion is configured to:
writing the network image message queue into a GPU memory as a memory object;
issuing the kernel function for processing the network image message queue to one or more GPUs in the OpenCL device queue through the OpenCL device message command queue;
and calling a network image message queue in the GPU memory, analyzing and identifying by adopting GPU parallel processing to obtain a key frame message queue, and marking a system time stamp on each key frame in the key frame message queue.
9. A computing device, the computing device comprising: a communication interface, a processor, a memory; the components are coupled together by a bus system; wherein,
the communication interface is used for receiving and transmitting signals in the process of receiving and transmitting information with other external network elements;
the memory is used for storing a computer program capable of running on the processor;
the processor, when executing the computer program, performs the steps of the method for implementing remote target statistics using a GPU as claimed in any of claims 1 to 5.
10. A computer storage medium storing a program for implementing remote target statistics using a GPU, the program for implementing remote target statistics using a GPU implementing the steps of the method for implementing remote target statistics using a GPU as claimed in any of claims 1 to 5 when executed by at least one processor.
CN202211462183.XA 2022-11-17 2022-11-17 Method, device and storage medium for realizing remote target statistics by using GPU Active CN115511886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211462183.XA CN115511886B (en) 2022-11-17 2022-11-17 Method, device and storage medium for realizing remote target statistics by using GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211462183.XA CN115511886B (en) 2022-11-17 2022-11-17 Method, device and storage medium for realizing remote target statistics by using GPU

Publications (2)

Publication Number Publication Date
CN115511886A CN115511886A (en) 2022-12-23
CN115511886B true CN115511886B (en) 2023-04-28

Family

ID=84514298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462183.XA Active CN115511886B (en) 2022-11-17 2022-11-17 Method, device and storage medium for realizing remote target statistics by using GPU

Country Status (1)

Country Link
CN (1) CN115511886B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686797A (en) * 2021-01-12 2021-04-20 西安芯瞳半导体技术有限公司 Target frame data acquisition method and device for GPU (graphics processing Unit) function verification and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8819172B2 (en) * 2010-11-04 2014-08-26 Digimarc Corporation Smartphone-based methods and systems
CN108540822A (en) * 2018-04-04 2018-09-14 南京信安融慧网络技术有限公司 A kind of key frame of video extraction acceleration system and its extracting method based on OpenCL
CN112406707B (en) * 2020-11-24 2022-10-21 上海高德威智能交通系统有限公司 Vehicle early warning method, vehicle, device, terminal and storage medium


Also Published As

Publication number Publication date
CN115511886A (en) 2022-12-23

Similar Documents

Publication Publication Date Title
US10600167B2 (en) Performing spatiotemporal filtering
US10565747B2 (en) Differentiable rendering pipeline for inverse graphics
US11113790B2 (en) Adding greater realism to a computer-generated image by smoothing jagged edges
CN110084875B (en) Using a compute shader as a front-end for a vertex shader
US11836597B2 (en) Detecting visual artifacts in image sequences using a neural network model
US10964000B2 (en) Techniques for reducing noise in video
US10055883B2 (en) Frustum tests for sub-pixel shadows
US9305324B2 (en) System, method, and computer program product for tiled deferred shading
CN111143174A (en) Optimal operating point estimator for hardware operating under shared power/thermal constraints
US11941752B2 (en) Streaming a compressed light field
US10810784B1 (en) Techniques for preloading textures in rendering graphics
CN115408227B (en) GPU state debugging method and device based on self-research display card and storage medium
CN111161398B (en) Image generation method, device, equipment and storage medium
US11501467B2 (en) Streaming a light field compressed utilizing lossless or lossy compression
US11847733B2 (en) Performance of ray-traced shadow creation within a scene
CN115509764B (en) Real-time rendering multi-GPU parallel scheduling method and device and memory
CN114078077A (en) Assessing qualitative streaming experience using session performance metadata
CN114529658A (en) Graph rendering method and related equipment thereof
US11010963B2 (en) Realism of scenes involving water surfaces during rendering
CN115511886B (en) Method, device and storage medium for realizing remote target statistics by using GPU
Fu et al. Dynamic shadow rendering with shadow volume optimization
Calı et al. Performance analysis of Roberts edge detection using CUDA and OpenGL
JP2016527631A (en) Histogram calculation system and method using graphics processing unit
DE102022108018A1 (en) IMPROVED TEMPORARY NOISE REMOVAL QUALITY IN DYNAMIC SCENES
CN115427933A (en) Memory bandwidth limitation for virtual machines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room T1 301, Taiwei Smart Chain Center, No. 8 Tangyan South Road, Shaanxi High tech Zone, Xi'an City, Shaanxi Province, 710065

Patentee after: Nanjing Sietium Semiconductor Co.,Ltd.

Address before: 265503 No. 402, No. 7, No. 300, Changjiang Road, economic and Technological Development Zone, Yantai City, Shandong Province

Patentee before: Yantai Xintong Semiconductor Technology Co.,Ltd.
