CN112906478A - Target object identification method, device, equipment and storage medium - Google Patents

Target object identification method, device, equipment and storage medium

Info

Publication number
CN112906478A
Authority
CN
China
Prior art keywords
detection frame
target object
buffer pool
embedded feature
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110088893.XA
Other languages
Chinese (zh)
Other versions
CN112906478B (en)
Inventor
龚震霆
徐志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110088893.XA priority Critical patent/CN112906478B/en
Publication of CN112906478A publication Critical patent/CN112906478A/en
Application granted granted Critical
Publication of CN112906478B publication Critical patent/CN112906478B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Abstract

The present disclosure provides a target object identification method, apparatus, device and storage medium, and relates to the field of artificial intelligence, in particular to computer vision and deep learning. A specific implementation includes: acquiring a video frame image; and identifying a target object according to a detection frame of the target object detected in the video frame image, an embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool, where the embedded feature corresponding to the detection frame is obtained from the detection frame using a classification network model with an embedding layer. The method enables more accurate and intelligent identification of the target object.

Description

Target object identification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of computer vision and deep learning technology.
Background
Canine management is a common problem in urban governance and an important part of safe city development. In the related art, dog management mostly relies on manual monitoring, patrols and reports from the public, which is inefficient and provides low coverage; relevant smart city projects therefore raise a need for dog identification.
In the related art, the way the detection frame of a detected dog is displayed is not intelligent enough and cannot meet the needs of dog management work.
Disclosure of Invention
The present disclosure provides a target object identification method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a target object identification method, including:
acquiring a video frame image;
identifying a target object according to a detection frame of the target object detected in the video frame image, the embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
According to another aspect of the present disclosure, there is provided an apparatus for identifying a target object, including:
the acquisition module is used for acquiring a video frame image;
the processing module is used for identifying a target object according to a detection frame of the target object detected in the video frame image, the embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of identifying a target object provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of identifying a target object provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of identification of a target object provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a target object identification method provided by the present disclosure;
FIG. 2 is a block diagram of a target object recognition apparatus provided by the present disclosure;
fig. 3 is a block diagram of an electronic device for implementing a method of identifying a target object of an embodiment of the present disclosure.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
AI (Artificial Intelligence) is a technical discipline that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. AI technology is now widely applied thanks to its high degree of automation, high accuracy and low cost.
CV (Computer Vision) is the science of making artificial systems "perceive" from images or multidimensional data: cameras and computers are used in place of human eyes to identify, track and measure targets, and the results are further processed into images better suited for human observation or for transmission to instruments for detection.
DL (Deep Learning) is a relatively new research direction in the field of ML (Machine Learning). It learns the intrinsic regularities and representation levels of sample data, so that machines can gain an analysis and learning capability similar to that of humans and can recognize data such as text, images and sounds; it is widely applied in speech and image recognition.
Referring to fig. 1, fig. 1 is a schematic flow chart of a target object identification method provided by the present disclosure, which includes:
S101, acquiring a video frame image; the video frame image can be extracted from video captured by a camera device monitoring a scene;
s102, identifying a target object according to a detection frame of the target object detected in the video frame image, an embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
In this embodiment, the embedded feature of the detection frame in the video frame image is obtained by a classification network model to which an embedding layer is added, so that the target object can be identified more accurately; in particular, the detection frame of a detected dog can be displayed more intelligently.
In the above embodiments of the present disclosure, the classification network model with the embedded layer is obtained by training according to a data set, where the data set includes a detection box of a target object.
Optionally, the data set includes: crop pictures of the real (ground-truth) contours corresponding to canine target objects.
In an alternative embodiment of the present disclosure, the canine public data set Stanford Dogs Dataset is used as Data_SDD. Surveillance videos from different monitored locations, such as city streets and highways, are collected, and the videos containing dogs are screened out as Data_Videos. Since roughly 90% of the frames within a video are similar and there is a large amount of redundancy, a small fixed number of frames is randomly extracted from each video to form the data set Data_Frames. Data_Frames is manually checked and screened, and the frames containing dogs are kept, yielding the data set Data_wDog. For each picture in Data_wDog, the ground truth label of each dog is annotated manually. For each picture in Data_wDog, the crop picture corresponding to the ground truth (the real contour) is cut out to form Data_gtDogCrop; at the same time, rectangular crop pictures are randomly cut from the background of the monitored scene such that their IOU (Intersection over Union, a standard measure of how accurately an object is detected on a given data set) with the ground truth is smaller than 0.15, forming Data_bgRandCrop.
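A minimal sketch, assuming NumPy-style image arrays, of how the Data_gtDogCrop and Data_bgRandCrop sets described above could be built from one annotated frame. Only the roles of the two sets and the 0.15 IOU rule come from the description; the helper names, crop size and sampling strategy are illustrative assumptions.

import random

def iou(box_a, box_b):
    # Intersection over Union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def build_crops(frame, gt_boxes, num_negatives=5, crop_size=(128, 128)):
    # Positives (Data_gtDogCrop): crops of the annotated ground-truth boxes.
    # Negatives (Data_bgRandCrop): random background crops whose IOU with
    # every ground-truth box is below 0.15.
    h, w = frame.shape[:2]
    positives = [frame[y1:y2, x1:x2] for (x1, y1, x2, y2) in gt_boxes]
    negatives, attempts = [], 0
    cw, ch = crop_size
    while len(negatives) < num_negatives and attempts < 100 * num_negatives:
        attempts += 1
        x1 = random.randint(0, max(0, w - cw))
        y1 = random.randint(0, max(0, h - ch))
        cand = (x1, y1, x1 + cw, y1 + ch)
        if all(iou(cand, gt) < 0.15 for gt in gt_boxes):
            negatives.append(frame[y1:y1 + ch, x1:x1 + cw])
    return positives, negatives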
In an alternative embodiment of the disclosure, the classification network model with an embedding layer is a residual network with an embedding layer, which is located before a fully connected layer of the residual network.
Here, based on a general-purpose classification pre-trained model, a 512-dimensional embedding layer is added before the last fully-connected layer to obtain the classification network model Embedding512_Net. Of course, the dimension of the embedding layer is not limited to 512; it may also be 256, 128, etc., chosen according to the actual situation. The model Embedding512_Net is first fine-tuned on Data_SDD; then, taking the data set Data_gtDogCrop obtained from the monitored scene as positive samples and Data_bgRandCrop as negative samples, Embedding512_Net is fine-tuned further. The underlying model here may be, but is not limited to, a ResNet residual network model.
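The following PyTorch sketch shows one way such an Embedding512_Net could be assembled: a 512-dimensional embedding layer inserted before the final fully-connected layer of a pre-trained residual network. The choice of torchvision's resnet50, the two-class (dog vs. background) head and the newer torchvision weights API are assumptions; only the 512-dimensional embedding before the classifier follows the description.

import torch
import torch.nn as nn
from torchvision import models  # requires a recent torchvision for the weights API

class Embedding512Net(nn.Module):
    # Classification network with a 512-d embedding layer inserted before
    # the final fully-connected layer of a pre-trained residual network.
    def __init__(self, embedding_dim=512, num_classes=2):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        in_features = backbone.fc.in_features          # 2048 for resnet50
        backbone.fc = nn.Identity()                    # drop the original classifier
        self.backbone = backbone
        self.embedding = nn.Linear(in_features, embedding_dim)   # embedding layer
        self.fc = nn.Linear(embedding_dim, num_classes)          # new classifier head

    def forward(self, x, return_embedding=False):
        feat = self.backbone(x)
        emb = self.embedding(feat)
        if return_embedding:
            return emb        # used at inference to describe a detection-box crop
        return self.fc(emb)   # used while fine-tuning on the crop data sets

# Usage: extract the embedding feature of a resized detection-box crop.
model = Embedding512Net().eval()
crop = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(crop, return_embedding=True)    # shape (1, 512)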
In an optional embodiment of the present disclosure, step S102 may include:
s1021, maintaining the embedded feature buffer pool with the first duration and the detection frame buffer pool with the second duration;
s1022, setting the Nth time period TNPlacing the Nth detection frame of the target object detected in the video frame image into the detection frame buffer pool;
s1023, the N detection frame is placed into the embedding feature buffer pool by using the N embedding feature obtained by the classification network model with the embedding layer;
s1024, acquiring next time period TN+1An N +1 th embedded feature obtained by processing an N +1 th detection frame of a target object detected in an inner video frame image by using the classification network model with the embedded layer;
s1025, obtaining the similarity between the N +1 th embedded feature and all embedded features in the embedded feature buffer pool;
and S1026, identifying the target object according to the similarity, wherein N is an integer greater than or equal to 1, and repeating the process until the target object is finally identified.
In S1026, identifying the target object according to the similarity includes:
s10261, if the embedded feature buffer pool has the embedded feature with the similarity to the (N + 1) th embedded feature larger than a preset value, determining that the target object appears, and not placing the (N + 1) th embedded feature into the embedded feature buffer pool;
s10262, if there is no embedding feature with similarity greater than a preset value with the N +1 th embedding feature in the embedding feature buffer pool, determining that the target object does not appear, placing the N +1 th embedding feature into the embedding feature buffer pool, and identifying the target object according to the intersection ratio of the N +1 th detection frame and the detection frames in the detection frame buffer pool.
In step S10262, identifying the target object according to the intersection-over-union of the (N+1)th detection frame with the detection frames in the detection frame buffer pool includes:
S102621, acquiring the intersection-over-union between the (N+1)th detection frame and the detection frames in the detection frame buffer pool;
S102622, if there is a target detection frame in the detection frame buffer pool whose intersection-over-union with the (N+1)th detection frame is greater than a preset threshold, determining that the target object corresponding to the (N+1)th detection frame and the target object corresponding to the target detection frame are the same target object and not placing the (N+1)th detection frame in the detection frame buffer pool; otherwise, placing the (N+1)th detection frame in the detection frame buffer pool.
In the above embodiment, repeating the above process until the target object is finally identified includes:
when the first time length of the embedded feature buffer pool is reached, emptying the embedded feature buffer pool, and repeating the operation of whether to place the (N + 1) th embedded feature into the embedded feature buffer pool or not;
and when the second duration of the detection frame buffer pool is reached, emptying the detection frame buffer pool, and repeating the operation of whether the (N + 1) th detection frame is placed in the detection frame buffer pool or not until the target object is finally identified.
An example implementation of this embodiment is as follows (a code sketch that pulls these steps together is given after the walk-through):
1) Two buffer pools are maintained: an embedding_pool with a lifetime of 3 seconds and a bbox_pool (detection frame buffer pool) with a lifetime of 5 seconds. The embedding_pool is not persistent and is emptied every 3 seconds, while entries in the bbox_pool persist with a natural expiration time of 5 seconds. Here 3 seconds is the first duration mentioned above and 5 seconds is the second duration; the first duration is not limited to 3 seconds, and the second duration is not limited to 5 seconds.
2) One frame per second is extracted from the video of the monitored scene.
3) In the first second, if a dog dog1 is detected, its detection frame is taken as dog1_bbox. The embedding layer of the model Embedding512_Net extracts the feature dog1_embedding from the crop picture dog1_bboxCrop corresponding to dog1_bbox; dog1_embedding is put into the embedding_pool, and dog1_bbox is put into the bbox_pool.
4) In the second second, if a dog dog2 is detected, its detection frame is taken as dog2_bbox. The embedding layer of the model Embedding512_Net extracts the feature dog2_embedding from the crop picture dog2_bboxCrop corresponding to dog2_bbox, and dog2_embedding is compared for similarity with all embeddings in the embedding_pool:
if the pool contains an embedding with similarity greater than 80%, dog2 is considered to have already appeared, and dog2_embedding is not put into the embedding_pool (80% is only illustrative; the threshold is not limited to 80%);
if the pool contains no embedding with similarity greater than 80%, dog2 is considered, in theory, not to have appeared before; dog2_embedding is put into the embedding_pool, and then the following operations are carried out:
the IOU of dog2_bbox with every bbox in the bbox_pool is compared; if there is a bbox with IOU greater than 0.8, dog2_bbox is theoretically the same dog as that bbox and dog2_bbox does not enter the bbox_pool (there is a small defect here: dog2_bbox may actually be a new dog and be filtered out, but the probability is small); at the same time, the bbox with IOU greater than 0.8 is given a continuous hit, i.e. its natural expiration time is reset and the 5-second count starts again from this second;
if there is no bbox in the pool with IOU (intersection-over-union) greater than 0.8, dog2_bbox is considered to be a new dog, or a previously detected dog that has moved away from its previous position, and dog2_bbox is added to the bbox_pool; the threshold is not limited to 0.8 and may take other values.
5) In the third second, if a dog dog3 is detected, the operation is the same as in the second second.
6) In the fourth second, the embedding_pool is emptied; if a dog dog4 is detected, dog4_embedding is added to the embedding_pool, dog4_bbox is then compared with all bboxes in the bbox_pool, and the subsequent handling of dog4_bbox is the same as in the second second.
7) In the fifth second, if a dog dog5 is detected, the operation is the same as in the second second.
8) In the sixth second, the bboxes in the bbox_pool whose natural expiration time was not reset are removed; if a dog dog5 is detected, dog5_embedding and dog5_bbox are handled as in the second second. The above operations are then repeated.
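The sketch below pulls the steps of this walk-through into one place. The 3-second and 5-second lifetimes, the 80% similarity threshold, the 0.8 IOU threshold and the continuous-hit behaviour follow the example above; the cosine-similarity measure, the data structures and every name are assumptions made for illustration.

import time
import numpy as np

EMB_POOL_LIFETIME = 3.0    # first duration: embedding_pool is cleared every 3 s
BBOX_LIFETIME = 5.0        # second duration: a bbox naturally expires after 5 s
SIM_THRESHOLD = 0.8        # embedding similarity threshold (80% in the example)
IOU_THRESHOLD = 0.8        # detection frame IOU threshold

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class TargetObjectIdentifier:
    def __init__(self):
        self.embedding_pool = []                 # embeddings, cleared every 3 s
        self.emb_pool_started = time.time()
        self.bbox_pool = []                      # (bbox, expire_time) pairs

    def process(self, bbox, embedding, now=None):
        # Handle one detection from one sampled frame; returns True when the
        # detection is treated as a newly appearing target object.
        now = time.time() if now is None else now

        # Maintain the two pools (first and second durations).
        if now - self.emb_pool_started >= EMB_POOL_LIFETIME:
            self.embedding_pool.clear()
            self.emb_pool_started = now
        self.bbox_pool = [(b, t) for (b, t) in self.bbox_pool if t > now]

        # Compare against all embeddings in the embedding pool.
        if any(cos_sim(embedding, e) > SIM_THRESHOLD for e in self.embedding_pool):
            return False                         # the object has already appeared

        # Not matched by embedding: record it and fall back to the IOU check.
        self.embedding_pool.append(embedding)
        for i, (b, _) in enumerate(self.bbox_pool):
            if iou(bbox, b) > IOU_THRESHOLD:
                # Same object at roughly the same position: the new bbox is not
                # added, and the matched bbox gets a continuous hit (its 5 s
                # expiration time is reset).
                self.bbox_pool[i] = (b, now + BBOX_LIFETIME)
                return False

        # A new object, or a known one that moved away from its previous position.
        self.bbox_pool.append((bbox, now + BBOX_LIFETIME))
        return True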
The embodiments of the present disclosure can be applied to smart city projects and are of great help to dog management in city planning and governance. The method can identify target objects, particularly canine target objects, more accurately, and can display the detection frames of canine target objects more intelligently.
As shown in fig. 2, an embodiment of the present disclosure further provides an apparatus 200 for identifying a target object, including:
an obtaining module 201, configured to obtain a video frame image;
a processing module 202, configured to identify a target object according to a detection frame of the target object detected in the video frame image, an embedded feature corresponding to the detection frame, and a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
Optionally, the classification network model with the embedded layer is obtained by training according to a data set, where the data set includes a detection box of a target object.
Optionally, the data set includes: crop pictures of the real (ground-truth) contours corresponding to canine target objects.
Optionally, the classification network model with an embedding layer is a residual network with an embedding layer, and the embedding layer is located before a fully-connected layer of the residual network.
Optionally, the processing module 202 includes:
the first processing unit is used for maintaining an embedded characteristic buffer pool with a first duration and a detection frame buffer pool with a second duration;
a second processing unit, configured to place, into the detection frame buffer pool, the Nth detection frame of the target object detected in the video frame image within the Nth time period TN;
a third processing unit, configured to place the nth embedded feature, obtained by processing the nth detection frame using the classification network model with the embedded layer, into the embedded feature buffer pool;
a fourth processing unit, configured to acquire the (N+1)th embedded feature obtained by processing, using the classification network model with the embedded layer, the (N+1)th detection frame of the target object detected in the video frame image within the next time period TN+1;
a fifth processing unit, configured to obtain similarities between the N +1 th embedded feature and all embedded features in the embedded feature buffer pool;
and the sixth processing unit is used for identifying the target object according to the similarity, wherein N is an integer greater than or equal to 1, and the process is repeated until the target object is finally identified.
Optionally, the sixth processing unit includes: the first processing subunit is configured to, if the embedded feature buffer pool has an embedded feature with a similarity greater than a preset value to the (N + 1) th embedded feature, determine that the target object has appeared, and not place the (N + 1) th embedded feature in the embedded feature buffer pool; and the second processing subunit is configured to, if there is no embedded feature in the embedded feature buffer pool whose similarity to the (N + 1) -th embedded feature is greater than a preset value, determine that the target object has not appeared, place the (N + 1) -th embedded feature into the embedded feature buffer pool, and identify the target object according to an intersection-and-merge ratio of the (N + 1) -th detection frame and a detection frame in the detection frame buffer pool.
Optionally, the second processing subunit is specifically configured to: acquire the intersection-over-union between the (N+1)th detection frame and the detection frames in the detection frame buffer pool; if there is a target detection frame in the detection frame buffer pool whose intersection-over-union with the (N+1)th detection frame is greater than a preset threshold, determine that the target object corresponding to the (N+1)th detection frame and the target object corresponding to the target detection frame are the same target object and do not place the (N+1)th detection frame in the detection frame buffer pool; otherwise, place the (N+1)th detection frame in the detection frame buffer pool.
Optionally, the sixth processing unit includes: a third processing subunit, configured to, when the first duration of the embedded feature buffer pool arrives, empty the embedded feature buffer pool, and repeat an operation of whether to place an N +1 th embedded feature in the embedded feature buffer pool; and the fourth processing subunit is configured to, when the second duration of the detection frame buffer pool reaches, empty the detection frame buffer pool, and repeat an operation of whether to place the (N + 1) th detection frame in the detection frame buffer pool until the target object is finally identified.
It should be noted that the apparatus is an apparatus corresponding to the above method, and all implementations of the above method are applicable to the embodiment of the apparatus, and the same technical effects can be achieved.
The present disclosure also provides an electronic device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.
According to embodiments of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions and a computer program product are also provided.
FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the device 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 302 or a computer program loaded from a storage unit 308 into a random access memory (RAM) 303. The RAM 303 can also store various programs and data required for the operation of the device 300. The computing unit 301, the ROM 302 and the RAM 303 are connected to one another via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 301 executes the methods and processes described above, such as the target object identification method. For example, in some embodiments, the target object identification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 308.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the target object identification method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the target object identification method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The service end can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service (Virtual Private Server, or VPS for short). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of identifying a target object, comprising:
acquiring a video frame image;
identifying a target object according to a detection frame of the target object detected in the video frame image, the embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
2. The method for identifying a target object according to claim 1, wherein the classification network model with the embedded layer is trained according to a data set, and the data set comprises a detection box of the target object.
3. The target object identification method of claim 2, wherein the data set comprises: crop pictures of the real (ground-truth) contours corresponding to canine target objects.
4. The method of identifying a target object according to claim 1 or 2, wherein the classification network model with an embedding layer is a residual network with an embedding layer, the embedding layer preceding a fully connected layer of the residual network.
5. The method for identifying the target object according to claim 1, wherein identifying the target object according to the detection frame of the target object detected in the video frame image, the embedded feature corresponding to the detection frame, and the preset embedded feature buffer pool and detection frame buffer pool comprises:
maintaining an embedded feature buffer pool of a first duration and a detection frame buffer pool of a second duration;
placing, into the detection frame buffer pool, the Nth detection frame of the target object detected in the video frame image within the Nth time period TN;
placing the Nth embedded feature obtained by processing the Nth detection frame by using the classification network model with the embedded layer into the embedded feature buffer pool;
obtaining the (N+1)th embedded feature that results from processing, using the classification network model with the embedded layer, the (N+1)th detection frame of the target object detected in the video frame image within the next time period TN+1;
acquiring the similarity between the N +1 th embedded feature and all embedded features in the embedded feature buffer pool;
and identifying the target object according to the similarity, wherein N is an integer greater than or equal to 1, and repeating the process until the target object is finally identified.
6. The target object identification method of claim 5, wherein identifying the target object according to the similarity comprises:
if the embedded feature buffer pool has embedded features with the similarity with the (N + 1) th embedded feature being greater than a preset value, determining that the target object appears, and not placing the (N + 1) th embedded feature into the embedded feature buffer pool;
if the embedding feature with the similarity larger than a preset value with the N +1 th embedding feature does not exist in the embedding feature buffer pool, determining that the target object does not exist, placing the N +1 th embedding feature into the embedding feature buffer pool, and identifying the target object according to the intersection and combination ratio of the N +1 th detection frame and the detection frames in the detection frame buffer pool.
7. The method for identifying the target object according to claim 6, wherein identifying the target object according to the intersection-to-parallel ratio of the N + 1-th detection frame and the detection frames in the detection frame buffer pool comprises:
acquiring the intersection ratio of the (N + 1) th detection frame;
if the intersection ratio of the target detection frames in the detection frame buffer pool is larger than the intersection ratio of the (N + 1) th detection frame, determining that the target object corresponding to the (N + 1) th detection frame and the target object corresponding to the target detection frame are the same target object, not placing the (N + 1) th detection frame in the detection frame buffer pool, otherwise, placing the (N + 1) th detection frame in the detection frame buffer pool.
8. The target object identification method of claim 6, wherein repeating the above process until the target object is finally identified comprises:
when the first time length of the embedded feature buffer pool is reached, emptying the embedded feature buffer pool, and repeating the operation of whether to place the (N + 1) th embedded feature into the embedded feature buffer pool or not;
and when the second duration of the detection frame buffer pool is reached, emptying the detection frame buffer pool, and repeating the operation of whether the (N + 1) th detection frame is placed in the detection frame buffer pool or not until the target object is finally identified.
9. An apparatus for identifying a target object, comprising:
the acquisition module is used for acquiring a video frame image;
the processing module is used for identifying a target object according to a detection frame of the target object detected in the video frame image, the embedded feature corresponding to the detection frame, a preset embedded feature buffer pool and a preset detection frame buffer pool; the embedding characteristics corresponding to the detection frame are obtained according to the detection frame and a classification network model with an embedding layer.
10. The apparatus of claim 9, wherein the classification network model with embedded layers is trained from a data set that includes detection boxes for target objects.
11. The apparatus of claim 10, wherein the data set comprises: crop pictures of the real (ground-truth) contours corresponding to canine target objects.
12. The apparatus of claim 9 or 10, wherein the classification network model with an embedding layer is a residual network with an embedding layer that precedes a fully connected layer of the residual network.
13. The apparatus of claim 9, wherein the processing module comprises:
the first processing unit is used for maintaining an embedded characteristic buffer pool with a first duration and a detection frame buffer pool with a second duration;
a second processing unit, configured to place, into the detection frame buffer pool, the Nth detection frame of the target object detected in the video frame image within the Nth time period TN;
a third processing unit, configured to place the nth embedded feature, obtained by processing the nth detection frame using the classification network model with the embedded layer, into the embedded feature buffer pool;
a fourth processing unit, configured to acquire the (N+1)th embedded feature obtained by processing, using the classification network model with the embedded layer, the (N+1)th detection frame of the target object detected in the video frame image within the next time period TN+1;
a fifth processing unit, configured to obtain similarities between the N +1 th embedded feature and all embedded features in the embedded feature buffer pool;
and the sixth processing unit is used for identifying the target object according to the similarity, wherein N is an integer greater than or equal to 1, and the process is repeated until the target object is finally identified.
14. The apparatus of claim 13, wherein the sixth processing unit comprises:
the first processing subunit is configured to, if the embedded feature buffer pool has an embedded feature with a similarity greater than a preset value to the (N + 1) th embedded feature, determine that the target object has appeared, and not place the (N + 1) th embedded feature in the embedded feature buffer pool;
and the second processing subunit is configured to, if there is no embedded feature in the embedded feature buffer pool whose similarity to the (N + 1) -th embedded feature is greater than a preset value, determine that the target object has not appeared, place the (N + 1) -th embedded feature into the embedded feature buffer pool, and identify the target object according to an intersection-and-merge ratio of the (N + 1) -th detection frame and a detection frame in the detection frame buffer pool.
15. The apparatus according to claim 14, wherein the second processing subunit is specifically configured to:
acquiring the intersection ratio of the (N + 1) th detection frame;
if the intersection ratio of the target detection frames in the detection frame buffer pool is larger than the intersection ratio of the (N + 1) th detection frame, determining that the target object corresponding to the (N + 1) th detection frame and the target object corresponding to the target detection frame are the same target object, not placing the (N + 1) th detection frame in the detection frame buffer pool, otherwise, placing the (N + 1) th detection frame in the detection frame buffer pool.
16. The apparatus of claim 14, wherein the sixth processing unit comprises:
a third processing subunit, configured to, when the first duration of the embedded feature buffer pool arrives, empty the embedded feature buffer pool, and repeat an operation of whether to place an N +1 th embedded feature in the embedded feature buffer pool;
and the fourth processing subunit is configured to, when the second duration of the detection frame buffer pool reaches, empty the detection frame buffer pool, and repeat an operation of whether to place the (N + 1) th detection frame in the detection frame buffer pool until the target object is finally identified.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202110088893.XA 2021-01-22 2021-01-22 Target object identification method, device, equipment and storage medium Active CN112906478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110088893.XA CN112906478B (en) 2021-01-22 2021-01-22 Target object identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110088893.XA CN112906478B (en) 2021-01-22 2021-01-22 Target object identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112906478A true CN112906478A (en) 2021-06-04
CN112906478B CN112906478B (en) 2024-01-09

Family

ID=76118449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110088893.XA Active CN112906478B (en) 2021-01-22 2021-01-22 Target object identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112906478B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105046196A (en) * 2015-06-11 2015-11-11 西安电子科技大学 Front vehicle information structured output method base on concatenated convolutional neural networks
US20180005037A1 (en) * 2016-06-29 2018-01-04 Cellular South, Inc. Dba C Spire Wireless Video to data
US20200265255A1 (en) * 2017-11-12 2020-08-20 Beijing Sensetime Technology Development Co., Ltd. Target detection method and apparatus, training method, electronic device and medium
CN108229456A (en) * 2017-11-22 2018-06-29 深圳市商汤科技有限公司 Method for tracking target and device, electronic equipment, computer storage media
WO2020164282A1 (en) * 2019-02-14 2020-08-20 平安科技(深圳)有限公司 Yolo-based image target recognition method and apparatus, electronic device, and storage medium
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
US20200363815A1 (en) * 2019-05-17 2020-11-19 Nvidia Corporation Object pose estimation
CN110276269A (en) * 2019-05-29 2019-09-24 西安交通大学 A kind of Remote Sensing Target detection method based on attention mechanism
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN111191518A (en) * 2019-12-09 2020-05-22 齐伯阳 Double-spectrum target detection frame coordinate synchronization and mask layer drawing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUA Xia; WANG Xinqing; MA Zhaoye; WANG Dong; SHAO Faming: "Video multi-object detection technology based on recurrent neural networks", Application Research of Computers, no. 02, pages 301-306
WANG Tingting; PAN Xiang: "Research on object detection algorithms based on convolutional neural networks", Journal of Changchun Normal University, no. 06, pages 47-53
WANG Lipeng; ZHANG Zhi; SU Li; NIE Wenchang: "Adaptive-weight object classification method based on multi-feature fusion", Journal of Huazhong University of Science and Technology (Natural Science Edition), no. 09, pages 43-48

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114529768A (en) * 2022-02-18 2022-05-24 阿波罗智联(北京)科技有限公司 Method and device for determining object class, electronic equipment and storage medium
CN114529768B (en) * 2022-02-18 2023-07-21 阿波罗智联(北京)科技有限公司 Method, device, electronic equipment and storage medium for determining object category

Also Published As

Publication number Publication date
CN112906478B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN113205037B (en) Event detection method, event detection device, electronic equipment and readable storage medium
CN112613569B (en) Image recognition method, training method and device for image classification model
CN110633594A (en) Target detection method and device
CN113378834A (en) Object detection method, device, apparatus, storage medium, and program product
CN113705716B (en) Image recognition model training method and device, cloud control platform and automatic driving vehicle
CN112906478B (en) Target object identification method, device, equipment and storage medium
CN111832658B (en) Point-of-interest information processing method and device, electronic equipment and storage medium
CN113378768A (en) Garbage can state identification method, device, equipment and storage medium
CN113378857A (en) Target detection method and device, electronic equipment and storage medium
CN115761698A (en) Target detection method, device, equipment and storage medium
CN116052097A (en) Map element detection method and device, electronic equipment and storage medium
CN113989720A (en) Target detection method, training method, device, electronic equipment and storage medium
CN112153320A (en) Method and device for measuring size of article, electronic equipment and storage medium
CN113806361B (en) Method, device and storage medium for associating electronic monitoring equipment with road
CN113361524B (en) Image processing method and device
CN113012439B (en) Vehicle detection method, device, equipment and storage medium
CN114529768B (en) Method, device, electronic equipment and storage medium for determining object category
CN113378850B (en) Model training method, pavement damage segmentation device and electronic equipment
CN117112816B (en) Sorting method, device, equipment and storage medium for security inspection images
CN114445711B (en) Image detection method, image detection device, electronic equipment and storage medium
CN112541496B (en) Method, device, equipment and computer storage medium for extracting POI (point of interest) names
CN115294536B (en) Violation detection method, device, equipment and storage medium based on artificial intelligence
CN115713749A (en) Automatic driving model training method and device, electronic equipment and storage medium
CN115761677A (en) Detection model training of road sign line, detection method of road sign line and automatic driving vehicle
CN117994528A (en) Method, device and equipment for manually reviewing prediction result of image defect detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant