CN113095434A - Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN113095434A
Authority
CN
China
Prior art keywords
image
target detection
detection model
scene
target
Prior art date
Legal status
Pending
Application number
CN202110462394.2A
Other languages
Chinese (zh)
Inventor
李搏
窦浩轩
甘伟豪
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority: CN202110462394.2A
Publication: CN113095434A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide a target detection method and device, electronic equipment and a storage medium. The method comprises the following steps: detecting an acquired image to be detected with an updated target detection model to obtain a target detection result. The updated target detection model is obtained by training the target detection model with manually labeled scene images corresponding to result predicted images; the target detection model is obtained by training an initial detection model on a preset training set; and a result predicted image is a scene image, detected by the target detection model, that carries prediction annotation data. The preset training set consists of sample images of a preset category selected from a data set; the data set comprises a plurality of images in which target objects are labeled, and the target objects of a preset number of these images belong to different categories. The method and device can improve the accuracy of the target detection result.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to image processing technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
Large-scale, city-level cat and dog detection plays an important role in smart cities and related applications. Compared with conventional human/vehicle detection, it is more challenging, because cats and dogs are small targets and samples of them are difficult to collect.
Detection models for cats and dogs in the related art have low precision, so the detection effect is poor.
Disclosure of Invention
The embodiment of the disclosure provides a target detection method and device, an electronic device and a storage medium, which can improve the accuracy of a target detection result.
The technical scheme of the embodiment of the disclosure is realized as follows:
The embodiment of the disclosure provides a target detection method, which includes: acquiring an image to be detected; and performing target detection on the image to be detected with an updated target detection model to obtain a target detection result. The updated target detection model is obtained by training the target detection model with manually labeled scene images corresponding to result predicted images. The target detection model is obtained by training an initial detection model on a preset training set, and a result predicted image is a scene image, detected by the target detection model, that carries prediction annotation data. The manual labeling verifies the prediction annotation data and, according to the verification result, correctly labels the target object in the scene image corresponding to the result predicted image. The preset training set consists of sample images of a preset category selected from a data set; the data set comprises a plurality of images in which target objects are labeled, and the target objects of a preset number of these images belong to different categories.
In the above method, before the step of performing target detection on the image to be detected by using the updated target detection model to obtain a target detection result, the method further includes: performing target detection on a plurality of scene images by adopting a target detection model to obtain a result predicted image; acquiring the scene image which corresponds to the result predicted image and is subjected to manual labeling processing; and taking the scene image subjected to the manual labeling processing as a training sample, and training the target detection model to obtain the updated target detection model.
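For illustration only, the following Python sketch summarizes this update flow under stated assumptions: the helper functions (train, detect, request_manual_labels) are hypothetical placeholders standing in for a real detector, an annotation tool and a training routine, none of which are prescribed by this disclosure.

```python
# Hypothetical placeholders; a real system would back these with an actual
# detector, a labeling tool and a training routine.
def train(model, samples):
    """Fine-tune `model` on `samples` and return the trained model."""
    return model

def detect(model, image):
    """Run detection; return the scene image with prediction annotation data."""
    return {"image": image, "predicted_boxes": []}

def request_manual_labels(predictions):
    """Annotators verify the predictions and correctly label the targets."""
    return [{"image": p["image"], "boxes": p["predicted_boxes"]} for p in predictions]

def update_detection_model(initial_model, preset_training_set, scene_images):
    # Train the initial detection model on the preset training set.
    model = train(initial_model, preset_training_set)
    # Mine the scene images: detection yields result predicted images.
    predictions = [detect(model, img) for img in scene_images]
    # Manual labeling: verify predictions, correct the annotations.
    labeled_scenes = request_manual_labels(predictions)
    # Continue training on the labeled scene images to obtain the
    # updated target detection model.
    return train(model, labeled_scenes)
```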
In the above method, the scene image subjected to the artificial labeling processing includes: the scene image with the target labeling data and the scene image without the target labeling data.
In the above method, the training the target detection model with the scene image subjected to the manual labeling processing as a training sample to obtain the updated target detection model includes: determining a scene image with target labeling data as a positive sample, and determining a scene image without target labeling data as a negative sample, wherein the positive sample and the negative sample are the training samples; and training the target detection model according to the training sample to obtain the updated target detection model.
In the above method, the training the target detection model with the scene image subjected to the manual labeling processing as a training sample to obtain the updated target detection model includes: and under the condition that the number of the scene images subjected to the artificial labeling processing is greater than or equal to a preset threshold value, taking the scene images subjected to the artificial labeling processing as training samples, and training the target detection model to obtain the updated target detection model.
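As a minimal sketch of the two preceding paragraphs, the snippet below treats each manually labeled scene image as a dict with a "boxes" list (its target labeling data), splits the images into positive and negative samples, and releases them for training only once a preset threshold is reached; the dict layout and the threshold value are illustrative assumptions, not values fixed by the disclosure.

```python
MIN_LABELED_IMAGES = 1000  # hypothetical preset threshold, not fixed by the disclosure

def build_training_samples(labeled_scene_images):
    # Retrain only when enough manually labeled scene images have accumulated.
    if len(labeled_scene_images) < MIN_LABELED_IMAGES:
        return None
    # A scene image with target labeling data is a positive sample;
    # a scene image without target labeling data is a negative sample.
    positives = [s for s in labeled_scene_images if s["boxes"]]
    negatives = [s for s in labeled_scene_images if not s["boxes"]]
    return positives, negatives
```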
In the above method, before the performing target detection on a plurality of scene images by using the target detection model to obtain the result predicted image, the method further includes: intercepting video frames in a scene video stream at preset time intervals according to attribute parameters of the video frames to obtain a preset number of video frames; determining the preset number of video frames as the plurality of scene images.
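One possible implementation of this frame-cutting step is sketched below, assuming OpenCV is available; the interval and frame count are illustrative values only.

```python
import cv2

def capture_scene_images(video_path, interval_seconds=5.0, max_frames=100):
    """Cut frames from a scene video stream at preset time intervals."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # attribute parameter of the frames
    step = max(1, int(round(fps * interval_seconds)))
    frames, index = [], 0
    while len(frames) < max_frames:              # stop at the preset number of frames
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                    # keep one frame per preset interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames                                # the plurality of scene images
```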
In the method, the scene image with the target marking data is a scene image marked with at least one of a cat and a dog; the scene image without the target marking data is the scene image without at least one of the cat and the dog.
In the above method, the preset categories include: at least one of a cat and a dog.
In the above method, the image to be detected includes: a plurality of sub-images to be detected; the target object includes: at least one of a cat and a dog; the target detection result comprises: a first result image noting a location area of the at least one of a cat and a dog, or a second result image not noting a location area of the at least one of a cat and a dog; and adopting the updated target detection model to carry out target detection on the image to be detected to obtain a target detection result, and comprising the following steps of: performing target detection on each sub-image to be detected by adopting the updated target detection model to obtain at least one region of interest of each sub-image to be detected and posterior probability corresponding to each region of interest in the at least one region of interest; under the condition that the posterior probability corresponding to any one of the at least one interested area is greater than or equal to a preset value, marking the position area of at least one of the cat and the dog in each sub-image to be detected to obtain a first result image; and under the condition that the posterior probabilities corresponding to the at least one region of interest are all smaller than the preset value, not marking the position region of at least one of the cat and the dog in each sub-image to be detected to obtain the second result image.
In the above method, the target detection model includes: a convolutional layer, a region generation network, a pooling layer, a fully connected layer and a normalization index layer; and the performing target detection on each sub-image to be detected with the updated target detection model to obtain at least one region of interest of each sub-image to be detected and the posterior probability corresponding to each region of interest includes: performing convolution processing on each sub-image to be detected with the convolutional layer to obtain a feature map corresponding to each sub-image to be detected; identifying regions of interest in the feature map with the region generation network to obtain at least one region of interest of the feature map; performing pooling processing on each region of interest with the pooling layer to obtain corresponding feature vectors; converting the feature vectors into corresponding two-dimensional vectors with the fully connected layer; and performing normalization processing on the two-dimensional vectors with the normalization index layer to obtain the posterior probability of each region of interest.
The embodiment of the present disclosure provides a target detection apparatus, including: an image acquisition module, configured to acquire an image to be detected; and a target detection module, configured to perform target detection on the image to be detected with an updated target detection model to obtain a target detection result. The updated target detection model is obtained by training the target detection model with manually labeled scene images corresponding to result predicted images. The target detection model is obtained by training an initial detection model on a preset training set, and a result predicted image is a scene image, detected by the target detection model, that carries prediction annotation data. The manual labeling verifies the prediction annotation data and, according to the verification result, correctly labels the target object in the scene image corresponding to the result predicted image. The preset training set consists of sample images of a preset category selected from a data set; the data set comprises a plurality of images in which target objects are labeled, and the target objects of a preset number of these images belong to different categories.
The above-mentioned device still includes: the detection module is used for performing target detection on the image to be detected by adopting the updated target detection model to obtain a target detection result, and performing target detection on a plurality of scene images by adopting the target detection model to obtain a result prediction image; the sample acquisition module is used for acquiring the scene image which corresponds to the result predicted image and is subjected to manual labeling processing; and the updating module is used for taking the scene image subjected to the manual labeling processing as a training sample, training the target detection model and obtaining the updated target detection model.
In the above apparatus, the scene image subjected to the artificial labeling processing includes: the scene image with the target labeling data and the scene image without the target labeling data.
In the above apparatus, the update module is further configured to determine a scene image with target labeling data as a positive sample, and determine a scene image without target labeling data as a negative sample, where the positive sample and the negative sample are the training samples; and training the target detection model according to the training sample to obtain the updated target detection model.
In the above apparatus, the updating module is further configured to, when the number of the scene images subjected to the manual labeling processing is greater than or equal to a preset threshold, train the target detection model by using the scene images subjected to the manual labeling processing as a training sample, so as to obtain the updated target detection model.
In the above apparatus, the detection module is further configured to, before the target detection is performed on multiple scene images by using the target detection model to obtain the result predicted image, capture video frames in a scene video stream at preset time intervals according to attribute parameters of the video frames to obtain a preset number of video frames; determining the preset number of video frames as the plurality of scene images.
In the device, the scene image with the target marking data is a scene image marked with at least one of a cat and a dog; the scene image without the target marking data is the scene image without at least one of the cat and the dog.
In the above apparatus, the preset categories include: at least one of a cat and a dog.
In the above apparatus, the image to be detected includes: a plurality of sub-images to be detected; the target object includes: at least one of a cat and a dog; the target detection result comprises: a first result image noting a location area of the at least one of a cat and a dog, or a second result image not noting a location area of the at least one of a cat and a dog; the target detection module is further configured to perform target detection on each sub-image to be detected by using the updated target detection model to obtain at least one region of interest of each sub-image to be detected and a posterior probability corresponding to each region of interest in the at least one region of interest; under the condition that the posterior probability corresponding to any one of the at least one interested area is greater than or equal to a preset value, marking the position area of at least one of the cat and the dog in each sub-image to be detected to obtain a first result image; and under the condition that the posterior probabilities corresponding to the at least one region of interest are all smaller than the preset value, not marking the position region of at least one of the cat and the dog in each sub-image to be detected to obtain the second result image.
In the above apparatus, the target detection model includes: a convolutional layer, a region generation network, a pooling layer, a fully connected layer and a normalization index layer; the target detection module is further configured to: perform convolution processing on each sub-image to be detected with the convolutional layer to obtain a feature map corresponding to each sub-image to be detected; identify regions of interest in the feature map with the region generation network to obtain at least one region of interest of the feature map; perform pooling processing on each region of interest with the pooling layer to obtain corresponding feature vectors; convert the feature vectors into corresponding two-dimensional vectors with the fully connected layer; and perform normalization processing on the two-dimensional vectors with the normalization index layer to obtain the posterior probability of each region of interest.
An embodiment of the present disclosure provides an electronic device, including: a memory for storing an executable computer program; a processor for implementing the above object detection method when executing the executable computer program stored in the memory.
The embodiment of the present disclosure provides a computer-readable storage medium storing a computer program for causing a processor to execute the method for detecting the target.
With the above technical solution, before the updated target detection model is used to perform target detection on the image to be detected, the target detection model is first trained with sample images of the preset category selected from the preset data set. Scene images are then mined with this target detection model to obtain scene images carrying prediction annotation data, and the trained target detection model is further trained on the original scene images, with correct manual annotation data, that correspond to the obtained scene images carrying prediction annotation data, yielding the updated target detection model. The updated target detection model is therefore better suited to the detection scene in practical applications and has higher accuracy, so that the target detection result obtained when performing target detection with it is more accurate; that is, the accuracy of the target detection result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is an alternative architecture diagram of a target detection system provided by an embodiment of the present disclosure;
fig. 2A is a schematic structural diagram of a first terminal according to an embodiment of the present disclosure;
fig. 2B is a schematic structural diagram of a second terminal according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure;
fig. 4 is an alternative flow chart of a target detection method provided by the embodiment of the disclosure;
fig. 5 is an alternative flow chart of a target detection method provided by the embodiment of the disclosure;
FIG. 6 is a schematic flowchart illustrating an exemplary process of target detection of a scene image by a target detection model according to an embodiment of the present disclosure;
fig. 7 is an alternative flow chart of a target detection method provided by the embodiment of the disclosure;
FIG. 8A is an exemplary image of a scene with target annotation data provided by embodiments of the present disclosure;
FIG. 8B is an exemplary image of a scene without target annotation data provided by an embodiment of the disclosure;
fig. 9 is an alternative flow chart of a target detection method provided by the embodiment of the disclosure;
fig. 10 is a schematic flowchart of an application scenario of the target detection method according to the embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second" and "third" are used only to distinguish similar objects and do not denote a particular order; where permissible, the specific order or sequence may be interchanged, so that the embodiments of the disclosure described herein can be practiced in an order other than that shown or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the disclosure only and is not intended to be limiting of the disclosure.
Before further describing the embodiments of the present disclosure in detail, the terms and expressions used in the embodiments are explained; the following explanations apply throughout.
1) Target detection: in general, target detection extracts features from a picture through a target detection network, distinguishes foreground from background among the extracted features, and classifies the correct foreground as the target object. Target detection focuses on a specific object and requires that the category information and the location information of that object be obtained simultaneously.
2) Training samples: by learning from the positive and negative samples of the target object in the training samples, the target detection network extracts the features corresponding to the target object and correctly classifies its foreground and background, thereby realizing target detection.
At present, large-scale city-level cat and dog detection plays a very important role in smart cities and security scenarios, and compared with conventional human/vehicle detection it is more challenging. Study shows that the difficulties in detecting cats and dogs are as follows: (1) as with most outdoor detection, cats and dogs are small, and the detection effect on small targets (targets whose image region occupies only a small part of the whole image) is poor; (2) in addition, cats and dogs appear outdoors less often, so samples are harder to collect. Therefore, when a pipeline similar to that of other detection tasks is adopted, i.e., collecting cat and dog data → labeling → training, the accuracy of the obtained model is low, and the detection effect can hardly meet the requirement.
The embodiment of the disclosure provides a target detection method and device, an electronic device and a storage medium, which can improve the accuracy of a target detection result. An exemplary application of the object detection device provided by the embodiments of the present disclosure is described below, and the device provided by the embodiments of the present disclosure may be implemented as various types of user terminals such as a notebook computer with an image capture device, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), and the like. In the following, an exemplary application will be explained when the device is implemented as a terminal.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an object detection system 100 provided in the embodiment of the present disclosure, where the object detection system 100 includes a first terminal (model training device) 400-1 and a second terminal (object detection device) 400-2; to support an object detection application, the first terminal 400-1 and the second terminal 400-2 are connected to the server 200 via the network 300, and the network 300 may be a wide area network or a local area network, or a combination thereof.
The first terminal 400-1 is configured to perform target detection on multiple scene images by using a target detection model to obtain a result predicted image; the target detection model is obtained by training the initial detection model based on a preset training set; the preset training set is a plurality of sample images which are selected from the data set and belong to a preset category; the data set is a plurality of images marked with target objects, and the target objects of a preset number of images in the plurality of images belong to different categories; the resulting predicted image includes: predicting the annotation data; the scene image which corresponds to the result prediction image and is subjected to manual labeling processing is obtained; the artificial labeling processing is used for verifying the prediction labeling data and correctly labeling the target object in the scene image corresponding to the result prediction image according to the verification result; taking the scene image subjected to manual labeling processing as a training sample, training a target detection model to obtain an updated target detection model, wherein the updated target detection model is used for performing target detection on the image to be detected; and sends the obtained updated target detection model to the server 200, and forwards the updated target detection model to the second terminal 400-2 through the server 200. The second terminal 400-2 is used for acquiring an image to be detected; performing target detection on an image to be detected by using the updated target detection model obtained by the first terminal 400-1 to obtain a target detection result, displaying the target detection result on the graphical interface 4001, sending the detection result to the server 200, and storing the detection result in the database 500 by the server 200; and the database 500 is further configured to store a preset training set used by the first terminal 400-1 for training the target detection model.
According to the foregoing, in the target detection system 100, under the condition that the first terminal and the second terminal are disposed on different electronic devices, the electronic device where the second terminal is located may perform target detection on the image to be detected by using the updated target detection model obtained by training the electronic device where the first terminal is located, so as to obtain the result of target detection.
In some embodiments of the present disclosure, the first terminal and the second terminal may also be disposed on the same electronic device, that is, the training of the target detection model and the use in the online scene may be performed by the same electronic device. In this embodiment, the electronic device may not need to interact with the server 200, so that after the updated target detection model is obtained through training, the updated target detection model may be used to perform target detection on the image to be detected.
Illustratively, in the application scenario of a smart city, the database 500 stores preset training sets for various target objects, such as fireworks, banners, cats, dogs, etc. Under the condition that cats and dogs in a city need to be detected or cats and dogs need to be detected at the same time, the first terminal 400-1 may obtain a preset training set of the cats and dogs from the database 500 through the server 200, and train the target detection network based on the preset training set to obtain the target detection model.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present disclosure is not limited thereto.
Fig. 2A and 2B are schematic structural diagrams of a first terminal and a second terminal in a case where the first terminal and the second terminal belong to different electronic devices according to an embodiment of the present disclosure; fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure, in a case that a first terminal and a second terminal belong to the same electronic device.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of a first terminal 400-1 according to an embodiment of the present disclosure, where the first terminal 400-1 shown in fig. 2A includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the first terminal 400-1 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2A.
The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 450 described in embodiments of the present disclosure is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating with other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present disclosure may be implemented in software, and fig. 2A illustrates a model training apparatus 455 stored in a memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a detection module 4551, a sample acquisition module 4552 and an update module 4553, which are logical and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
Referring to fig. 2B, fig. 2B is a schematic structural diagram of a second terminal 400-2 according to an embodiment of the present disclosure, where the second terminal 400-2 shown in fig. 2B includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the second terminal 400-2 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 540 in figure 2B.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor, a digital signal processor, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like, wherein the general purpose processor may be a microprocessor or any conventional processor or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory and the volatile memory may be a random access memory. The memory 550 described in embodiments of the present disclosure is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wi-Fi, Universal Serial Bus, etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present disclosure may be implemented in software, and fig. 2B shows a detection apparatus 555 stored in a memory 550, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: an image acquisition module 5551 and an object detection module 5552, which are logical and thus may be arbitrarily combined or further separated depending on the functionality implemented. The functions of the respective modules will be explained below.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device 600 provided in an embodiment of the present disclosure, where the electronic device 600 shown in fig. 3 includes: at least one processor 610, memory 650, at least one network interface 620, and a user interface 630. The various components in electronic device 600 are coupled together by a bus system 640. It is understood that bus system 640 is used to enable communications among the components. Bus system 640 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 640 in fig. 3.
The processor 610 may be an integrated circuit chip having signal processing capabilities such as a general purpose processor, a digital signal processor, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., wherein the general purpose processor may be a microprocessor or any conventional processor, etc.
The user interface 630 includes one or more output devices 631 including one or more speakers and/or one or more visual displays that enable the presentation of media content. The user interface 630 also includes one or more input devices 632, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 650 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 650 optionally includes one or more storage devices physically located remote from processor 610.
The memory 650 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory and the volatile memory may be a random access memory. The memory 650 described in embodiments of the present disclosure is intended to comprise any suitable type of memory.
In some embodiments, memory 650 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 651 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and for handling hardware-based tasks;
a network communication module 652 for reaching other computing devices via one or more (wired or wireless) network interfaces 620, exemplary network interfaces 620 including: Bluetooth, Wi-Fi, Universal Serial Bus, etc.;
a presentation module 653 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 631 (e.g., display screens, speakers, etc.) associated with the user interface 630;
an input processing module 654 for detecting one or more user inputs or interactions from one of the one or more input devices 632 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present disclosure may be implemented in software, and fig. 3 shows the object detection apparatus 655 stored in the memory 650, which may be software in the form of programs and plug-ins, etc., and includes the following software modules: the image acquisition module 6551, the object detection module 6552, the detection module 6553, the sample acquisition module 6554 and the update module 6555, which are logical and thus can be arbitrarily combined or further separated depending on the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the model training device, the target detection device, and the detection device provided in the embodiments of the present disclosure may be implemented in hardware; for example, they may be processors in the form of hardware decoding processors programmed to perform the target detection method provided in the embodiments of the present disclosure. For example, a processor in the form of a hardware decoding processor may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
Hereinafter, the object detection method provided by the embodiment of the present disclosure will be described in conjunction with exemplary applications and implementations of the above-mentioned electronic device 600 provided by the embodiment of the present disclosure.
FIG. 4 is a schematic flow chart diagram illustrating an alternative method for detecting an object provided by an embodiment of the present disclosure; as shown in fig. 4, the method includes:
and S101, acquiring an image to be detected.
The image to be detected can be an image directly captured of a specific environment, or each frame image obtained by frame-splitting a video stream captured of that environment in real time. For example, when detecting a cat or a dog in a certain scene, the image to be detected may be an image of the scene captured at a preset frequency, or each frame image obtained by frame-splitting a video stream of the scene acquired in real time.
The image to be detected may include a target object, where the target object refers to an object to be detected in a specific environment, such as a cat or a dog in a certain scene, or a pedestrian at a certain traffic intersection; the image to be detected may also not contain the target object.
The image to be detected may be an image acquired by the electronic device through an image acquisition device of the electronic device, for example, a camera, or may be an image input into the electronic device by an external image acquisition device, which is not limited in this disclosure.
S102, performing target detection on the image to be detected with the updated target detection model to obtain a target detection result. The updated target detection model is obtained by training the target detection model with manually labeled scene images corresponding to result predicted images. The target detection model is obtained by training an initial detection model on a preset training set, and a result predicted image is a scene image, detected by the target detection model, that carries prediction annotation data. The manual labeling verifies the prediction annotation data and, according to the verification result, correctly labels the target object in the scene image corresponding to the result predicted image. The preset training set consists of sample images of a preset category selected from a data set; the data set comprises a plurality of images in which target objects are labeled, and the target objects of a preset number of these images belong to different categories.
The electronic device may perform target detection on the image to be detected by using the updated target detection model obtained through training, so as to obtain an image in which the target object is marked by using the bounding box and a category corresponding to the target object.
Here, the data set is composed of a plurality of images in which the target objects are marked, and the target objects of a preset number of images in the plurality of images belong to different categories; illustratively, the data set may be an academic collection including a plurality of different categories of sample images in which the target object is labeled. Here, the prediction annotation data may be a region of a bounding box that marks the position of the target object in the image, or a region that does not include any bounding box, and correspondingly, the resultant prediction image may be an image of a scene in which the target object is marked with a bounding box, or an image of a scene in which no target object is marked with a bounding box. In some embodiments, the number of the plurality of sample images belonging to the preset category selected from the data set may be 200, may also be 500, and the like, which is not limited by the embodiments of the present disclosure.
In some embodiments, the preset category may include, for example, at least one of a cat and a dog. The electronic equipment can select a preset number of sample images labeled with cats or dogs from the academic collection in advance, or add sample images labeled with cats and dogs to the preset training set, to form the training set for training the initial detection model; correspondingly, the result predicted image can be a predicted image for at least one of a cat and a dog. For target objects of the preset category that appear infrequently in video scenes, selecting images labeled with those objects from an academic collection as sample images enriches the training data for the initial detection model, so that a needed cold-start model can be trained quickly without collecting a large number of training samples, which improves the efficiency of model training.
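As an illustration of how such a preset training set might be drawn from a larger labeled data set, the sketch below filters a COCO-style annotation list for the preset categories; the data layout and the sample count are assumptions for illustration, not part of the disclosure.

```python
PRESET_CATEGORIES = {"cat", "dog"}  # the preset category of target objects

def select_preset_training_set(dataset, sample_count=500):
    """dataset: list of {"file": ..., "annotations": [{"category": ...}, ...]} (assumed layout)."""
    selected = [
        image for image in dataset
        if any(ann["category"] in PRESET_CATEGORIES
               for ann in image["annotations"])
    ]
    # e.g. 200 or 500 sample images, per the embodiments above
    return selected[:sample_count]
```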
In some embodiments, the preset training set may include, in addition to the images belonging to the preset category selected from the academic set, pre-acquired scene images belonging to the preset category.
In some embodiments, the scene images may be a plurality of video scene images obtained in advance, for example, images of a scene taken by the electronic device at a preset frequency by using the image capturing device; or a plurality of video frames obtained by the electronic device performing frame cutting operation on the scene video stream of some scenes. In other embodiments, the scene image may also be obtained by the electronic device from other apparatuses, which is not limited by the embodiments of the present disclosure. It should be noted that some scenes may be indoor scenes in a city, outdoor scenes in a city, and the like, and this is not limited in the embodiments of the present disclosure.
It should be noted that the "target detection model" appearing below refers to the model obtained by training the initial detection model based on the preset training set; the "updated target detection model" appearing separately refers to a model obtained by training the target detection model with a scene image corresponding to the result predicted image and subjected to manual labeling processing.
Fig. 5 is an optional flowchart of the object detection method provided in the embodiment of the present disclosure, and as shown in fig. 5, S102 in fig. 4 may be implemented by S1021-S1023, which will be described with reference to the steps shown in fig. 5.
S1021, performing target detection on each sub-image to be detected by adopting the updated target detection model to obtain at least one region of interest of each sub-image to be detected and posterior probability corresponding to each region of interest in the at least one region of interest; the image to be detected includes: a plurality of sub-images to be detected; the target object includes: at least one of a cat and a dog; the target detection result comprises: a first result image with a location area of at least one of a cat and a dog noted, or a second result image with a location area of at least one of a cat and a dog not noted.
The electronic device may perform target detection on the obtained sub-images to be detected by using the updated target detection model, and for each sub-image to be detected, at least one region of interest (RoI) and a posterior probability corresponding to each region of interest in the at least one region of interest may be obtained. It should be noted that the target detection model may be any network model capable of performing target detection, and this is not limited in the embodiment of the present disclosure.
In some embodiments of the present disclosure, the multiple sub-images to be detected may be multiple continuous images, and the electronic device may perform target detection on each image according to the sequence of the multiple continuous images; in other embodiments of the present disclosure, the multiple sub-images to be detected may also be multiple discontinuous images. It should be noted that, in the embodiment of the present disclosure, the target detection model may detect one image at a time, or may detect multiple images at a time, which is not limited in the embodiment of the present disclosure.
In the embodiment of the present disclosure, one image to be detected may be an image containing a cat or an image containing a dog, or may be an image containing both a cat and a dog; accordingly, the target object to be detected from each image to be detected is a cat or a dog, or a cat and a dog.
In the embodiment of the present disclosure, each input sub-image to be detected may contain at least one of a cat and a dog, or contain neither; therefore, for one sub-image to be detected, after target detection with the updated target detection model, the obtained target detection result may include: a first result image in which the location area of at least one of a cat and a dog, together with the category of the target object in that area, is marked; or a second result image in which no such location area is marked. For example, when the target object to be detected is a cat and an input sub-image B to be detected contains a cat, the target detection result output by the updated target detection model is the sub-image B to be detected with a bounding box (the first result image), the bounding box indicating the position area of the cat in sub-image B and the category of the target object: "cat"; when the input sub-image B to be detected does not contain a cat, the target detection result output by the updated target detection model is the sub-image B to be detected without any label (the second result image).
S1022, under the condition that the posterior probability corresponding to any one of the at least one interested area is larger than or equal to a preset value, the position area of at least one of the cat and the dog is marked in each sub-image to be detected, and a first result image is obtained.
And S1023, under the condition that the posterior probabilities corresponding to the at least one interested area are smaller than a preset value, not marking the position area of at least one of the cat and the dog in each sub-image to be detected to obtain a second result image.
Through the target detection model, the electronic equipment marks each region of interest whose posterior probability is greater than or equal to the preset value on the corresponding sub-image to be detected with a bounding box, and outputs a first result image carrying the bounding box; for a sub-image to be detected in which the posterior probabilities of all regions of interest are smaller than the preset value, the target detection model does not mark the sub-image, and outputs a second result image without any bounding box.
Here, the preset value can be set arbitrarily as required. With a lower preset value, difficult samples (for example, blurred or dimly lit samples) or new, untrained samples can be obtained; with a higher preset value, positive samples of different sizes, postures, illumination and environments can be obtained. The embodiments of the disclosure do not limit the specific value of the preset value.
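The decision rule of S1022/S1023 can be sketched as follows, assuming OpenCV for drawing; the RoI tuple format and the preset value of 0.5 are illustrative assumptions.

```python
import cv2

def annotate_detections(sub_image, rois, posteriors, preset_value=0.5):
    """rois: list of (x1, y1, x2, y2) boxes; posteriors: matching values in [0, 1]."""
    result = sub_image.copy()
    marked = False
    for (x1, y1, x2, y2), prob in zip(rois, posteriors):
        if prob >= preset_value:  # S1022: mark the position area of the cat/dog
            cv2.rectangle(result, (x1, y1), (x2, y2), (0, 255, 0), 2)
            marked = True
    # S1023: if no posterior probability reached the preset value, `result`
    # is returned without any bounding box (the second result image).
    return result, marked
```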
In some embodiments, the target detection model may be a Region-based Fully Convolutional Network (R-FCN) detection model; the target detection model may include: a convolutional layer, a Region Proposal Network (RPN), a pooling layer, a fully connected layer, and a normalization index (softmax) layer.
In the embodiment of the disclosure, after the target detection model is obtained, the electronic device further mines scene images through the target detection model to obtain scene images carrying prediction annotation data, and continues to train the trained target detection model on the original scene images, with correct manual annotation data, that correspond to those scene images, to obtain the updated target detection model. The updated target detection model is thus better suited to the detection scene in practical applications and more accurate, so that the first result image or second result image obtained when performing target detection on each sub-image to be detected with the updated target detection model is more accurate, which improves the accuracy of the target detection result.
In an embodiment of the present disclosure, the above S1021 may be implemented by S11-S15:
S11, performing convolution processing on each sub-image to be detected by using the convolution layer, to obtain a feature map corresponding to each sub-image to be detected.
For each sub-image to be detected, the electronic device can perform feature extraction on the sub-image through the convolution layer to obtain a corresponding feature map. For example, the convolution layer may consist of two ordinary convolutional layers and two dense convolutional layers, through which the electronic device performs feature extraction on the sub-image to be detected to obtain its feature map.
S12, performing region-of-interest identification on the feature map by using the region generation network, to obtain at least one region of interest of the feature map.
The electronic device may use the RPN to identify regions of interest on the feature map, obtaining at least one region of interest corresponding to the sub-image to be detected. It should be noted that each identified region of interest is only a candidate region that may contain the target object; the electronic device can subsequently determine whether each region of interest is actually a region where the target object is located by comparing its posterior probability with the probability threshold.
S13, performing pooling processing on each of the at least one region of interest by using the pooling layer, to obtain a corresponding feature vector.
The electronic device may use the pooling layer to pool each region of interest to obtain a feature vector for each region of interest. For example, the electronic device may perform average pooling on each region of interest, thereby obtaining the feature vector corresponding to each region of interest.
S14, converting each feature vector into a corresponding two-dimensional vector by using the full connection layer.
S15, performing normalization processing on each two-dimensional vector by using the normalization index layer, to obtain the posterior probability of each region of interest.
After obtaining the feature vector of each region of interest, the electronic device may use the full connection layer to convert each feature vector into a corresponding two-dimensional vector, and then use the normalization index layer to apply softmax to each two-dimensional vector, thereby obtaining the posterior probability of each region of interest. Here, the posterior probability of each region of interest is a value between 0 and 1: a posterior probability of 0 indicates that the region of interest is not a region where the target object is located, and a posterior probability of 1 indicates that it is.
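As a rough sketch of steps S13 to S15, assuming a PyTorch setting in which a backbone and an RPN have already produced the feature map and the RoI boxes; the layer sizes, the 1/16 spatial scale, and all names below are illustrative assumptions, and torchvision's roi_align (which averages sampled values per bin) stands in for the average pooling described above:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RoIScoringHead(nn.Module):
    """Scores each region of interest as target / non-target."""

    def __init__(self, channels: int = 256, pool_size: int = 7):
        super().__init__()
        self.pool_size = pool_size
        # S14: full connection layer mapping each pooled RoI
        # to a two-dimensional vector.
        self.fc = nn.Linear(channels * pool_size * pool_size, 2)

    def forward(self, feature_map: torch.Tensor, rois: torch.Tensor) -> torch.Tensor:
        # S13: pool each RoI on the feature map into a fixed-size feature vector.
        # `rois` has shape (N, 5): (batch_index, x1, y1, x2, y2) in image coordinates.
        pooled = roi_align(feature_map, rois, output_size=self.pool_size,
                           spatial_scale=1.0 / 16)
        vectors = pooled.flatten(start_dim=1)
        logits = self.fc(vectors)             # S14: (N, 2) two-dimensional vectors
        probs = torch.softmax(logits, dim=1)  # S15: normalization index (softmax) layer
        return probs[:, 1]                    # posterior probability of each RoI
```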
Fig. 6 is a schematic flowchart of target detection performed on a scene image by an exemplary target detection model according to an embodiment of the present disclosure. As shown in Fig. 6, the sub-image A to be detected passes through the convolution layer to obtain a corresponding feature map; the feature map passes through the RPN (not shown in Fig. 6) to obtain a plurality of RoIs; the pooling layer pools each RoI into a corresponding feature vector, the full connection layer converts each feature vector into a corresponding two-dimensional vector, and the normalization index layer normalizes each two-dimensional vector to obtain the posterior probability of each RoI.
In the embodiment of the disclosure, for each sub-image to be detected, all regions of interest of the sub-image and the posterior probability of each region of interest can be obtained by the above method. The electronic device can then determine whether a target object exists in the sub-image by comparing the posterior probability of each region of interest with the preset value, so as to obtain the target detection result, thereby improving target detection efficiency.
Fig. 7 is an optional schematic flowchart of the target detection method provided in the embodiment of the present disclosure. As shown in Fig. 7, S201-S203 may also be included before S102 in Fig. 4; these steps are described below with reference to Fig. 7.
S201, performing target detection on a plurality of scene images by adopting the target detection model to obtain result predicted images.
The electronic device may first train an initial detection model using a preset training set (a cold start data set) that includes a plurality of sample images selected from the data set and belonging to a preset category, to obtain the target detection model (a cold start model), and then perform target detection on a plurality of scene images using the target detection model to obtain a plurality of result predicted images that include prediction annotation data.
S202, acquiring the scene image which corresponds to the result predicted image and has been subjected to manual annotation processing.
Here, after obtaining each result predicted image, the electronic device may acquire the scene image that corresponds to it and has undergone manual annotation processing. The scene image after manual annotation processing may or may not carry a manual annotation. Manual annotation processing means checking and confirming the target areas marked in the result predicted images output by the target detection model: when a marked target area is indeed a target object to be detected, the target object is manually annotated on the corresponding original scene image (yielding a positive sample image); when a marked target area is not a target object to be detected (that is, a falsely detected object), the corresponding original scene image is not annotated (yielding a negative sample image). Result predicted images output without annotation data are checked and confirmed in the same way: when such an image is confirmed to actually contain a target object to be detected, the target object is manually annotated on the corresponding original scene image (yielding a positive sample image); when it is confirmed to contain no target object to be detected, the corresponding original scene image is not annotated (yielding a negative sample image). In this way, the scene images subjected to manual annotation processing are obtained.
In some embodiments, the scene images after manual annotation processing may include: scene images with target annotation data and scene images without target annotation data. For example, Fig. 8A is an exemplary scene image with target annotation data provided by an embodiment of the disclosure. In Fig. 8A, where the target object to be detected is a cat, the scene image with annotation data after manual annotation processing is a scene image in which the position area of the cat is annotated with an annotation frame; a scene image without annotation data is one that contains no cat, for example an image of a dog or of a car. Similarly, Fig. 8B is an exemplary scene image without target annotation data provided by an embodiment of the disclosure. In Fig. 8B, where the target object to be detected is a dog, the scene image without annotation data after manual annotation processing is a scene image containing no dog, for example the image of a cat or a car shown in Fig. 8B; the scene image with annotation data would be one in which the position area of a dog is annotated with an annotation frame. In this way, accurate negative and positive samples for training the target detection model can be obtained, improving the accuracy of the obtained positive and negative samples.
In other embodiments, the scene images after manual annotation processing may include only scene images with target annotation data, so that correctly annotated positive samples are obtained; or only scene images without target annotation data, so that correctly annotated negative samples are obtained.
In some embodiments, because the acquired scene images after manual annotation processing include both scene images with target annotation data and scene images without target annotation data, correct positive samples and correct negative samples can be obtained. When the electronic device trains the target detection model with these correct positive and negative samples, the negative samples optimize the model's suppression of false alarms in real outdoor scenes, while the positive samples help the model quickly adapt to factors such as the size scale or posture of the target object in a scene image, illumination and environment. The updated target detection model therefore achieves higher accuracy in the actual scene, which improves the accuracy of the detection results obtained when the updated target detection model is used for target detection.
In some embodiments of the present disclosure, when the target object to be detected is at least one of a cat and a dog, the scene image with target annotation data is a scene image in which at least one of a cat and a dog is annotated, and the scene image without target annotation data is a scene image that does not contain at least one of the cat and the dog.
S203, training the target detection model with the scene images subjected to manual annotation processing as training samples, to obtain the updated target detection model.
After obtaining the scene images subjected to manual annotation processing, the electronic device may determine the scene images with target annotation data as positive samples and the scene images without target annotation data as negative samples, and train the target detection model with the positive and negative samples as training samples, thereby obtaining the updated target detection model.
In the embodiment of the disclosure, the scene images with target annotation data are determined as positive samples and the scene images without target annotation data as negative samples, and both are used as training samples for the target detection model. During training, the negative samples optimize the model's suppression of false alarms in real outdoor scenes, while the positive samples help the model quickly adapt to factors such as the size scale or posture of the target object in a scene image, illumination and environment, so that the updated target detection model is more accurate in the actual scene and the detection results obtained with it are more accurate.
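A minimal sketch of this positive/negative split, assuming the manual annotation step simply leaves the box list empty for a confirmed negative sample; the class and function names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AnnotatedScene:
    image_path: str
    # Manually verified boxes; an empty list means a confirmed negative sample.
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)

def split_training_samples(scenes: List[AnnotatedScene]):
    """Scenes carrying target annotation data become positive samples;
    scenes carrying none become negative samples."""
    positives = [s for s in scenes if s.boxes]
    negatives = [s for s in scenes if not s.boxes]
    return positives, negatives
```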
In some embodiments, in order to further improve the accuracy of the target detection model, after obtaining the updated target detection model, or while training the target detection model with the manually annotated scene images, the electronic device may continue to perform target detection on newly obtained scene images to obtain new result predicted images, obtain new manually annotated scene images from them, and then train and update the updated target detection model again, repeating until the required target detection model is obtained.
Because sample images selected from the data set differ from scene images of the real scene, a target detection model trained only on sample images selected from the data set may detect poorly in practice. Therefore, the original scene images corresponding to the mined scene images to be manually annotated (that is, the result predicted images) are manually annotated and then used as sample images for training the target detection model, which makes the trained model better suited to the actual scene and improves the accuracy of detection results obtained when performing target detection on small samples.
In some embodiments, prior to S201 or S102, the method provided by embodiments of the present disclosure further includes S21-S22:
S21, capturing video frames from a scene video stream at preset time intervals according to attribute parameters of the video frames, to obtain a preset number of video frames.
The attribute parameters of a video frame may be acquisition parameters, such as day or night and the acquisition time period, or environmental parameters, such as the specific location and scene, which is not limited in this disclosure. The preset time interval may be set according to actual needs, for example 5 seconds or 10 seconds, which is likewise not limited in this disclosure.
S22, determining the preset number of video frames as the plurality of scene images.
The electronic device can capture video frames from the scene video streams of certain monitoring scenes at the preset time interval according to the attribute parameters of the video frames, obtaining the preset number of video frames after a period of time. For example, the electronic device may capture one video frame every 5 seconds and obtain a plurality of video frames after 24 hours, so that target detection can then be performed on the obtained video frames to produce result predicted images.
In the embodiment of the disclosure, this scene image acquisition mode makes it possible to acquire scene images with different attribute parameters, improving the diversity of the acquired scene images.
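As an illustration of S21 and S22, the following OpenCV sketch captures one frame per preset interval from a stream; the default of one frame every 5 seconds capped at 17,280 frames matches the 24-hour example above, and filtering by attribute parameters such as day or night is omitted for brevity:

```python
import cv2

def sample_scene_images(stream_url: str, interval_s: float = 5.0,
                        max_frames: int = 17280):
    """Capture one video frame every `interval_s` seconds from a scene
    video stream until `max_frames` frames (the preset number) are collected."""
    cap = cv2.VideoCapture(stream_url)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if the stream reports 0
    step = max(int(fps * interval_s), 1)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)             # one scene image per interval
        index += 1
    cap.release()
    return frames
```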
In some embodiments of the present disclosure, S201 in fig. 7 may be implemented by S31-S33:
S31, performing target detection on each of the plurality of scene images by using the target detection model, to obtain at least one region of interest of each scene image and the posterior probability corresponding to each region of interest in the at least one region of interest.
The electronic device may perform target detection on each obtained scene image using the target detection model, obtaining for each scene image at least one region of interest and the posterior probability corresponding to each region of interest. It should be noted that the target detection model may be any network model capable of performing target detection, which is not limited in the embodiment of the present disclosure.
S32, when the posterior probability corresponding to any one of the at least one region of interest of a scene image is greater than or equal to a preset value, annotating the position area of the target object in the scene image to obtain a first result predicted image; the result predicted images include: first result predicted images and second result predicted images.
S33, when the posterior probabilities corresponding to all of the at least one region of interest of a scene image are smaller than the preset value, not annotating the position area of the target object in the scene image, to obtain a second result predicted image.
Through the target detection model, the electronic device marks each region of interest whose posterior probability is greater than or equal to the preset value with a bounding box on the corresponding scene image and outputs a first result predicted image carrying the bounding box. For scene images in which the posterior probabilities of all regions of interest are smaller than the preset value, the target detection model does not annotate the scene image and outputs a second result predicted image without any bounding box.
Here, the preset value can be set arbitrarily as required. When the preset value is set lower, hard samples with characteristics such as blur or dim lighting, or untrained new samples, can be obtained; when the preset value is set higher, positive samples of different sizes, postures, illumination conditions and environments can be obtained. The embodiment of the present disclosure does not limit the specific value of the preset value.
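One way to operationalize the preset value's role in mining is to run the same detections against a low and a high threshold; the banded selection and the 0.3/0.9 values below are our illustrative reading of the paragraph above, not values given by the disclosure:

```python
def mine_sample_pools(detections, low: float = 0.3, high: float = 0.9):
    """Route mined detections by posterior probability. Scores in
    [low, high) are ambiguous detections that a low preset value would
    surface (hard samples such as blurred or dimly lit targets, or
    untrained new samples); scores >= high are confident detections that
    a high preset value would keep (diverse positive samples)."""
    hard_or_new = [d for d in detections if low <= d.posterior < high]
    confident_positives = [d for d in detections if d.posterior >= high]
    return hard_or_new, confident_positives
```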
Illustratively, the above S31 may be implemented by S301 to S305:
S301, performing convolution processing on each scene image by using the convolution layer, to obtain a feature map corresponding to each scene image.
The electronic device can perform feature extraction on each scene image through the convolution layer to obtain a corresponding feature map. For example, the convolution layer may consist of two ordinary convolutional layers and two dense convolutional layers, through which the electronic device performs feature extraction on each scene image to obtain its feature map.
S302, performing region-of-interest identification on the feature map by using the region generation network, to obtain at least one region of interest of the feature map.
The electronic device may identify regions of interest on the feature map by using the RPN, obtaining at least one region of interest for each scene image.
S303, performing pooling processing on each of the at least one region of interest by using the pooling layer, to obtain a corresponding feature vector.
The electronic device may perform pooling processing on each region of interest by using the pooling layer to obtain the feature vector of each region of interest; for example, the electronic device may perform average pooling on each region of interest.
S304, converting each feature vector into a corresponding two-dimensional vector by using the full connection layer.
S305, performing normalization processing on each two-dimensional vector by using the normalization index layer, to obtain the posterior probability of each region of interest.
After obtaining the feature vector of each region of interest, the electronic device may use the full connection layer to convert each feature vector into a corresponding two-dimensional vector, and then use the normalization index layer to normalize each two-dimensional vector, thereby obtaining the posterior probability of each region of interest. Here, the posterior probability of each region of interest is a value between 0 and 1: a posterior probability of 0 indicates that the region of interest is not a region where the target object is located, and a posterior probability of 1 indicates that it is.
Illustratively, when the sub-image A to be detected in Fig. 6 is replaced with a scene image A, then as shown in Fig. 6, the scene image A passes through the convolution layer to obtain a corresponding feature map; the feature map passes through the RPN (not shown in Fig. 6) to obtain a plurality of RoIs; the pooling layer pools each RoI into a corresponding feature vector, the full connection layer converts each feature vector into a corresponding two-dimensional vector, and the normalization index layer normalizes each two-dimensional vector to obtain the posterior probability of each RoI.
In some embodiments, S203 may be implemented by S2031:
S2031, when the number of scene images subjected to manual annotation processing is greater than or equal to a preset threshold, training the target detection model with the manually annotated scene images as training samples, to obtain the updated target detection model.
Once a sufficient number of manually annotated scene images have been obtained, the electronic device can use them as training samples and start training the target detection model, so as to obtain the updated target detection model. This improves the efficiency of each round of training, yields a high-precision target detection model, and reduces the number of times the model must be trained.
In some embodiments, Fig. 9 is an alternative schematic flowchart of the target detection method provided by embodiments of the present disclosure; S203 may be implemented by S401-S402, described below with reference to the steps shown in Fig. 9.
S401, adding the scene images subjected to manual annotation processing into the preset training set, to obtain an updated preset training set.
After obtaining the manually annotated scene images, the electronic device can add them into the preset training set to obtain a new preset training set, enriching the number of positive and negative samples in the preset training set and producing a more diverse training set.
S402, training the target detection model according to the updated preset training set to obtain the updated target detection model.
Training the target detection model with the preset training set that has been augmented with manually annotated scene images means that the sample types in the training set are more diverse and include sample images from the actual scene. As a result, the trained target detection model adapts quickly to the actual scene and achieves higher detection precision.
Fig. 10 is a schematic flowchart of an application scenario of the target detection method provided in the embodiment of the present disclosure. As shown in Fig. 10, the cold start link trains an initial detection model with a cold start data set (the preset training set) comprising sample images, selected from an academic data set, that contain cats, dogs, or both, to obtain the target detection model (the cold start model). The data mining link performs frame capture on an urban scene video stream, inputs the captured video frames into the target detection model, and mines the video frames to obtain mining data (result predicted images). The cold start data set updating link acquires the mining data carrying manual annotations (result predicted images subjected to manual annotation processing) and merges it into the cold start data set to obtain an updated cold start data set. The model training and updating link continues to train the target detection model with the updated cold start data set to obtain the updated target detection model. The data mining, cold start data set updating, and model training and updating links are repeated a preset number of times until the required target detection model is obtained.
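The whole Fig. 10 loop can be summarized as a small orchestration sketch; `train`, `detect` and `annotate` are placeholder callables standing in for the cold-start training, mining, and manual annotation links, and are assumptions rather than the disclosure's API:

```python
def iterate_detector(train, detect, annotate, cold_start_set, scene_frames,
                     rounds: int = 3):
    """Cold start, then repeat the mining / data set updating / retraining
    links a preset number of times (`rounds` is illustrative)."""
    dataset = list(cold_start_set)
    model = train(dataset)                   # cold start link
    for _ in range(rounds):
        mined = detect(model, scene_frames)  # data mining link
        labeled = annotate(mined)            # manual annotation processing
        dataset.extend(labeled)              # cold start data set updating link
        model = train(dataset)               # model training and updating link
    return model
```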
Continuing with the exemplary structure in which model training device 455, detection device 555, and object detection device 655 provided by embodiments of the present disclosure are implemented as software modules, in some embodiments, as shown in fig. 2A, the software modules stored in model training device 455 of memory 450 may include: a detection module 4551, a sample acquisition module 4552 and an update module 4553; as shown in fig. 2B, the software modules stored in the detection device 555 of the memory 550 may include: an image acquisition module 5551 and a target detection module 5552; as shown in fig. 3, the software modules stored in the object detection means 655 of the memory 650 may include: an image acquisition module 6551, an object detection module 6552, a detection module 6553, a sample acquisition module 6554, and an update module 6555.
It should be noted that the detection module 4551 and the detection module 6553, the sample acquisition module 4552 and the sample acquisition module 6554, the update module 4553 and the update module 6555, the image acquisition module 5551 and the image acquisition module 6551, and the target detection module 5552 and the target detection module 6552 have the same functions in each pair. The image acquisition module 6551, the object detection module 6552, the detection module 6553, the sample acquisition module 6554, and the update module 6555 are described in detail below:
the image obtaining module 6551 is configured to obtain an image to be detected;
the target detection module 6552 is configured to perform target detection on the image to be detected by using the updated target detection model to obtain a target detection result; the updated target detection model is obtained by training the target detection model by adopting a scene image which corresponds to the result predicted image and is subjected to manual labeling processing; the target detection model is obtained by training an initial detection model based on a preset training set, and the result prediction image is a scene image which is detected by the target detection model and comprises prediction annotation data; the artificial labeling processing is used for verifying the prediction labeling data and correctly labeling a target object in a scene image corresponding to the result prediction image according to a verification result; the preset training set is a plurality of sample images which are selected from the data set and belong to a preset category; the data set is a plurality of images marked with target objects, and the target objects of a preset number of images in the plurality of images belong to different categories.
In some embodiments of the present disclosure, the detecting module 6553 is configured to, before performing target detection on the image to be detected by using the updated target detection model to obtain a target detection result, perform target detection on multiple scene images by using the target detection model to obtain the result predicted image; the sample obtaining module 6554 is configured to obtain the scene image that corresponds to the result predicted image and is subjected to manual annotation processing; the updating module 6555 is configured to train the target detection model by using the scene image subjected to the manual labeling processing as a training sample, so as to obtain the updated target detection model.
In some embodiments of the present disclosure, the updating module 6555 is further configured to add the scene image subjected to manual labeling processing into the preset training set to obtain an updated preset training set; and training the target detection model according to the updated preset training set to obtain the updated target detection model.
In some embodiments of the present disclosure, the scene image subjected to the artificial labeling processing includes: the scene image with the target labeling data and the scene image without the target labeling data.
In some embodiments of the present disclosure, the updating module 6555 is further configured to determine a scene image with target labeling data as a positive sample, and determine a scene image without target labeling data as a negative sample, where the positive sample and the negative sample are the training samples; and training the target detection model according to the training sample to obtain the updated target detection model.
In some embodiments of the present disclosure, the updating module 6555 is further configured to, when the number of the scene images subjected to the manual labeling processing is greater than or equal to a preset threshold, use the scene images subjected to the manual labeling processing as training samples to train the target detection model, so as to obtain an updated target detection model.
In some embodiments of the present disclosure, the detecting module 6553 is further configured to, before the target detection is performed on multiple scene images by using the target detection model to obtain the result predicted image, intercept, according to an attribute parameter of a video frame, video frames in a scene video stream at a preset time interval to obtain a preset number of video frames; determining the preset number of video frames as the plurality of scene images.
In some embodiments of the present disclosure, the scene image with the target labeling data is a scene image of at least one of a cat and a dog; the scene image without the target marking data is the scene image without at least one of the cat and the dog.
In some embodiments of the present disclosure, the image to be detected includes: a plurality of sub-images to be detected; the target object includes: at least one of a cat and a dog; the target detection result comprises: a first result image noting a location area of the at least one of a cat and a dog, or a second result image not noting a location area of the at least one of a cat and a dog; the target detection module 6552 is further configured to perform target detection on each sub-image to be detected by using the updated target detection model, so as to obtain at least one region of interest of each sub-image to be detected and a posterior probability corresponding to each region of interest in the at least one region of interest; under the condition that the posterior probability corresponding to any one of the at least one interested area is larger than or equal to a preset value, marking the position area of at least one of the cat and the dog in each sub-image to be detected to obtain a first result image; and under the condition that the posterior probabilities corresponding to the at least one region of interest are all smaller than the preset value, not marking the position region of at least one of the cat and the dog in each sub-image to be detected to obtain the second result image.
In some embodiments of the present disclosure, the object detection model comprises: the system comprises a convolution layer, a region generation network, a pooling layer, a full-link layer and a normalization index layer; the target detection module 6552 is further configured to perform convolution processing on each sub-image to be detected by using the convolution layer to obtain a feature map corresponding to each sub-image to be detected; adopting the area generation network to identify the interested areas of the feature map to obtain at least one interested area of the feature map; performing pooling treatment on each region of interest in the at least one region of interest by using the pooling layer to obtain corresponding feature vectors; converting the feature vectors into corresponding two-dimensional vectors by adopting the full-connection layer; and carrying out normalization processing on the two-dimensional vector by adopting the normalization index layer to obtain the posterior probability of each interested area.
In some embodiments of the present disclosure, the preset categories include: at least one of a cat and a dog.
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to execute the target detection method of the embodiment of the disclosure.
The disclosed embodiments provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method provided by the disclosed embodiments, for example, the method as illustrated in fig. 4, 5, 7, 9, 10.
In some embodiments, the computer-readable storage medium (the storage medium described above) may be a memory such as an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, by adopting this technical implementation, the sample images used for model training can be enriched; the negative samples optimize the target detection model's suppression of false alarms in real outdoor scenes, and the positive samples help the target detection model quickly adapt to factors such as the size scale or posture of the target object in a scene image, illumination and environment. The target detection model trained with correctly manually annotated scene images therefore adapts quickly to the scene in practical applications, and the trained target detection model has higher precision.
According to the technical scheme, hard samples that the model finds difficult to detect and identify, together with other types of samples, are mined from the existing academic data set and the actual scene video stream and merged into the cold start data set for training the detection model. On one hand, a detection model can be obtained by fast training on data mined from the academic data set (sample images belonging to the preset category selected from the academic data set); on the other hand, a detection model adapted to the actual scene can be iterated quickly using the data mined from the actual scene video stream.
According to the technical scheme, training samples (cold start data) of the target categories are obtained directly from the existing annotation data in the academic data set, so the cold start model can be obtained quickly without collecting a large number of cold start samples, reducing the cost consumed in collecting such samples.
According to the technical scheme, potential high-value samples that help improve the model (such as hard samples and new samples that the model finds difficult to detect and identify, and other types of samples) can be mined from a huge number of unannotated scene images using an active learning method. The performance of the model can be effectively improved under limited annotation and computing resources, greatly saving the manpower and computing cost required to apply a deep learning model to new services.
According to the technical scheme, a potential target detection model for intelligent video analysis or intelligent scenes can be improved quickly and iteratively online, so that the required model detection precision can be reached with low labor and computing cost, and the model performance can be continuously improved thereafter.
The above description is only an example of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present disclosure are included in the protection scope of the present disclosure.

Claims (13)

1. A method of object detection, comprising:
acquiring an image to be detected;
carrying out target detection on the image to be detected by adopting the updated target detection model to obtain a target detection result; the updated target detection model is obtained by training the target detection model by adopting a scene image which corresponds to the result predicted image and is subjected to manual labeling processing;
the target detection model is obtained by training an initial detection model based on a preset training set, and the result prediction image is a scene image which is detected by the target detection model and comprises prediction annotation data; the artificial labeling processing is used for verifying the prediction labeling data and correctly labeling a target object in a scene image corresponding to the result prediction image according to a verification result; the preset training set is a plurality of sample images which are selected from the data set and belong to a preset category; the data set is a plurality of images marked with target objects, and the target objects of a preset number of images in the plurality of images belong to different categories.
2. The target detection method according to claim 1, wherein before the target detection is performed on the image to be detected by using the updated target detection model to obtain the target detection result, the method further comprises:
performing target detection on a plurality of scene images by adopting the target detection model to obtain a result predicted image;
acquiring the scene image which corresponds to the result predicted image and is subjected to manual labeling processing;
and taking the scene image subjected to the manual labeling processing as a training sample, and training the target detection model to obtain the updated target detection model.
3. The object detection method according to claim 1 or 2, wherein the scene image subjected to the manual annotation processing comprises: the scene image with the target labeling data and the scene image without the target labeling data.
4. The method according to claim 2, wherein training the target detection model with the scene image subjected to the manual labeling processing as a training sample to obtain the updated target detection model comprises:
determining a scene image with target labeling data as a positive sample, and determining a scene image without target labeling data as a negative sample, wherein the positive sample and the negative sample are the training samples;
and training the target detection model according to the training sample to obtain the updated target detection model.
5. The method according to claim 2, wherein training the target detection model with the scene image subjected to the manual labeling processing as a training sample to obtain the updated target detection model comprises:
and under the condition that the number of the scene images subjected to the artificial labeling processing is greater than or equal to a preset threshold value, taking the scene images subjected to the artificial labeling processing as training samples, and training the target detection model to obtain the updated target detection model.
6. The method according to claim 2, wherein before said employing the object detection model to perform object detection on a plurality of scene images to obtain the result predicted image, the method further comprises:
intercepting video frames in a scene video stream at preset time intervals according to attribute parameters of the video frames to obtain a preset number of video frames;
determining the preset number of video frames as the plurality of scene images.
7. The object detection method according to claim 3 or 4, wherein the scene image with the target labeling data is a scene image in which at least one of a cat and a dog is labeled; and the scene image without the target labeling data is a scene image that does not contain at least one of the cat and the dog.
8. The object detection method of claim 1, wherein the preset categories include: at least one of a cat and a dog.
9. The object detection method according to any one of claims 1 to 8, wherein the image to be detected includes: a plurality of sub-images to be detected; the target object includes: at least one of a cat and a dog; the target detection result comprises: a first result image noting a location area of the at least one of a cat and a dog, or a second result image not noting a location area of the at least one of a cat and a dog;
wherein the performing target detection on the image to be detected by adopting the updated target detection model to obtain a target detection result comprises:
performing target detection on each sub-image to be detected by adopting the updated target detection model to obtain at least one region of interest of each sub-image to be detected and posterior probability corresponding to each region of interest in the at least one region of interest;
under the condition that the posterior probability corresponding to any one of the at least one interested area is greater than or equal to a preset value, marking the position area of at least one of the cat and the dog in each sub-image to be detected to obtain a first result image;
and under the condition that the posterior probabilities corresponding to the at least one region of interest are all smaller than the preset value, not marking the position region of at least one of the cat and the dog in each sub-image to be detected to obtain the second result image.
10. The object detection method according to claim 9, wherein the object detection model comprises: a convolution layer, a region generation network, a pooling layer, a full connection layer and a normalization index layer;
the performing target detection on each sub-image to be detected by using the updated target detection model to obtain at least one region of interest of each sub-image to be detected and posterior probability corresponding to each region of interest in the at least one region of interest, includes:
performing convolution processing on each sub-image to be detected by using the convolution layer to obtain a characteristic diagram corresponding to each sub-image to be detected;
adopting the area generation network to identify the interested areas of the feature map to obtain at least one interested area of the feature map;
performing pooling treatment on each region of interest in the at least one region of interest by using the pooling layer to obtain corresponding feature vectors;
converting the feature vectors into corresponding two-dimensional vectors by adopting the full-connection layer;
and carrying out normalization processing on the two-dimensional vector by adopting the normalization index layer to obtain the posterior probability of each interested area.
11. An object detection device, comprising:
the image acquisition module is used for acquiring an image to be detected;
the target detection module is used for carrying out target detection on the image to be detected by adopting the updated target detection model to obtain a target detection result; the updated target detection model is obtained by training the target detection model by adopting a scene image which corresponds to the result predicted image and is subjected to manual labeling processing;
the target detection model is obtained by training an initial detection model based on a preset training set, and the result prediction image is a scene image which is detected by the target detection model and comprises prediction annotation data; the artificial labeling processing is used for verifying the prediction labeling data and correctly labeling a target object in a scene image corresponding to the result prediction image according to a verification result; the preset training set is a plurality of sample images which are selected from the data set and belong to a preset category; the data set is a plurality of images marked with target objects, and the target objects of a preset number of images in the plurality of images belong to different categories.
12. An electronic device, comprising:
a memory for storing an executable computer program;
a processor for implementing the method of any one of claims 1 to 10 when executing an executable computer program stored in the memory.
13. A computer-readable storage medium, in which a computer program is stored for causing a processor, when executed, to carry out the method of any one of claims 1 to 10.
CN202110462394.2A 2021-04-27 2021-04-27 Target detection method and device, electronic equipment and storage medium Pending CN113095434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110462394.2A CN113095434A (en) 2021-04-27 2021-04-27 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110462394.2A CN113095434A (en) 2021-04-27 2021-04-27 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113095434A true CN113095434A (en) 2021-07-09

Family

ID=76680511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110462394.2A Pending CN113095434A (en) 2021-04-27 2021-04-27 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113095434A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070072A (en) * 2019-05-05 2019-07-30 厦门美图之家科技有限公司 A method of generating object detection model
CN110378420A (en) * 2019-07-19 2019-10-25 Oppo广东移动通信有限公司 A kind of image detecting method, device and computer readable storage medium
WO2021017261A1 (en) * 2019-08-01 2021-02-04 平安科技(深圳)有限公司 Recognition model training method and apparatus, image recognition method and apparatus, and device and medium
KR20210042275A (en) * 2020-05-27 2021-04-19 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. A method and a device for detecting small target

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙彦; 丁学文; 雷雨婷; 陈静; 孔祥鑫: "Cat and dog image recognition based on the SSD_MobileNet_v1 network" (基于SSD_MobileNet_v1网络的猫狗图像识别), Journal of Tianjin University of Technology and Education, no. 01 *
王建林; 付雪松; 黄展超; 郭永奇; 王汝童; 赵利强: "Multi-type cooperative target detection using an improved YOLOv2 convolutional neural network" (改进YOLOv2卷积神经网络的多类型合作目标检测), Optics and Precision Engineering, no. 01 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672252A (en) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 Model upgrading method, video monitoring system, electronic equipment and readable storage medium
CN113561181A (en) * 2021-08-04 2021-10-29 北京京东乾石科技有限公司 Target detection model updating method, device and system
CN113561181B (en) * 2021-08-04 2023-01-31 北京京东乾石科技有限公司 Target detection model updating method, device and system
CN114373075A (en) * 2021-12-31 2022-04-19 西安电子科技大学广州研究院 Target component detection data set construction method, detection method, device and equipment
WO2023159527A1 (en) * 2022-02-25 2023-08-31 京东方科技集团股份有限公司 Detector training method and apparatus, and storage medium
CN115223028A (en) * 2022-06-02 2022-10-21 支付宝(杭州)信息技术有限公司 Scene reconstruction and model training method, device, equipment, medium and program product
CN115223028B (en) * 2022-06-02 2024-03-29 支付宝(杭州)信息技术有限公司 Scene reconstruction and model training method, device, equipment, medium and program product
CN115294505A (en) * 2022-10-09 2022-11-04 平安银行股份有限公司 Risk object detection and model training method and device and electronic equipment
CN115294505B (en) * 2022-10-09 2023-06-20 平安银行股份有限公司 Risk object detection and training method and device for model thereof and electronic equipment
CN116012674B (en) * 2023-01-30 2023-07-18 人民网股份有限公司 Flag target detection method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination