CN111767947A - Target detection model, application method and related device - Google Patents


Info

Publication number
CN111767947A
Authority
CN
China
Prior art keywords
module
feature
target detection
unit
model
Prior art date
Legal status
Pending
Application number
CN202010571484.0A
Other languages
Chinese (zh)
Inventor
尚太章
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010571484.0A
Publication of CN111767947A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a target detection model, an application method and a related device. The model comprises a feature extraction module, a feature fusion module and a target detection module, wherein the feature extraction module is connected with the feature fusion module and the target detection module; the feature extraction module is used for extracting feature maps of the original image at multiple scales; the feature fusion module is used for performing feature fusion on at least three feature maps among the feature maps of multiple scales, wherein the at least three feature maps include feature maps of the same or different scales; and the target detection module is used for obtaining the target category and the target position of the target to be detected in the original image according to the feature maps that are not subjected to feature fusion and the feature map after feature fusion. The embodiment of the application improves target detection accuracy without excessively increasing the complexity of the model, and is particularly suitable for small target detection with a lightweight target detection model.

Description

Target detection model, application method and related device
Technical Field
The application relates to the technical field of image processing, in particular to a target detection model, an application method and a related device.
Background
With the development of artificial intelligence technology, target detection is widely applied in scenarios such as automatic driving, pedestrian detection and license plate recognition, and on intelligent terminals such as mobile phones and AR glasses. In particular, a large number of intelligent algorithms are now integrated into intelligent terminals, further improving terminal intelligence.
With the recent emergence and explosive development of deep learning, many researchers are focusing on making models lightweight and compact so as to improve their practical value on terminal devices. Existing target detection algorithms mainly include two-stage target detection algorithms; the two-stage methods are relatively accurate, but their time complexity is high, which makes them difficult to deploy on terminal devices.
Disclosure of Invention
The embodiment of the application provides a target detection model, an application method and a related device, which aim to improve target detection accuracy without excessively increasing the complexity of the model and are particularly suitable for small target detection with a lightweight target detection model.
In a first aspect, an embodiment of the present application provides a target detection model, which includes a feature extraction module, a feature fusion module, and a target detection module, where the feature extraction module is connected to the feature fusion module and the target detection module, and the feature fusion module is connected to the target detection module;
the feature extraction module is used for extracting feature maps of the original image at multiple scales;
the feature fusion module is used for performing feature fusion on at least three feature maps in the feature maps with multiple scales, wherein the at least three feature maps comprise feature maps with the same and different scales;
and the target detection module is used for obtaining the target category and the target position of the target to be detected in the original image according to the feature map which is not subjected to feature fusion in the plurality of feature maps and the feature map after feature fusion.
In a second aspect, an embodiment of the present application provides a method for training an object detection model, which is applied to the model according to the first aspect, and the method includes:
setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model;
keeping the module parameters of the feature extraction module unchanged in a first stage, and training the target detection model to update the module parameters of the feature fusion module and the target detection module;
and training the target detection model in a second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module and the target detection module to obtain the trained target detection model.
In a third aspect, an embodiment of the present application provides a method for configuring an object detection model, which is applied to the model according to the second aspect, and the method includes:
the server acquires a trained target detection model;
the server generates a model configuration file according to the target detection model;
and the server sends the model configuration file to a terminal to configure the target detection model.
In a fourth aspect, an embodiment of the present application provides an object detection method, which is applied to a terminal configured with the object detection model according to the second aspect, and the method includes:
acquiring an original image;
and processing the original image by using the target detection model to obtain the target type and the target position of the target to be detected in the original image.
In a fifth aspect, an embodiment of the present application provides a training apparatus for an object detection model, which is applied to the model according to the first aspect, and includes a setting unit, a first training unit, and a second training unit,
the setting unit is used for setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model;
the first training unit is used for keeping the module parameters of the feature extraction module unchanged in a first stage and training the target detection model to update the module parameters of the feature fusion module and the target detection module;
the second training unit is configured to train the target detection model in a second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module, and the target detection module, so as to obtain the trained target detection model.
In a sixth aspect, an embodiment of the present application provides an apparatus for configuring an object detection model, which is applied to the model according to the second aspect, and includes an obtaining unit, a configuring unit, and a sending unit,
the acquisition unit is used for acquiring a trained target detection model;
the configuration unit is used for generating a model configuration file according to the target detection model;
and the sending unit is used for sending the model configuration file to a terminal so as to configure the target detection model.
In a seventh aspect, an embodiment of the present application provides an object detection apparatus, which is applied to a terminal configured with the model as described in the second aspect, and includes an obtaining unit and a using unit,
the acquisition unit is used for acquiring an original image;
and the using unit is used for processing the original image by using the target detection model to obtain the target type and the target position of the target to be detected in the original image.
In an eighth aspect, an embodiment of the present application provides a server, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in any of the methods of the second aspect or the third aspect of the embodiment of the present application.
In a ninth aspect, an embodiment of the present application provides a terminal, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in any of the methods in the fourth aspect of the embodiments of the present application.
In a tenth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform some or all of the steps described in any one of the methods in the second aspect to the fourth aspect of the embodiments of the present application.
In an eleventh aspect, the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps of any one of the methods in the second aspect to the fourth aspect of the present application. The computer program product may be a software installation package.
It can be seen that, in the embodiment of the present application, the target detection model includes a feature extraction module, a feature fusion module and a target detection module, the feature extraction module is connected to the feature fusion module and the target detection module, and the feature fusion module is connected to the target detection module; the feature extraction module is used for extracting feature maps of the original image at multiple scales; the feature fusion module is used for performing feature fusion on at least three feature maps among the feature maps of multiple scales, wherein the at least three feature maps include feature maps of the same or different scales; and the target detection module is used for obtaining the target category and the target position of the target to be detected in the original image according to the feature maps that are not subjected to feature fusion and the feature map after feature fusion. Therefore, the target detection model can perform feature fusion on feature maps of the same scale and of different scales at the same time, integrating part of the high-level and low-level semantic information, which improves target detection accuracy without excessively increasing the complexity of the model and is particularly suitable for small target detection with a lightweight target detection model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1A is a schematic diagram of a framework structure of an object detection algorithm provided in an embodiment of the present application;
FIG. 1B is a system architecture 100 for model training and application provided by embodiments of the present application;
fig. 1C is a block diagram of a training device 120 or a terminal 200 according to an embodiment of the present disclosure;
FIG. 1D is a block diagram of a chip hardware configuration according to an embodiment of the present disclosure;
fig. 1E is a schematic architecture diagram of a terminal provided with an Android system according to an embodiment of the present application;
fig. 2A is a schematic structural diagram of a target detection model provided in an embodiment of the present application;
FIG. 2B is a schematic structural diagram of another object detection model provided in the embodiments of the present application;
fig. 2C is a schematic structural diagram of a first feature fusion module according to an embodiment of the present application;
fig. 2D is a schematic structural diagram of a second feature fusion module provided in the embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for training a target detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a configuration method of an object detection model according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of a target detection method provided in an embodiment of the present application;
FIG. 6 is a block diagram illustrating functional units of a training apparatus for a target detection model according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of another training apparatus for an object detection model according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating functional units of an apparatus for configuring a target detection model according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an apparatus for configuring an object detection model according to an embodiment of the present disclosure;
FIG. 10 is a block diagram illustrating functional units of an object detection apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to better understand the scheme of the embodiment of the present application, the following first introduces the related terms and concepts of the neural network that the embodiment of the present application may relate to.
(1)Neural network
Neural Networks (NN) are complex network systems formed by a large number of simple processing units (called neurons) widely interconnected, reflect many basic features of human brain functions, and are highly complex nonlinear dynamical learning systems.
(2) SSD target detection algorithm
The R-CNN algorithm divides the target detection task into two steps: candidate region generation and region classification. The SSD algorithm follows an end-to-end idea and does not use these two steps; instead, it borrows the anchor (Anchor) idea of Faster R-CNN, uses convolution layers to generate a large number of preset candidate regions (which are not extracted separately but handled by regression), and then predicts bounding-box positions and classes simultaneously with a single convolutional neural network, the key point being dense sampling. The framework is shown in Fig. 1A, where Image represents the original image, Classifier Layer (one/scale) represents a classifier layer for a single scale, Detections represents the detection outputs, and boxes represents the prediction boxes.
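For illustration only, the following minimal sketch shows the dense prediction idea just described, assuming a PyTorch-style implementation (the patent does not specify a framework); the layer sizes, anchor count and class count are arbitrary examples rather than part of the disclosure.

```python
# Illustrative sketch only: a small convolution applied to a feature map densely
# predicts class scores and box offsets for k default boxes ("anchors") at every
# spatial location, with no separate proposal-extraction step.
import torch
import torch.nn as nn


class SSDHead(nn.Module):
    def __init__(self, in_channels: int, num_anchors: int, num_classes: int):
        super().__init__()
        # One 3x3 convolution per task: classification and box regression.
        self.cls_conv = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.reg_conv = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, feature_map: torch.Tensor):
        cls = self.cls_conv(feature_map)  # [N, k*C, H, W] class scores per default box
        reg = self.reg_conv(feature_map)  # [N, k*4, H, W] offsets relative to default boxes
        return cls, reg


# Example: a 19x19x512 feature map, 6 default boxes per cell, 21 classes.
head = SSDHead(512, num_anchors=6, num_classes=21)
scores, offsets = head(torch.randn(1, 512, 19, 19))
```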
A target detection algorithm differs from a classification algorithm in that it must not only identify the class of the target but also detect the position of the target in the picture, so it is more difficult than classification. At present, the detection of small targets is increasingly important in practical applications, and the use scenario of augmented reality (AR) glasses in particular places extremely high requirements on small target detection. When a smartphone is used as the intelligent terminal to perform target detection, the user can move the phone closer to the target, which enlarges the target in the picture and largely sidesteps the problem of small target detection. When AR glasses are used, however, if the target is far away, the wearer cannot simply walk up to the target to make detection and recognition easier; and if the user does not approach the object, the small target is difficult to detect, which greatly degrades the experience of AR glasses users. Therefore, compared with a mobile phone, the AR glasses use scenario imposes extremely strict requirements on small target detection.
The detection of small objects has always been a difficult problem in the field of computer vision. In the past, researchers used morphological methods, wavelet transforms and the like to perform small object detection. With the rapid development of deep learning, different model architectures such as Alexnet, VGG, Inception and mobileNet have emerged one after another. Deep learning now dominates the field of computer vision, and many target detection algorithms have appeared and quickly become the main methods in the field. In general, target detection algorithms can be divided into two categories: two-stage target detection algorithms and single-stage (one-stage) target detection algorithms. The two-stage algorithms are mainly represented by the Region-based Convolutional Neural Network (RCNN) series, including RCNN, Fast-RCNN, R-FCN, Mask-RCNN and the like. The one-stage methods mainly include the Single Shot multibox Detector (SSD) series and the YOLO series. The two-stage methods are relatively accurate, but their time complexity is high and they are difficult to deploy on terminal devices. Although the accuracy of the one-stage methods is not as high as that of the two-stage methods, their time complexity is low and they are highly practical, so they can be deployed on intelligent terminal devices.
Existing deep learning schemes for small target detection mainly take several approaches. First, the relative size of a small object can be increased by increasing the resolution of the input picture, thereby improving small object detection. Second, a larger and deeper feature extraction network can be used to extract more effective feature information and improve the representational power of the features, so that small target detection is improved by optimizing the features. Third, low-level feature information can be used to preserve the precision and recall of small target detection as much as possible. Although higher-level features represent the higher-level semantic information of the picture, the information of a small target is likely to be lost in the higher layers, so using the low-level feature information can improve small target detection to a certain extent.
However, each of these approaches has problems. First, increasing the resolution of the network model's input picture means increasing the model's parameters; for example, if the input picture goes from 300 × 300 to 600 × 600, the parameters increase correspondingly by a factor of four (600 × 600 / (300 × 300) = 4), so the model size increases four times and the target detection speed decreases correspondingly by four times, which can make real-time performance poor in practical applications. Second, using a larger and deeper feature extraction network also means a larger model and higher computational complexity, which increases latency when the model is actually deployed on an intelligent terminal and again hurts real-time performance. Third, although the low-level features should in theory contain the feature information of small targets, the low-level features of a model generally contain little picture semantic information, so it is difficult to obtain the position and class of a small target directly and accurately from the low-level features.
Therefore, a network model with a lightweight architecture and low time complexity, suitable for deployment on mobile intelligent terminals and with a good detection effect on small targets, is urgently needed. Since a common problem of two-stage target detection algorithms is that their time complexity is too high for real-time deployment on mobile terminals, this application proposes a method for improving small target detection based on a one-stage target detection scheme.
The software and hardware operating environment of the technical scheme disclosed by the application is introduced as follows.
As shown in fig. 1B, the present application provides a system architecture 100 for model training and application, including an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection device 160. Wherein the data acquisition device 160 is used to acquire training data. For the target detection method of the embodiment of the application, the training data may include a training image and a target image, where the target image may be a manual pre-cropping image, and the training data may also be derived from a dedicated training data set in the field, such as an ImageNet data set.
After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target models/rules 101, such as the target detection models described herein, based on the training data maintained in database 130. The target model/rule 101 can be used to implement the target detection method of the embodiment of the application, that is, the target position and target category of the target to be detected can be obtained by inputting the original image, after relevant preprocessing, into the target model/rule 101. The target model/rule 101 in embodiments of the present application may specifically be a neural network model, such as a mobileNet-based model. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data acquisition device 160 and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 based only on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1B, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, and may also be a server or a cloud device. In fig. 1B, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include: raw images input by the client device.
The preprocessing module 113 and the preprocessing module 114 are configured to perform preprocessing according to input data (such as an original image) received by the I/O interface 112, and in this embodiment of the application, the preprocessing module 113 and the preprocessing module 114 may not be provided (or only one of the preprocessing modules may be provided), and the computing module 111 may be directly used to process the input data.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the classification result of the original image obtained as described above, to the client device 140, thereby providing it to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results. In practical applications, the training device 120 may be a server or a terminal, and is not limited herein.
In the case shown in fig. 1B, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1B is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1B, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110. Client device 140 may also be integrated with execution device 110 as a single device.
As shown in fig. 1B, the target model/rule 101 is obtained by training with the training device 120. In this embodiment of the application, the target model/rule 101 may be the neural network model of the present application; specifically, the neural network provided in this embodiment may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.
As shown in fig. 1C, the embodiment of the present application provides a block diagram of a structure of the training device 120 or the terminal 200. The terminal 200 may be a communication-capable electronic device, which may include various handheld devices having wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, as well as various forms of User Equipment (UE), Mobile Station (MS), terminal Equipment (terminal device), and so on. The terminal 200 in the present application may include one or more of the following components: a processor 210, a memory 220, and an input-output device 230.
Processor 210 may include one or more processing cores. The processor 210 connects various parts within the overall terminal 200 using various interfaces and lines, performs various functions of the terminal 200 and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 220, and calling data stored in the memory 220. Processor 210 may include one or more processing units, such as: the processor 210 may include a Central Processing Unit (CPU), an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The controller may be, among other things, the neural center and the command center of the terminal 200. The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the terminal 200, for example: image beautification, image recognition, face recognition, voice recognition, text understanding, and the like.
It is to be understood that the processor 210 may be mapped to a System On Chip (SOC) in an actual product, and the processing unit and/or the interface may not be integrated into the processor 210, and the corresponding functions may be implemented by a communication Chip or an electronic component alone. The above-described interfacing relationship between the modules is merely illustrative, and does not constitute a unique limitation on the structure of the terminal 200.
The Memory 220 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 220 includes a non-transitory computer-readable medium. The memory 220 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 220 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like, and the operating system may be an Android (Android) system (including a system based on Android system depth development), an IOS system developed by apple inc (including a system based on IOS system depth development), or other systems. The storage data area may also store data created by the terminal 200 in use, such as raw images, phone books, audio-video data, chat log data, and the like.
As shown in fig. 1D, the embodiment of the present application provides a hardware structure of a chip, which includes a neural network processor 30. The chip may be disposed in the execution device 110 or the training device 120 shown in fig. 1B to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1B to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for the various layers of the object detection model described in this application can be implemented in a chip as shown in FIG. 1D.
The neural network processor NPU 30 is mounted as a coprocessor on the main CPU (host CPU), which allocates tasks. The core part of the NPU is the arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data from a memory (the weight memory or the input memory) and perform operations. The arithmetic circuit 303 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the data corresponding to the matrix B from the weight memory 302 and buffers the data on each PE in the arithmetic circuit 303. The arithmetic circuit 303 takes the matrix a data from the input memory 301 and performs matrix arithmetic with the matrix B, and a partial result or a final result of the obtained matrix is stored in an accumulator (accumulator) 308.
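As a toy illustration of the accumulate-as-you-go matrix product just described (plain Python, not the NPU's actual implementation), partial products of A and the buffered weight matrix B are summed into an accumulator to form C:

```python
# Toy sketch: each output element of C = A x B is built up by accumulating
# partial products, mirroring how partial results are kept in the accumulator.
def matmul_accumulate(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]      # accumulator initialised to zero
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                acc += A[i][t] * B[t][j]   # partial results accumulate
            C[i][j] = acc
    return C
```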
The vector calculation unit 307 may further process the output of the operation circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculation of a non-convolution/non-FC layer in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
The vector calculation unit 307 stores the vector of the processed output to the unified memory 306. For example, the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both.
The unified memory 306 is used to store input data as well as output data.
A direct memory access controller (DMAC) 305 is used to transfer input data in the external memory to the input memory 301 and/or the unified memory 306, to store weight data from the external memory into the weight memory 302, and to store data in the unified memory 306 into the external memory.
A bus interface unit (BIU) 310 is used to implement interaction between the main CPU, the DMAC and the instruction fetch memory 309 through a bus.
The instruction fetch memory 309 is connected to the controller 304 and is used to store instructions used by the controller 304.
The controller 304 is configured to invoke the instructions cached in the instruction fetch memory 309 to control the working process of the operation accelerator.
Generally, the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all on-chip memories, and the external memory is a memory outside the NPU; the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or another readable and writable memory.
The operations of the target detection model layers described in this application may be performed by the operation circuit 303 or the vector calculation unit 307.
The software system of the terminal 200 may adopt any one of a hierarchical architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application exemplifies a software architecture of the terminal 200 by taking an Android system with a layered architecture as an example.
As shown in fig. 1E, according to the architecture diagram of the terminal with the Android system provided in the embodiment of the present application, a Linux kernel layer 420, a system runtime library layer 440, an application framework layer 460, and an application layer 480 may be stored in the memory 220, where the layers communicate with each other through a software interface, and the Linux kernel layer 420, the system runtime library layer 440, and the application framework layer 460 belong to an operating system space.
The application layer 480 belongs to a user space, at least one application program runs in the application layer 480, and the application programs may be native application programs carried by an operating system, or third-party application programs developed by third-party developers, and specifically may include an object detection function (for executing the object detection method described in the present application), a gallery, a calendar, a call, a map, navigation, WLAN, bluetooth, music, a video, a short message, and other application programs.
The application framework layer 460 provides various APIs that may be used by applications that build the application layer, and developers can also build their own applications by using these APIs, such as a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a message manager, an activity manager, a package manager, and a location manager.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide a communication function of the terminal 200. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The system runtime library layer 440 provides the main feature support for the Android system through some C/C++ libraries. For example, the SQLite library provides support for a database, the OpenGL/ES library provides support for 3D drawing, the Webkit library provides support for a browser kernel, and the like. Also provided in the system Runtime layer 440 is an Android Runtime library (Android Runtime), which mainly provides some core libraries that can allow developers to write Android applications using the Java language.
The Linux kernel layer 420 provides underlying drivers for various hardware of the terminal 200, such as a display driver, an audio driver, a camera driver, a bluetooth driver, a Wi-Fi driver, power management, and the like.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of a target detection model provided in the present embodiment, and is applied to the execution device 110 or the training device 120, as shown in the figure, the model includes a feature extraction module 510, a feature fusion module 520, and a target detection module 530, the feature extraction module 510 is connected to the feature fusion module 520 and the target detection module 530, and the feature fusion module 520 is connected to the target detection module 530;
the feature extraction module 510 is configured to extract feature maps of multiple scales of an original image.
The original image comprises a target to be detected, the target to be detected can be any object, such as a human face, a pet, various articles and the like, and is specifically restricted by a detection category defined by a target detection function, and the detection category is not limited uniquely here.
The scale refers to the image resolution; for example, the scale of the original image may be 300 × 300 pixels.
In a specific implementation, the feature extraction module 510 may extract feature maps of different scales through convolutional layers of different scales, where the feature extraction principle of the convolutional layers is a common technique in the field of image processing, and is not described herein again.
The feature fusion module 520 is configured to perform feature fusion on at least three feature maps in the feature maps with multiple scales, where the at least three feature maps include feature maps with the same or different scales.
The target detection module 530 is configured to obtain a target category and a target position of the target to be detected in the original image according to the feature map without feature fusion and the feature map after feature fusion in the plurality of feature maps.
It can be seen that in the embodiment of the application, the target detection model includes a feature extraction module, a feature fusion module and a target detection module, the feature extraction module is connected to the feature fusion module and the target detection module, and the feature fusion module is connected to the target detection module, wherein the feature extraction module is used for extracting feature maps of multiple scales of the original image, the feature fusion module is used for performing feature fusion on at least three feature maps in the feature maps of the multiple scales, and the target detection module is used for obtaining the target category and the target position of the target to be detected in the original image according to the feature maps which are not subjected to feature fusion in the multiple feature maps and the feature maps which are subjected to feature fusion. Because the at least three feature maps comprise feature maps with the same scale and different scales, the target detection model can perform feature fusion on the feature maps with the same scale and different scales at the same time, integrates part of high-level and low-level semantic information, improves the target detection accuracy on the basis of not excessively increasing the complexity of the model, and is particularly suitable for small target detection of a light-weight target detection model.
In one possible example, as shown in fig. 2B, the at least three feature maps include a first feature map corresponding to a first convolutional layer 511, a second feature map corresponding to a second convolutional layer 512, and a third feature map corresponding to a third convolutional layer 513, the first convolutional layer 511 and the second convolutional layer 512 have the same dimension, and the third convolutional layer 513 has a dimension larger than that of the second convolutional layer 512;
the feature fusion module 520 includes a first feature fusion module 521 and a second feature fusion module 522;
the first convolution layer 511 is connected to the target detection module 530, the first convolution layer 511 and the second convolution layer 512 are connected to the first feature fusion module 521, the first feature fusion module 521 and the third convolution layer 513 are connected to the second feature fusion module 522, and the second feature fusion module 522 is connected to the target detection module 530;
the first feature fusion module 521 is configured to perform first feature fusion processing on the first feature map and the second feature map to obtain a first fusion feature map;
the second feature fusion module 522 is configured to perform a second feature fusion process on the first fused feature map and the third feature map to obtain a second fused feature map;
the target detection module 530 is configured to obtain a target category and a target position of a target to be detected in the original image according to the first feature map, the second fused feature map, and a feature map, other than the first feature map, in a feature map without feature fusion.
The first feature fusion processing is feature fusion processing for feature maps of the same size, and the second feature fusion processing is feature fusion processing for feature maps of different sizes.
The data dimension of the first feature map may be 10 × 10 × 1024 (corresponding to the 10 × 10 scale of the first convolution layer 511), the data dimension of the second feature map may be 10 × 10 × 1024 (corresponding to the 10 × 10 scale of the second convolution layer 512), and the data dimension of the third feature map may be 19 × 19 × 512 (corresponding to the 19 × 19 scale of the third convolution layer 513).
In addition, the feature extraction module 510 may include, besides the first convolution layer 511, the second convolution layer 512 and the third convolution layer 513, other convolution layers, such as the convolution layers conv0 to conv10 of mobileNet before the third convolution layer 513 and gradually decreasing convolution layers after the first convolution layer 511.
As can be seen, in this example, the feature fusion module 520 can apply different feature fusion processes to feature maps of the same scale and feature maps of different scales, thereby ensuring the accuracy of image processing.
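For illustration only, the wiring of Fig. 2B can be sketched as follows; the map names, channel counts and the backbone/fusion/detector callables are assumptions used for readability and are not part of the disclosure.

```python
# Minimal wiring sketch of the structure in Fig. 2B. backbone() is assumed to
# return the multi-scale feature maps keyed by layer; fusion1/fusion2/detector
# stand for the first feature fusion module 521, the second feature fusion
# module 522 and the target detection module 530.
def detect(image, backbone, fusion1, fusion2, detector):
    feats = backbone(image)              # feature maps at several scales
    f1 = feats["first_conv_511"]         # e.g. 10x10x1024
    f2 = feats["second_conv_512"]        # e.g. 10x10x1024, same scale as f1
    f3 = feats["third_conv_513"]         # e.g. 19x19x512, larger scale
    fused_1 = fusion1(f1, f2)            # same-scale fusion -> first fused feature map
    fused_2 = fusion2(fused_1, f3)       # cross-scale fusion -> second fused feature map
    # Detection uses the first feature map, the second fused feature map and the
    # remaining feature maps that were not fused.
    unfused = [v for k, v in feats.items() if k not in ("second_conv_512", "third_conv_513")]
    return detector(unfused + [fused_2])
```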
In one possible example, as shown in fig. 2C, the first feature fusion module 521 includes a first convolution unit 5210, a second convolution unit 5211, a first batch regularization unit 5212, a second batch regularization unit 5213, a first dot multiplication unit 5214, a first activation function unit 5215;
the first convolution unit 5210 is connected to the first batch regularization unit 5212, the second convolution unit 5211 is connected to the second batch regularization unit 5213, the first batch regularization unit 5212 and the second batch regularization unit 5213 are connected to the first point multiplication unit 5214, and the first point multiplication unit 5214 is connected to the first activation function unit 5215;
the first convolution unit 5210 processes the first feature map, the second convolution unit 5211 processes the second feature map, and the first activation function unit 5215 outputs the first fused feature map.
In a specific implementation, the first convolution unit 5210 and the second convolution unit 5211 may be conventional convolution operation layers. The processed data dimension of the first convolution unit 5210 may be 1 × 512, and the processed data dimension of the second convolution unit 5211 may be 1 × 512.
As can be seen, in this example, the first feature fusion module adopts an up-sampling and element-wise multiplication manner, which is equivalent to a spatial attention (spatial attention) model, and can use the higher-layer information to supervise the lower layer, thereby improving the feature fusion effect and the model performance.
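A minimal sketch of the first feature fusion module in Fig. 2C is given below, assuming a PyTorch-style implementation; the 1 × 1 convolutions, the channel counts and the ReLU activation are assumptions, while the convolution, batch normalization, element-wise multiplication and activation order follows the structure described above.

```python
# Hedged sketch of the first feature fusion module 521 (Fig. 2C): two same-scale
# feature maps each pass through a convolution and batch normalization, the two
# results are multiplied element-wise, and an activation yields the first fused
# feature map.
import torch
import torch.nn as nn


class FirstFeatureFusion(nn.Module):
    def __init__(self, in_channels=1024, out_channels=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # first convolution unit 5210
        self.bn1 = nn.BatchNorm2d(out_channels)                           # first batch regularization unit 5212
        self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # second convolution unit 5211
        self.bn2 = nn.BatchNorm2d(out_channels)                           # second batch regularization unit 5213
        self.act = nn.ReLU(inplace=True)                                  # first activation function unit 5215

    def forward(self, first_map, second_map):
        a = self.bn1(self.conv1(first_map))
        b = self.bn2(self.conv2(second_map))
        return self.act(a * b)  # first dot multiplication unit 5214, then activation


fusion1 = FirstFeatureFusion()
out = fusion1(torch.randn(1, 1024, 10, 10), torch.randn(1, 1024, 10, 10))  # -> [1, 512, 10, 10]
```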
In one possible example, as shown in fig. 2D, the second feature fusion module 522 includes a third convolution unit 5220, a deconvolution unit 5221, a fourth convolution unit 5222, a fifth convolution unit 5223, a third batch regularization unit 5224, a fourth batch regularization unit 5225, a second dot product unit 5226, a second activation function unit 5227;
the third convolution unit 5220 is connected to the third batch regularization unit 5224, the deconvolution unit 5221 is connected to the fourth convolution unit 5222, the fourth convolution unit 5222 is connected to the fifth convolution unit 5223, the fifth convolution unit 5223 is connected to the fourth batch regularization unit 5225, and the fourth batch regularization unit 5225 is connected to the second activation function unit 5227;
the deconvolution unit 5221 processes the first fused feature map, the third convolution unit 5220 processes the third feature map, and the second activation function unit 5227 outputs the second fused feature map.
In a specific implementation, the third convolution unit 5220, the fourth convolution unit 5222 and the fifth convolution unit 5223 may be conventional convolution operation layers, where the post-processing data dimension of the third convolution unit 5220 may be 1 × 512, the post-processing data dimension of the deconvolution unit 5221 may be 2 × 512, the post-processing data dimension of the fourth convolution unit 5222 may be 1 × 512, and the post-processing data dimension of the fifth convolution unit 5223 may be 1 × 512.
As can be seen, in this example, the second feature fusion module adopts an up-sampling and element-wise multiplication manner, which is equivalent to a spatial attention (spatial attention) model, and can use the higher-layer information to supervise the lower layer, thereby improving the feature fusion effect and the model performance.
In addition, the fourth convolution unit 5222 may also be a depthwise (Depthwise) convolution operation layer, and the fifth convolution unit 5223 may also be a pointwise (Pointwise) convolution operation layer. In this case, the post-processing data dimension of the third convolution unit 5220 may be 1 × 512, the post-processing data dimension of the deconvolution unit 5221 may be 3 × 512, the post-processing data dimension of the fourth convolution unit 5222 may be 3 × 512, and the post-processing data dimension of the fifth convolution unit 5223 may be 1 × 512.
It can be seen that in this example, the operation amount of the upsampling structure can be reduced by using a convolution structure of Depthwise plus Pointwise, and finally, a convolution kernel of 1 × 1 is used as much as possible, which is beneficial to further reducing the operation amount and improving the operation efficiency of the model.
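A minimal sketch of the second feature fusion module in Fig. 2D follows, again assuming PyTorch; the deconvolution parameters (3 × 3, stride 2, padding 1, so a 10 × 10 input becomes 19 × 19), the channel counts and the placement of the element-wise multiplication (by analogy with the first fusion module) are assumptions. The Depthwise plus Pointwise variant mentioned above is used for the up-sampling branch.

```python
# Hedged sketch of the second feature fusion module 522 (Fig. 2D), fusing the
# 10x10 first fused feature map with the 19x19 third feature map.
import torch
import torch.nn as nn


class SecondFeatureFusion(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # Branch for the third (19x19) feature map.
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1)                     # third convolution unit 5220
        self.bn3 = nn.BatchNorm2d(channels)                                           # third batch regularization unit 5224
        # Branch for the first fused (10x10) feature map: deconvolution, then Depthwise + Pointwise.
        self.deconv = nn.ConvTranspose2d(channels, channels, 3, stride=2, padding=1)  # deconvolution unit 5221
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)        # fourth (Depthwise) convolution unit 5222
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)                        # fifth (Pointwise) convolution unit 5223
        self.bn4 = nn.BatchNorm2d(channels)                                           # fourth batch regularization unit 5225
        self.act = nn.ReLU(inplace=True)                                              # second activation function unit 5227

    def forward(self, first_fused_map, third_map):
        up = self.bn4(self.pw(self.dw(self.deconv(first_fused_map))))  # 10x10 -> 19x19
        side = self.bn3(self.conv3(third_map))
        return self.act(up * side)  # second dot multiplication unit 5226, then activation


fusion2 = SecondFeatureFusion()
out = fusion2(torch.randn(1, 512, 10, 10), torch.randn(1, 512, 19, 19))  # -> [1, 512, 19, 19]
```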
In one possible example, the target detection module 530 is specifically configured to: process the feature maps that are not subjected to feature fusion and the feature maps after feature fusion with convolution filters to obtain, for a plurality of prediction boxes on the feature maps, the target category and the position offset relative to the default box for the target to be detected; apply non-maximum suppression to the target categories in the plurality of prediction boxes and the position offsets of the prediction boxes relative to the default boxes to obtain the final target category and the final position offset of the prediction box relative to the default box; and determine the position coordinates of the prediction box according to the final position offset of the prediction box relative to the default box and the position coordinates of the default box.
As can be seen, in this example, non-maximum suppression is applied to the target categories in the plurality of prediction boxes and the position offsets of the prediction boxes relative to the default boxes to obtain the final target category and the final position offset of the prediction box relative to the default box, thereby improving the prediction accuracy.
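For illustration, the decoding and suppression step can be sketched as follows, assuming PyTorch and torchvision; the SSD-style center/size offset parameterization and the thresholds are assumptions rather than part of the disclosure.

```python
# Hedged sketch: offsets predicted relative to default boxes are decoded into
# absolute coordinates, then non-maximum suppression keeps the final boxes.
import torch
from torchvision.ops import nms


def decode_and_nms(default_boxes, offsets, scores, iou_thresh=0.45, score_thresh=0.3):
    # default_boxes, offsets: [M, 4] as (cx, cy, w, h); scores: [M] for one class.
    cxcy = default_boxes[:, :2] + offsets[:, :2] * default_boxes[:, 2:]  # shift centers
    wh = default_boxes[:, 2:] * torch.exp(offsets[:, 2:])                # scale widths/heights
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)             # to (x1, y1, x2, y2)
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)                                # suppress overlapping boxes
    return boxes[kept], scores[kept]
```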
In one possible example, the feature extraction module 510 includes a deep learning model network that supports single-stage multi-target detection.
The deep learning model network supporting single-stage multi-target detection may be, for example, mobileNet, AlexNet, VGG, Inception, or the like, which is not limited herein.
In one possible example, the deep learning model network supporting single-stage multi-target detection comprises a modified mobileNet;
the modified mobileNet includes the mobileNet convolutional layers 0 through 11 and custom convolutional layers 12 through 16 whose output sizes decrease in sequence.
Among them, the convolutional layers 0 to 11 of the mobileNet are commonly used convolutional layers and are not described herein again. The post-processing data dimension of the convolutional layer 12 may be 10 × 10 × 512, the post-processing data dimension of the convolutional layer 13 may be 5 × 5 × 256, the post-processing data dimension of the convolutional layer 14 may be 3 × 3 × 256, the post-processing data dimension of the convolutional layer 15 may be 1 × 1 × 256, and the post-processing data dimension of the convolutional layer 16 may be 1 × 1 × 128.
In this example, feature maps of different scales can be extracted in the form of a base network plus an auxiliary network, so that the semantic information of the original image is covered as comprehensively as possible, which improves the prediction precision and accuracy.
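For illustration only, the custom layers 12 through 16 could be sketched as follows; the strides, paddings and the assumed 19 × 19 × 512 output of layer 11 are hypothetical choices that reproduce the listed output sizes, not values taken from the original disclosure.

```python
import torch.nn as nn

def extra_layers():
    """Hypothetical sketch of the custom layers 12-16 appended to the mobileNet
    backbone, assuming layer 11 outputs a 19x19x512 feature map."""
    return nn.ModuleList([
        nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),  # layer 12: 10x10x512
        nn.Conv2d(512, 256, kernel_size=3, stride=2, padding=1),  # layer 13: 5x5x256
        nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),  # layer 14: 3x3x256
        nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=0),  # layer 15: 1x1x256
        nn.Conv2d(256, 128, kernel_size=1, stride=1, padding=0),  # layer 16: 1x1x128
    ])
```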
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for training a target detection model according to an embodiment of the present application, which is applied to the training device 120 for training the target detection model.
Step 301, setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model.
In this possible example, the setting of initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model includes: and using model parameters of a pre-training model trained on a preset training data set as initial module parameters of the feature extraction module, the feature fusion module and the target detection module.
The preset training data set may be an existing training set such as ImageNet.
Step 302, keeping the module parameters of the feature extraction module unchanged in the first stage, and training the target detection model to update the module parameters of the feature fusion module and the target detection module;
step 303, training the target detection model in the second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module and the target detection module, so as to obtain the trained target detection model.
In this possible example, the number of training cycles of the first phase is less than the number of training cycles of the second phase.
Wherein the first phase may be 20 cycles and the second phase may be 40 cycles.
It can be seen that, in the embodiment of the present application, the target detection model is trained in stages. In the first stage, the module parameters of the feature extraction module are fixed, so that a usable feature fusion module and target detection module can be trained; the resulting model already has a certain detection accuracy, but the accuracy can be further improved, and the training result of this stage mainly serves as the base model for the next stage. In the second stage, the complete network model is trained continuously on the basis of the model obtained in the first stage; the module parameters of the feature extraction module are updated synchronously, and the feature extraction module is adjusted according to the information after feature fusion, so that the detection accuracy is further improved compared with the first stage.
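A minimal sketch of this two-stage schedule is given below, assuming a PyTorch model that exposes feature_extraction, feature_fusion and target_detection sub-modules; the optimizer, learning rates and loss function are illustrative assumptions rather than values specified by the embodiment.

```python
import torch

def two_stage_training(model, train_loader, loss_fn,
                       epochs_stage1=20, epochs_stage2=40):
    """Sketch of the staged training described above (sub-module names and
    optimizer settings are assumptions)."""
    # Stage 1: freeze the feature extraction module, train fusion + detection.
    for p in model.feature_extraction.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                          lr=1e-3, momentum=0.9)
    for _ in range(epochs_stage1):
        for images, targets in train_loader:
            opt.zero_grad()
            loss_fn(model(images), targets).backward()
            opt.step()

    # Stage 2: unfreeze everything and update all three modules synchronously.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    for _ in range(epochs_stage2):
        for images, targets in train_loader:
            opt.zero_grad()
            loss_fn(model(images), targets).backward()
            opt.step()
    return model
```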
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for configuring a target detection model according to an embodiment of the present application, which is applied to the training device 120 on which the target detection model has been trained.
Step 401, obtaining a trained target detection model.
Step 402, generating a model configuration file according to the target detection model.
In this possible example, the generating a model configuration file according to the target detection model includes: converting, by using a model conversion tool of a neural processing engine, the target detection model into a model configuration file recognizable by the terminal.
In this possible example, the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
Step 403, sending the model configuration file to a terminal to configure the target detection model.
It can be seen that, in the embodiment of the application, the target detection model can be configured to the terminal in the form of the model configuration file which can be identified by the terminal, so that the configuration accuracy and efficiency are improved.
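The embodiment does not prescribe a concrete conversion pipeline. As one hedged possibility, the trained model could first be exported to an exchange format and then converted into a DLC file with an SNPE converter; the ONNX intermediate step, file names, input resolution and the converter invocation shown in the comment below are assumptions added for illustration.

```python
import torch

def export_for_terminal(model, image_size=300, onnx_path="target_detection.onnx"):
    """Sketch of preparing the trained model for terminal deployment.
    The ONNX export step and the file names are assumptions; the embodiment
    only requires that an SNPE model conversion tool produce a DLC file."""
    model.eval()
    dummy = torch.randn(1, 3, image_size, image_size)  # assumed input resolution
    torch.onnx.export(model, dummy, onnx_path, opset_version=11)
    # The DLC file would then be produced with an SNPE converter, for example
    # (invocation and flags illustrative, not taken from the original text):
    #   snpe-onnx-to-dlc --input_network target_detection.onnx \
    #                    --output_path target_detection.dlc
    return onnx_path
```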
Referring to fig. 5, fig. 5 is a flowchart illustrating a target detection method applied to the execution device 110 configured with a trained target detection model according to an embodiment of the present application.
Step 501, obtaining an original image.
Step 502, processing the original image by using the target detection model to obtain the target type and the target position of the target to be detected in the original image.
In this possible example, the processing the original image using the target detection model includes: calling a model configuration file of the target detection model by using a neural processing engine, and running the target detection model to process the original image.
In this possible example, the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
It can be seen that, in the embodiment of the application, the target detection model used by the terminal can improve the target detection accuracy without excessively increasing the complexity of the model, and is particularly suitable for small-target detection with a lightweight target detection model.
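Tying the pieces together, the following usage sketch preprocesses an original image and runs the detection, reusing the decode_and_suppress sketch given earlier; the input resolution, preprocessing and model output format are assumptions, and on an actual terminal the forward pass would go through the SNPE runtime rather than PyTorch.

```python
import torch
from PIL import Image
from torchvision import transforms

def detect(model, default_boxes, image_path, image_size=300):
    """Illustrative end-to-end call; `model` is assumed to return
    (offsets, class_scores) per image, and `decode_and_suppress` is the
    sketch defined earlier in this description."""
    preprocess = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        offsets, class_scores = model(x)   # assumed model output format
    return decode_and_suppress(offsets[0], class_scores[0], default_boxes)
```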
The above description has introduced the solution of the embodiments of the present application mainly from the perspective of the method-side implementation process. It is understood that, to realize the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments provided herein can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
The embodiment of the present application provides a training apparatus for a target detection model, which may be a training device 120. Specifically, the training device of the target detection model is used for executing the steps executed by the training equipment in the above training method of the target detection model. The training device for the target detection model provided by the embodiment of the application can comprise modules corresponding to corresponding steps.
In the embodiment of the present application, the functional modules of the training apparatus of the target detection model may be divided according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 6 is a schematic diagram of a possible structure of the training apparatus for the target detection model according to the embodiment, in a case where each functional module is divided according to each function. As shown in fig. 6, the training apparatus 6 of the object detection model includes a setting unit 60, a first training unit 61, and a second training unit 62.
The setting unit 60 is configured to set initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model;
the first training unit 61 is configured to, in a first stage, keep the module parameters of the feature extraction module unchanged, train the target detection model to update the module parameters of the feature fusion module and the target detection module;
the second training unit 62 is configured to train the target detection model in a second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module, and the target detection module, so as to obtain the trained target detection model.
In a possible example, in terms of setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model, the setting unit 60 is specifically configured to use model parameters of a pre-training model trained on a preset training data set as the initial module parameters of the feature extraction module, the feature fusion module and the target detection module.
In one possible example, the number of training cycles of the first phase is less than the number of training cycles of the second phase.
All relevant contents of each step related to the above method embodiment may be referred to in the functional description of the corresponding functional module, and are not described herein again. Of course, the training apparatus for the target detection model provided in the embodiments of the present application includes, but is not limited to, the above modules; for example, the training apparatus of the target detection model may further comprise a storage unit 63. The storage unit 63 may be used for storing program codes and data of the training apparatus of the target detection model.
In the case of using an integrated unit, a schematic structural diagram of a training apparatus for an object detection model provided in an embodiment of the present application is shown in fig. 7. In fig. 7, the training device 7 of the target detection model includes: a processing module 70 and a communication module 71. The processing module 70 is used to control and manage the actions of the training devices of the object detection model, for example, the steps performed by the setup unit 60, the first training unit 61, and the second training unit 62, and/or other processes for performing the techniques described herein. The communication module 71 is used to support the interaction between the training apparatus of the target detection model and other devices. As shown in fig. 7, the training apparatus for the object detection model may further include a storage module 72, and the storage module 72 is used for storing program codes and data of the training apparatus for the object detection model, for example, contents stored in the storage unit 63.
The processing module 70 may be a processor or a controller, for example, a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 71 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 72 may be a memory.
All relevant contents of each scene related to the method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again. The training device of the object detection model may perform the steps performed by the training apparatus in the training method of the object detection model shown in fig. 3.
The embodiment of the present application provides a configuration apparatus of a target detection model, which may be a training device 120. Specifically, the configuration device of the target detection model is used for executing the steps executed by the training equipment in the configuration method of the target detection model. The configuration device for the target detection model provided by the embodiment of the application may include modules corresponding to the corresponding steps.
In the embodiment of the present application, the configuration device of the target detection model may be divided into the functional modules according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 8 is a schematic diagram showing a possible configuration of the configuration apparatus of the object detection model according to the above embodiment, in a case where each functional module is divided according to each function. As shown in fig. 8, the configuration device 8 of the object detection model includes an acquisition unit 80, a configuration unit 81, and a transmission unit 82.
The acquiring unit 80 is configured to acquire a trained target detection model;
the configuration unit 81 is configured to generate a model configuration file according to the target detection model;
the sending unit 82 is configured to send the model configuration file to a terminal to configure the target detection model.
In one possible example, in the aspect of generating a model profile according to the object detection model, the configuration unit 81 is specifically configured to use a model conversion tool in a neural processing engine to convert the object detection model into a model profile recognizable by the terminal.
In one possible example, the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again. Of course, the configuration apparatus of the target detection model provided in the embodiment of the present application includes, but is not limited to, the above modules, for example: the means for configuring the object detection model may further comprise a storage unit 83. The storage unit 83 may be used to store program codes and data of the configuration means of the object detection model.
In the case of using an integrated unit, a schematic structural diagram of a configuration apparatus of an object detection model provided in an embodiment of the present application is shown in fig. 9. In fig. 9, the configuration device 9 of the object detection model includes: a processing module 90 and a communication module 91. The processing module 90 is used for controlling and managing actions of the configuration means of the object detection model, for example, steps performed by the acquisition unit 80, the configuration unit 81 and the sending unit 82, and/or other processes for performing the techniques described herein. The communication module 91 is used to support the interaction between the configuration apparatus of the object detection model and other devices. As shown in fig. 9, the configuration device of the object detection model may further include a storage module 92, and the storage module 92 is used for storing program codes and data of the configuration device of the object detection model, for example, storing contents stored in the storage unit 83.
The processing module 90 may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 91 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 92 may be a memory.
All relevant contents of each scene related to the method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again. The configuration device of the target detection model may perform the steps performed by the training device in the configuration method of the target detection model shown in fig. 4.
The embodiment of the present application provides an object detection apparatus, which may be the execution device 110. Specifically, the object detection apparatus is configured to perform the steps performed by the execution device 110 in the above object detection method. The target detection device provided by the embodiment of the application can comprise modules corresponding to the corresponding steps.
In the embodiment of the present application, the target detection apparatus may be divided into the functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 10 shows a schematic diagram of a possible structure of the object detection device according to the above embodiment, in a case where each functional module is divided according to each function. As shown in fig. 10, the object detection apparatus 10 includes an acquisition unit 100 and a use unit 101.
The acquiring unit 100 is configured to acquire an original image;
the using unit 101 is configured to process the original image by using the target detection model, and obtain a target type and a target position of a target to be detected in the original image.
In one possible example, in the aspect of processing the raw image by using the object detection model, the using unit 101 is specifically configured to use a neural processing engine to call a model configuration file of the object detection model, and run the object detection model to process the raw image.
In one possible example, the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again. Of course, the object detection device provided in the embodiments of the present application includes, but is not limited to, the above modules, for example: the object detection apparatus may further include a storage unit 103. The storage unit 103 may be used to store program codes and data of the object detection means.
In the case of using an integrated unit, a schematic structural diagram of an object detection device provided in the embodiment of the present application is shown in fig. 11. In fig. 11, the object detection device 11 includes: a processing module 110 and a communication module 111. The processing module 110 is used to control and manage the actions of the object detection device, e.g., the acquisition unit 100, the steps performed using the unit 101, and/or other processes for performing the techniques described herein. The communication module 111 is used to support interaction between the object detection apparatus and other devices. As shown in fig. 11, the object detection device may further include a storage module 112, and the storage module 112 is used for storing program codes and data of the object detection device, for example, contents stored in the storage unit 103.
The processing module 110 may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 111 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 112 may be a memory.
All relevant contents of each scene related to the method embodiment may be referred to in the functional description of the corresponding functional module, and are not described herein again. The target detection apparatus may perform the steps performed by the execution device in the target detection method shown in fig. 5.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; for instance, the above division of the units is only a division of logical functions, and other divisions may be used in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned methods of the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has illustrated the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core idea of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.

Claims (22)

1. A target detection model is characterized by comprising a feature extraction module, a feature fusion module and a target detection module, wherein the feature extraction module is connected with the feature fusion module and the target detection module, and the feature fusion module is connected with the target detection module;
the characteristic extraction module is used for extracting characteristic graphs of the original image in multiple scales;
the feature fusion module is used for performing feature fusion on at least three feature maps in the feature maps with multiple scales, wherein the at least three feature maps comprise feature maps with the same and different scales;
and the target detection module is used for obtaining the target category and the target position of the target to be detected in the original image according to the feature map which is not subjected to feature fusion in the plurality of feature maps and the feature map after feature fusion.
2. The model of claim 1, wherein the at least three feature maps comprise a first feature map corresponding to a first convolutional layer, a second feature map corresponding to a second convolutional layer, and a third feature map corresponding to a third convolutional layer, wherein the first convolutional layer and the second convolutional layer have the same dimensions, and the third convolutional layer has larger dimensions than the second convolutional layer;
the feature fusion module comprises a first feature fusion module and a second feature fusion module;
the first convolution layer is connected with the target detection module, the first convolution layer and the second convolution layer are connected with the first feature fusion module, the first feature fusion module and the third convolution layer are connected with the second feature fusion module, and the second feature fusion module is connected with the target detection module;
the first feature fusion module is used for performing first feature fusion processing on the first feature map and the second feature map to obtain a first fusion feature map;
the second feature fusion module is used for performing second feature fusion processing on the first fusion feature map and the third feature map to obtain a second fusion feature map;
the target detection module is configured to obtain a target category and a target position of a target to be detected in the original image according to the first feature map, the second fused feature map, and a feature map, other than the first feature map, in a feature map without feature fusion.
3. The model of claim 2, wherein the first feature fusion module comprises a first convolution unit, a second convolution unit, a first batch regularization unit, a second batch regularization unit, a first dot multiplication unit, and a first activation function unit;
the first convolution unit is connected with the first batch regularization unit, the second convolution unit is connected with the second batch regularization unit, the first batch regularization unit and the second batch regularization unit are connected with the first dot multiplication unit, and the first dot multiplication unit is connected with the first activation function unit;
the first convolution unit processes the first feature map, the second convolution unit processes the second feature map, and the first activation function unit outputs the first fused feature map.
4. The model according to claim 2 or 3, wherein the second feature fusion module comprises a third convolution unit, a deconvolution unit, a fourth convolution unit, a fifth convolution unit, a third batch regularization unit, a fourth batch regularization unit, a second dot multiplication unit, and a second activation function unit;
the third convolution unit is connected with the third batch regularization unit, the deconvolution unit is connected with the fourth convolution unit, the fourth convolution unit is connected with the fifth convolution unit, the fifth convolution unit is connected with the fourth batch regularization unit, and the fourth batch regularization unit is connected with the second activation function unit;
the deconvolution unit processes the first fused feature map, the third convolution unit processes the third feature map, and the second activation function unit outputs the second fused feature map.
5. The model according to any one of claims 1-4, wherein the target detection module is specifically configured to:
process, through convolution filters, the feature maps that have not undergone feature fusion and the feature maps after feature fusion among the plurality of feature maps, to obtain, for the target to be detected, the target categories in a plurality of prediction frames of the feature maps and the position offsets of the prediction frames relative to default frames;
apply non-maximum suppression to the target categories in the plurality of prediction frames and the position offsets of the prediction frames relative to the default frames, to obtain the final target category in the prediction frame and the final position offset of the prediction frame relative to the default frame; and
determine the position coordinates of the prediction frame according to the final position offset of the prediction frame relative to the default frame and the position coordinates of the default frame.
6. The model of any one of claims 1-5, wherein the feature extraction module comprises a deep learning model network that supports single-stage multi-target detection.
7. The model of claim 6, wherein the deep learning model network supporting single-stage multi-target detection comprises a modified mobileNet;
the modified mobileNet includes the mobileNet convolutional layers 0 through 11 and custom convolutional layers 12 through 16 whose output sizes decrease in sequence.
8. A method for training an object detection model, applied to the model of any one of claims 1-7, the method comprising:
setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model;
keeping the module parameters of the feature extraction module unchanged in a first stage, and training the target detection model to update the module parameters of the feature fusion module and the target detection module;
and training the target detection model in a second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module and the target detection module to obtain the trained target detection model.
9. The method of claim 8, wherein setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model comprises:
and using model parameters of a pre-training model trained on a preset training data set as initial module parameters of the feature extraction module, the feature fusion module and the target detection module.
10. The method according to claim 8 or 9, wherein the number of training cycles of the first stage is smaller than the number of training cycles of the second stage.
11. A method for configuring a target detection model, applied to the target detection model trained by the method according to any one of claims 8-10, the method comprising:
acquiring a trained target detection model;
generating a model configuration file according to the target detection model;
and sending the model configuration file to a terminal to configure the target detection model.
12. The method of claim 11, wherein the generating a model configuration file according to the target detection model comprises:
converting, by using a model conversion tool of a neural processing engine, the target detection model into a model configuration file recognizable by the terminal.
13. The method of claim 12, wherein the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
14. A target detection method, applied to a terminal configured with the target detection model trained by the method according to any one of claims 8 to 10, the method comprising:
acquiring an original image;
and processing the original image by using the target detection model to obtain the target type and the target position of the target to be detected in the original image.
15. The method of claim 14, wherein the processing the original image using the target detection model comprises:
calling a model configuration file of the target detection model by using a neural processing engine, and running the target detection model to process the original image.
16. The method of claim 15, wherein the neural processing engine is the Qualcomm platform neural processing engine (SNPE), and the model configuration file is a DLC file.
17. A training apparatus for a target detection model, applied to the model of any one of claims 1 to 7, comprising a setting unit, a first training unit and a second training unit,
the setting unit is used for setting initial module parameters of the feature extraction module, the feature fusion module and the target detection module of the target detection model;
the first training unit is used for keeping the module parameters of the feature extraction module unchanged in a first stage and training the target detection model to update the module parameters of the feature fusion module and the target detection module;
the second training unit is configured to train the target detection model in a second stage to synchronously update the module parameters of the feature extraction module, the feature fusion module, and the target detection module, so as to obtain the trained target detection model.
18. A configuration apparatus for a target detection model, applied to the model according to any one of claims 1-10, comprising an acquisition unit, a configuration unit and a sending unit,
the acquisition unit is used for acquiring a trained target detection model;
the configuration unit is used for generating a model configuration file according to the target detection model;
and the sending unit is used for sending the model configuration file to a terminal so as to configure the target detection model.
19. A target detection apparatus, applied to a terminal configured with the target detection model according to any one of claims 1 to 9, comprising an acquisition unit and a using unit,
the acquisition unit is used for acquiring an original image;
and the using unit is used for processing the original image by using the target detection model to obtain the target type and the target position of the target to be detected in the original image.
20. A server, comprising a processor, memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 8-13.
21. A terminal comprising a processor, memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 14-16.
22. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any of the claims 8-16.