CN114529768B - Method, device, electronic equipment and storage medium for determining object category - Google Patents

Method, device, electronic equipment and storage medium for determining object category

Info

Publication number
CN114529768B
Authority
CN
China
Prior art keywords
detection frame
detection
input image
determining
output result
Prior art date
Legal status
Active
Application number
CN202210154477.XA
Other languages
Chinese (zh)
Other versions
CN114529768A (en)
Inventor
许军
陈可心
Current Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd filed Critical Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202210154477.XA priority Critical patent/CN114529768B/en
Publication of CN114529768A publication Critical patent/CN114529768A/en
Application granted granted Critical
Publication of CN114529768B publication Critical patent/CN114529768B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method for determining object categories, relating to the field of artificial intelligence and, in particular, to the technical fields of deep learning and autonomous driving. The scheme is as follows: an input image is processed with N deep learning models to obtain N output results, where N is an integer greater than 1; it is determined whether a predetermined difference exists between the N output results; and, in response to determining that a predetermined difference exists between the N output results, objects in the input image are classified to determine their categories. The disclosure also provides an apparatus, an electronic device and a storage medium for determining the object category.

Description

Method, device, electronic equipment and storage medium for determining object category
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of deep learning and autonomous driving. More particularly, the present disclosure provides a method, apparatus, electronic device, and storage medium for determining an object class.
Background
Lightweight deep learning models can be deployed on mobile terminals such as mobile phones and in-vehicle terminals. For example, a lightweight deep learning model on the vehicle side may be used to determine the class of objects in a captured image, so that an autonomous vehicle can select a driving mode according to the object classes.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for determining an object class.
According to an aspect of the present disclosure, there is provided a method of determining a class of an object, the method comprising: performing image processing on an input image using N deep learning models to obtain N output results, where N is an integer greater than 1; determining whether a predetermined difference exists between the N output results; and classifying objects in the input image to determine a class of the objects in the input image in response to determining that a predetermined difference exists between the N output results.
According to another aspect of the present disclosure, there is provided an apparatus for determining a class of an object, the apparatus comprising: an image processing module for performing image processing on an input image using N deep learning models to obtain N output results, where N is an integer greater than 1; a first determining module for determining whether a predetermined difference exists between the N output results; and a classification module for classifying objects in the input image to determine a class of the objects in the input image in response to determining that a predetermined difference exists between the N output results.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary application scenario diagram of a method and apparatus for determining object categories according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of determining object categories according to one embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of determining object categories according to another embodiment of the present disclosure;
FIG. 4A is a schematic illustration of an input image according to one embodiment of the present disclosure;
FIG. 4B is a schematic diagram of a first output result according to one embodiment of the present disclosure;
FIG. 4C is a schematic diagram of a second output result according to one embodiment of the present disclosure;
FIG. 4D is a schematic diagram of calculating an intersection ratio between at least one first detection frame and at least one second detection frame according to one embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for determining object categories according to one embodiment of the disclosure; and
FIG. 6 is a block diagram of an electronic device to which a method of determining an object class may be applied, according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The lightweight deep learning model deployed on the vehicle-mounted side has limited computational power. When the class of the object in the acquired image is determined by using a lightweight deep learning model on the vehicle-mounted side, the obtained result may have errors.
For example, an object farther from the acquisition device appears smaller in the image, and the lightweight deep learning model may not be able to determine its class accurately. As another example, a reflection of a traffic light may form in water on the road surface; the lightweight deep learning model may classify the reflection as a pedestrian, causing the associated autonomous vehicle to enter a braking mode.
To improve the performance of the lightweight deep learning model, erroneous results can be screened from the related dataset to obtain a training dataset with negative labels, which is then used to train the lightweight deep learning model. For example, the erroneous results may be screened out of the relevant dataset manually.
The lightweight deep learning model can determine object classes from an image in a short time (e.g., within 1 minute), so the relevant dataset quickly grows large. Manually screening such a dataset for erroneous results therefore requires significant time and labor.
FIG. 1 is an exemplary application scenario diagram of a method and apparatus for determining object categories according to one embodiment of the present disclosure.
It should be noted that fig. 1 illustrates only an example of an application scenario in which the embodiments of the present disclosure may be applied, to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, the scenario 100 of this embodiment includes a server 110, an object traveling on a road, a communication base station 130, and a road traffic network. The road traffic network may include roads and intersections formed by the intersections of the roads.
For example, in this scenario 100, objects traveling on a road include pedestrians 121, trucks 122, and cars 123. In one example, the car 123 may be an autonomous vehicle, for example.
For example, in this scenario 100, the road traffic network includes at least an intersection 140 and an intersection 150. Wherein the road segments 161, 162, 163, etc. in the road traffic network meet to form the intersection 140.
A road segment in the embodiments of the disclosure refers to a section of road between two adjacent intersections. The road segments that meet to form each intersection may include road segments that enter the intersection and road segments that exit the intersection. For example, the road segments that meet to form the intersection 140 include road segments 161, 163, etc. that enter the intersection 140, and road segments 162, etc. that exit the intersection 140.
For example, as shown in fig. 1, the car 123 may travel on the road segments that form an intersection. A lightweight deep learning model may be deployed on the in-vehicle terminal of the car 123, and the in-vehicle terminal may upload data to a background server through the communication base station 130. The server 110 may request data from the background server through a network, for example, to obtain the data uploaded by the in-vehicle terminal.
For example, the server 110 may have various deep learning models deployed thereon. The server 110 may be, for example, any server that supports deep learning model operations, such as a server of a distributed system, or a server that incorporates blockchains.
It should be noted that, the method for determining an object class provided by the embodiments of the present disclosure may be generally performed by the server 110. The apparatus for determining an object class provided by the embodiments of the present disclosure may be provided in the server 110.
It should be understood that the number and types of servers, roads, objects, and communication base stations in fig. 1 are merely illustrative. Servers, roads, objects, and communication base stations of any number and type may be used according to implementation requirements.
Fig. 2 is a flow chart of a method of determining object categories according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210 to S230.
In operation S210, image processing is performed on the input image using the N deep learning models, resulting in N output results.
For example, N is an integer greater than 1.
For example, the input dataset of the lightweight deep learning model on the car 123 shown in FIG. 1 may be acquired. An image is selected from the input dataset as the input image img_in, and the corresponding output result output_1 of the lightweight deep learning model may be acquired. In one example, the input image contains objects of a plurality of categories, such as pedestrians, cars, trucks, and non-motor vehicles.
For another example, taking N=2 in this embodiment, the server 110 shown in fig. 1 may perform image processing on the input image img_in using one further deep learning model to obtain one output result output_2.
In operation S220, it is determined whether there is a predetermined difference between the N output results.
For example, it may be determined in any manner whether or not there is a predetermined difference between the N output results. In one example, the number of objects detected by different deep learning models may be determined based on the output results of the different deep learning models to determine whether a predetermined difference exists.
In response to determining that there is a predetermined difference between the N output results, the objects in the input image are classified to determine a class of the objects in the input image in operation S230.
For example, objects in an input image may be classified using a classification model to determine a class of objects in the input image.
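Operations S210 to S230 can be summarized as a short control loop. The sketch below is a minimal illustration in Python; the model objects and the has_predetermined_difference and classify_objects helpers are hypothetical placeholders for the checks and the classification model described later, not names used by the disclosure.

```python
def screen_image(input_image, models, has_predetermined_difference, classify_objects):
    """Sketch of operations S210-S230: run N models, compare outputs, reclassify on disagreement."""
    # S210: image processing on the input image with N deep learning models (N > 1).
    outputs = [model(input_image) for model in models]

    # S220: check whether a predetermined difference exists between the N output results,
    # e.g. based on detection-frame counts or intersection ratios (described below).
    if has_predetermined_difference(outputs):
        # S230: classify the objects in the input image with a separate classification model.
        return classify_objects(input_image, outputs)

    # No predetermined difference: keep the original output as-is.
    return outputs[0]
```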
According to the embodiments of the disclosure, erroneous results can be screened efficiently from the output of the lightweight deep learning model, saving time and labor costs.
In some embodiments, unlike the method 200, a lightweight deep learning model may be deployed on the server, and the input image is processed with the lightweight deep learning model and N-1 further deep learning models to obtain the N output results.
In some embodiments, the N deep learning models include a lightweight deep learning model and a target detection model, as will be described in detail below in conjunction with FIG. 3.
Fig. 3 is a flow chart of a method of determining object categories according to another embodiment of the present disclosure.
As shown in fig. 3, the method 300 may include operations S311, S312, S320, and S330.
In operation S311, a first image processing is performed on the input image using the lightweight deep learning model, resulting in a first output result.
For example, the first output result includes at least one first detection box, each of which is used to mark an area in the input image where an object is located and to characterize a class of the object.
In operation S312, a second image processing is performed on the input image using the object detection model, resulting in a second output result.
For example, the second output results include at least one second detection box, each for marking a region of an object in the input image and for characterizing a class of the object.
For example, the target detection model may be a YOLO (You Only Look Once) model. In one example, the target detection model may be pre-trained to improve its performance.
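For the examples that follow, the first and second output results can be represented by a simple detection-frame structure. The sketch below is a hypothetical representation (the field names and the corner-coordinate convention are assumptions, not taken from the disclosure); later sketches reuse it.

```python
from dataclasses import dataclass

@dataclass
class DetectionBox:
    """One detection frame: the marked image region and the class it characterizes."""
    x1: float       # left edge of the marked region, in pixels
    y1: float       # top edge
    x2: float       # right edge
    y2: float       # bottom edge
    category: str   # class of the object, e.g. "pedestrian", "car", "truck"

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

# The first and second output results are then simply lists of detection frames:
first_output: list[DetectionBox] = []   # from the lightweight deep learning model
second_output: list[DetectionBox] = []  # from the target detection model
```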
In operation S320, it is determined whether a predetermined difference exists between the N output results.
In the embodiment of the present disclosure, in response to determining that there is a predetermined difference between the N output results, operation S330 is performed.
In the embodiment of the present disclosure, in response to determining that there is no predetermined difference between the N output results, the input image is replaced, and the operation returns to operation S311.
In the embodiments of the present disclosure, whether there is a predetermined difference between the N output results may be determined according to various ways.
For example, it may be determined whether a predetermined difference exists between the first output result and the second output result according to an intersection ratio between the at least one first detection frame and the at least one second detection frame. The intersection ratio is the Intersection over Union (IoU). In one example, the overlap region between two detection frames may be determined first, and the intersection ratio of the two detection frames may then be determined as the ratio between the size of the overlap region and the size of one of the detection frames.
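A minimal sketch of this intersection ratio, assuming the DetectionBox form above and axis-aligned frames. As in the example of FIG. 4D below, the ratio here is taken against the area of one detection frame; the conventional IoU would instead divide by the area of the union.

```python
def intersection_ratio(box_a: DetectionBox, box_b: DetectionBox) -> float:
    """Ratio of the overlap area to the area of box_a (the first detection frame)."""
    # Overlapping region between the two detection frames.
    ix1, iy1 = max(box_a.x1, box_b.x1), max(box_a.y1, box_b.y1)
    ix2, iy2 = min(box_a.x2, box_b.x2), min(box_a.y2, box_b.y2)
    overlap_area = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if box_a.area == 0.0:
        return 0.0
    return overlap_area / box_a.area
```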
For another example, it may be determined whether or not there is a predetermined difference between the first output result and the second output result according to the number of the first detection frames and the number of the second detection frames.
In operation S330, objects in the input image are classified using the classification model to determine categories of the objects in the input image.
For example, the backbone network of the classification model may be a pre-trained ResNet (Residual Network).
In one example, the classification model may be used to classify the object within the image area marked by each first detection frame, and the obtained result is taken as the target class of each object in the input image.
Note that operation S311 and operation S312 may be performed in parallel. Embodiments of the present disclosure are not limited thereto, and the two operations may be performed in other orders, such as operation S311 first and then operation S312, or operation S312 first and then operation S311.
In some embodiments, determining whether a predetermined difference exists between the first output result and the second output result based on an intersection ratio between the at least one first detection frame and the at least one second detection frame comprises: calculating an intersection ratio between at least one first detection frame and at least one second detection frame; and determining that a predetermined difference exists between the first output result and the second output result in response to determining that the intersection ratio between the first detection frame and the second detection frame is greater than or equal to a preset intersection ratio threshold value and the categories represented by the first detection frame and the second detection frame are different. The following will describe in detail with reference to fig. 4A to 4D.
Fig. 4A is a schematic diagram of an input image according to one embodiment of the present disclosure.
As shown in fig. 4A, the input image 400 may be an image acquired at a certain time by an acquisition device mounted on an autonomous vehicle. Input image 400 includes 5 objects, object 410, object 420, object 430, object 440, and object 450, respectively.
Fig. 4B is a schematic diagram of a first output result according to one embodiment of the present disclosure.
As shown in fig. 4B, the first output result 401 may be a result of inputting the input image 400 into the lightweight deep learning model described above.
As shown in fig. 4B, the first output result 401 includes 5 first detection frames, which are a first detection frame 401a, a first detection frame 401B, a first detection frame 401c, a first detection frame 401d, and a first detection frame 401e, respectively.
For example, the first detection box 401a may mark the area where the object 410 is located, and the class of the object 410 characterized by the first detection box 401a may be a car. The first detection box 401b may mark the area where the object 420 is located, and the class of the object 420 characterized by the first detection box 401b may be a truck. The first detection box 401c may mark the area where the object 430 is located, and the category of the object 430 characterized by the first detection box 401c may be a pedestrian. The first detection box 401d may mark the area where the object 440 is located, and the class of the object 440 characterized by the first detection box 401d may be a truck. The first detection box 401e may mark the area where the object 450 is located, and the category of the object 450 characterized by the first detection box 401e may be a pedestrian.
Fig. 4C is a schematic diagram of a second output result according to one embodiment of the present disclosure.
As shown in fig. 4C, the second output result 402 may be a result of inputting the input image 400 into the object detection model described above.
As shown in fig. 4C, the second output result 402 includes 4 second detection frames, namely, a second detection frame 402a, a second detection frame 402b, a second detection frame 402C, and a second detection frame 402d.
For example, the second detection box 402a may mark the area where the object 410 is located, and the class of the object 410 characterized by the second detection box 402a may be a car. The second detection box 402b may mark the area where the object 420 is located, and the class of the object 420 characterized by the second detection box 402b may be an emergency vehicle. The second detection box 402c may mark the area where the object 430 is located, and the category of the object 430 characterized by the second detection box 402c may be a pedestrian. The second detection box 402d may mark the area where the object 440 is located and the class of the object 440 characterized by the second detection box 402d may be a truck.
Next, in some embodiments, calculating the cross-over ratio between the at least one first detection frame and the at least one second detection frame may further comprise: deleting a first detection frame with a size smaller than a preset size threshold value in at least one first detection frame to obtain I first detection frames, wherein I is an integer larger than or equal to 1; deleting the second detection frames with the sizes smaller than a preset size threshold value in at least one second detection frame to obtain J second detection frames, wherein J is an integer larger than or equal to 1; and calculating the cross-over ratio of the I first detection frames and the J second detection frames.
For example, if the preset size threshold is greater than the size of the first detection frame 401e, the first detection frame 401e may be deleted, leaving 4 first detection frames. In one example, the preset size threshold may be the diameter of a tire of the object 440. The object 450 is farther from the acquisition device described above and therefore smaller in the input image, so the first detection frame 401e marking the object 450 is small and can be deleted.
For another example, the sizes of the 4 second detection frames are all larger than the preset size threshold, so the 4 second detection frames may be retained. That is, in this embodiment I = 4 and J = 4.
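A sketch of this size-filtering step, assuming the size of a detection frame is measured by its area and the threshold is expressed in the same units (the disclosure leaves the exact size measure open); the threshold value used below is purely illustrative.

```python
def filter_small_boxes(boxes: list[DetectionBox], min_area: float) -> list[DetectionBox]:
    """Delete detection frames whose size is smaller than the preset size threshold."""
    return [box for box in boxes if box.area >= min_area]

# I first detection frames and J second detection frames remain (I >= 1, J >= 1).
first_kept = filter_small_boxes(first_output, min_area=32 * 32)   # threshold value is illustrative
second_kept = filter_small_boxes(second_output, min_area=32 * 32)
```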
Fig. 4D is a schematic diagram of calculating an intersection ratio between at least one first detection frame and at least one second detection frame according to one embodiment of the present disclosure.
As shown in fig. 4D, from the first output result 401 described above and the second output result 402 described above, the cross-over ratio between the 4 first detection frames and the 4 second detection frames can be calculated.
For example, the shaded region between the first detection frame 401a and the second detection frame 402a may characterize their overlap. The intersection ratio between the first detection frame 401a and the second detection frame 402a may be the ratio of the area of the overlapping region to the area of the first detection frame 401a. As shown in fig. 4D, the area of the overlapping region is larger than half the area of the first detection frame 401a. In one example, the preset intersection ratio threshold may be 0.5.
Next, it may be determined that the intersection ratio between the first detection frame 401a and the second detection frame 402a is greater than the preset intersection ratio threshold. Similarly, it may be determined that the intersection ratios between the first detection frame 401b and the second detection frame 402b, between the first detection frame 401c and the second detection frame 402c, and between the first detection frame 401d and the second detection frame 402d are all greater than the preset intersection ratio threshold.
As described above, the class of the object 420 characterized by the first detection frame 401b may be a truck, while the class of the object 420 characterized by the second detection frame 402b may be an emergency vehicle. Their intersection ratio is greater than the preset threshold and the categories they characterize differ, so it can be determined that a predetermined difference exists between the first output result 401 and the second output result 402.
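The comparison of this example can be sketched by matching first detection frames against second detection frames and flagging pairs whose categories differ. The helper below reuses the intersection_ratio and filtered box lists from the earlier sketches; the 0.5 threshold follows the example above.

```python
def find_conflicting_boxes(first_boxes, second_boxes, ratio_threshold=0.5):
    """Return first detection frames that overlap a second frame but characterize a different class."""
    conflicting = []
    for box_a in first_boxes:
        for box_b in second_boxes:
            if (intersection_ratio(box_a, box_b) >= ratio_threshold
                    and box_a.category != box_b.category):
                conflicting.append(box_a)
                break
    return conflicting

# A predetermined difference exists if at least one such conflicting pair is found.
has_difference = len(find_conflicting_boxes(first_kept, second_kept)) > 0
```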
In some embodiments, classifying the object in the input image includes: inputting an image area marked by at least one first detection frame into a classification model to obtain a target class of at least one object corresponding to the at least one first detection frame, wherein the intersection ratio of each first detection frame in the at least one first detection frame and one second detection frame is greater than or equal to a preset intersection ratio threshold value, and the classes represented by the two are different; the target class of each object in the input image is determined based on the at least one first detection box and the target class of the at least one object.
For example, the image areas marked by the 5 first detection frames described above may be respectively input into the classification model, so as to obtain target categories of 5 objects corresponding to the 5 first detection frames.
Alternatively, for example, suppose there are M first detection frames in total in the first output result. Classifying objects in the input image then includes: inputting the image areas marked by K first detection frames into the classification model to obtain target categories of the K objects corresponding to the K first detection frames, where the intersection ratio of each of the K first detection frames with one second detection frame is greater than or equal to the preset intersection ratio threshold and the categories characterized by the two differ; and determining the target category of each object in the input image according to the M first detection frames and the target categories of the K objects.
For example, K is less than or equal to M, K is an integer greater than or equal to 1, and M is an integer greater than or equal to 1.
In one example, as described above, the intersection ratio of the first detection frame 401b and the second detection frame 402b is greater than the preset intersection ratio threshold, and the categories they characterize differ. The image area marked by the first detection frame 401b may be input into the classification model, giving an emergency vehicle as the target class of the object 420 corresponding to the first detection frame 401b.
In addition, the categories represented by the other first detection frames can be taken as the target categories of the corresponding objects. For example, the object 410 is a car, the object 430 is a pedestrian, the object 440 is a truck, and the object 450 is a pedestrian.
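A sketch of this re-classification step: the image area marked by each conflicting first detection frame is cropped and passed to the classification model, while the other frames keep the category they already characterize. The classification_model callable and the PIL-style crop call are assumptions for illustration.

```python
def refine_categories(input_image, first_boxes, conflicting_boxes, classification_model):
    """Determine a target class per object: reclassify conflicting frames, keep the rest."""
    target_categories = []
    for box in first_boxes:
        if box in conflicting_boxes:
            # Image area marked by the first detection frame, fed to the classification model.
            region = input_image.crop((box.x1, box.y1, box.x2, box.y2))
            target_categories.append(classification_model(region))
        else:
            # Other frames keep the class characterized by the lightweight model.
            target_categories.append(box.category)
    return target_categories
```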
Next, another way of determining whether or not there is a predetermined difference between the first output result and the second output result will be described in detail with reference to fig. 4A to 4D.
In some embodiments, it may be determined whether a predetermined difference exists between the first output result and the second output result according to the number of the first detection frames and the number of the second detection frames.
For example, in response to determining that the total number of the first detection frames and the total number of the second detection frames are different, it is determined that a predetermined difference exists between the first output result and the second output result. In one example, as shown in fig. 4B and 4C, the first output result includes 5 first detection frames, the second output result includes 4 second detection frames, and it may be determined that the total number of the first detection frames and the total number of the second detection frames are different. Next, it may be determined that there is a predetermined difference between the first output result and the second output result.
For another example, in response to determining that the number of first detection frames and the number of second detection frames characterizing the same category are different, it is determined that a predetermined difference exists between the first output result and the second output result. In one example, as shown in fig. 4B and 4C, the number of first detection frames characterizing the truck category in the first output result is 2, while the number of second detection frames characterizing the truck category in the second output result is 1. That is, the number of first detection frames and the number of second detection frames characterizing the truck category differ, and a predetermined difference exists between the first output result and the second output result.
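Both count-based variants can be sketched with a single helper, assuming the same DetectionBox form as above: the first test compares the total numbers of detection frames, the second compares the per-category counts.

```python
from collections import Counter

def counts_differ(first_boxes, second_boxes) -> bool:
    """Predetermined difference based on the numbers of detection frames."""
    # Variant 1: the total numbers of first and second detection frames differ.
    if len(first_boxes) != len(second_boxes):
        return True
    # Variant 2: the numbers of frames characterizing the same category differ.
    first_counts = Counter(box.category for box in first_boxes)
    second_counts = Counter(box.category for box in second_boxes)
    return first_counts != second_counts
```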
In some embodiments, a sample image for adjusting parameters of the lightweight deep learning model is determined from the target class of each object in the input image and the input image.
For example, the input image 400 described above may be used as a sample image, the label of which is the target class for each object in the image.
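A minimal sketch of assembling such a sample, assuming the sample is stored as the input image together with one target category per first detection frame (the dictionary layout is an assumption, not a format specified by the disclosure).

```python
def build_training_sample(input_image, first_boxes, target_categories):
    """Pair the input image with per-detection-frame target classes as its label."""
    labels = [
        {"bbox": (box.x1, box.y1, box.x2, box.y2), "category": category}
        for box, category in zip(first_boxes, target_categories)
    ]
    return {"image": input_image, "labels": labels}
```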
Fig. 5 is a block diagram of an apparatus for determining object categories according to one embodiment of the present disclosure.
As shown in fig. 5, the apparatus 500 may include an image processing module 510, a first determination module 520, and a classification module 530.
The image processing module 510 is configured to perform image processing on an input image by using N deep learning models, so as to obtain N output results. Wherein N is an integer greater than 1.
A first determining module 520 is configured to determine whether a predetermined difference exists between the N output results.
A classification module 530, configured to classify the object in the input image to determine a class of the object in the input image in response to determining that there is a predetermined difference between the N output results.
In some embodiments, the N deep learning models include a lightweight deep learning model and a target detection model, and the image processing module includes: the first image processing sub-module is used for carrying out first image processing on the input image by utilizing the lightweight deep learning model to obtain a first output result, wherein the first output result comprises at least one first detection frame, and each first detection frame is used for marking the area where one object in the input image is located and representing the class of the object; and the second image processing sub-module is used for carrying out second image processing on the input image by utilizing the target detection model to obtain a second output result, wherein the second output result comprises at least one second detection frame, and each second detection frame is used for marking the area of one object in the input image and representing the category of the object.
In some embodiments, the first determining module comprises: and the first determining submodule is used for determining whether a preset difference exists between the first output result and the second output result according to the cross-over ratio between the at least one first detection frame and the at least one second detection frame.
In some embodiments, the first determination submodule includes: a calculating unit for calculating an intersection ratio between the at least one first detection frame and the at least one second detection frame; and the determining unit is used for determining that a preset difference exists between the first output result and the second output result in response to the fact that the cross-over ratio between the first detection frame and the second detection frame is larger than or equal to a preset cross-over ratio threshold value and the types represented by the first detection frame and the second detection frame are different.
In some embodiments, the computing unit comprises: the first deleting subunit is used for deleting the first detection frames with the sizes smaller than a preset size threshold value in the at least one first detection frame to obtain I first detection frames, wherein I is an integer larger than or equal to 1; the second deleting subunit is used for deleting the second detection frames with the sizes smaller than a preset size threshold value in the at least one second detection frame to obtain J second detection frames, wherein J is an integer larger than or equal to 1; and the calculating subunit is used for calculating the cross-over ratio of the I first detection frames and the J second detection frames.
In some embodiments, the classification module comprises: an obtaining sub-module for inputting the image area marked by at least one first detection frame into the classification model to obtain a target category of at least one object corresponding to the at least one first detection frame, wherein the intersection ratio of each of the at least one first detection frame with one second detection frame is greater than or equal to a preset intersection ratio threshold value and the categories characterized by the two differ; and a second determining sub-module for determining the target category of each object in the input image according to the at least one first detection frame and the target category of the at least one object.
In some embodiments, the first determining module comprises: and a third determining sub-module, configured to determine whether a predetermined difference exists between the first output result and the second output result according to the number of the first detection frames and the number of the second detection frames.
In some embodiments, the apparatus 500 further comprises: and the second determining module is used for determining a sample image for adjusting parameters of the lightweight deep learning model according to the target category of each object in the input image and the input image.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, for example, a method of determining an object class. For example, in some embodiments, the method of determining object categories may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the method of determining object categories described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method of determining the object class in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method of determining an object class, comprising:
performing image processing on an input image by using N deep learning models to obtain N output results, wherein N is an integer greater than 1, the N output results comprise a first output result and a second output result, the first output result comprises at least one first detection frame, each first detection frame is used for marking an area where one object in the input image is located and representing the class of the object, the second output result comprises at least one second detection frame, and each second detection frame is used for marking the area where the one object in the input image is located and representing the class of the object;
determining whether a predetermined difference exists between the N output results; and
classifying objects in the input image in response to determining that there are predetermined differences between the N output results, to determine a class of objects in the input image,
wherein the determining whether there is a predetermined difference between the N output results includes:
calculating an intersection ratio between the at least one first detection frame and the at least one second detection frame;
and determining that a predetermined difference exists between the first output result and the second output result in response to determining that the intersection ratio between the first detection frame and the second detection frame is greater than or equal to a preset intersection ratio threshold value and the categories represented by the first detection frame and the second detection frame are different.
2. The method of claim 1, wherein the N deep learning models include a lightweight deep learning model and a target detection model,
the image processing is performed on the input image by using the N deep learning models, and obtaining N output results includes:
performing first image processing on the input image by using the lightweight deep learning model to obtain the first output result;
and performing second image processing on the input image by using the target detection model to obtain the second output result.
3. The method of claim 1, wherein the calculating an intersection ratio between the at least one first detection box and the at least one second detection box comprises:
deleting the first detection frames with the sizes smaller than a preset size threshold value in the at least one first detection frame to obtain I first detection frames, wherein I is an integer larger than or equal to 1;
deleting the second detection frames with the sizes smaller than a preset size threshold value in the at least one second detection frame to obtain J second detection frames, wherein J is an integer larger than or equal to 1;
and calculating the cross-over ratio of the I first detection frames and the J second detection frames.
4. The method of claim 2, wherein the classifying the object in the input image comprises:
inputting an image area marked by at least one first detection frame into a classification model to obtain a target class of at least one object corresponding to the at least one first detection frame, wherein the cross-over ratio of each first detection frame in the at least one first detection frame to one second detection frame is greater than or equal to a preset cross-over ratio threshold value, and the classes represented by the two are different;
and determining the target category of each object in the input image according to the at least one first detection frame and the target category of the at least one object.
5. The method of claim 2, wherein the determining whether there is a predetermined difference between the N output results further comprises:
determining whether a predetermined difference exists between the first output result and the second output result according to the number of the first detection frames and the number of the second detection frames.
6. The method of claim 4 or 5, further comprising:
and determining a sample image for adjusting parameters of the lightweight deep learning model according to the target category of each object in the input image and the input image.
7. An apparatus for determining a class of an object, comprising:
the image processing module is used for carrying out image processing on an input image by utilizing N deep learning models to obtain N output results, wherein N is an integer larger than 1, the N output results comprise a first output result and a second output result, the first output result comprises at least one first detection frame, each first detection frame is used for marking an area where one object in the input image is located and representing the class of the object, the second output result comprises at least one second detection frame, and each second detection frame is used for marking the area where the object in the input image is located and representing the class of the object;
a first determining module, configured to determine whether a predetermined difference exists between the N output results; and
a classification module for classifying objects in the input image to determine a class of objects in the input image in response to determining that there is a predetermined difference between the N output results,
wherein the first determining module includes:
a calculating unit for calculating an intersection ratio between the at least one first detection frame and the at least one second detection frame;
and the determining unit is used for determining that a preset difference exists between the first output result and the second output result in response to the fact that the cross-over ratio between the first detection frame and the second detection frame is larger than or equal to a preset cross-over ratio threshold value and the types represented by the first detection frame and the second detection frame are different.
8. The apparatus of claim 7, wherein the N deep learning models comprise a lightweight deep learning model and a target detection model,
the image processing module includes:
the first image processing sub-module is used for carrying out first image processing on the input image by utilizing the lightweight deep learning model to obtain the first output result, wherein the first output result comprises at least one first detection frame, and each first detection frame is used for marking the area where one object in the input image is located and representing the class of the object;
and the second image processing sub-module is used for carrying out second image processing on the input image by utilizing the target detection model to obtain the second output result, wherein the second output result comprises at least one second detection frame, and each second detection frame is used for marking the area of one object in the input image and representing the category of the object.
9. The apparatus of claim 7, wherein the computing unit comprises:
the first deleting subunit is used for deleting the first detection frames with the sizes smaller than a preset size threshold value in the at least one first detection frame to obtain I first detection frames, wherein I is an integer larger than or equal to 1;
the second deleting subunit is used for deleting the second detection frames with the sizes smaller than a preset size threshold value in the at least one second detection frame to obtain J second detection frames, wherein J is an integer larger than or equal to 1;
and the calculating subunit is used for calculating the cross-over ratio of the I first detection frames and the J second detection frames.
10. The apparatus of claim 8, wherein the classification module comprises:
the image processing device comprises an acquisition submodule, a classification module and a judgment submodule, wherein the acquisition submodule is used for inputting an image area marked by at least one first detection frame into the classification model to acquire a target category of at least one object corresponding to the at least one first detection frame, wherein the intersection ratio of each first detection frame in the at least one first detection frame and one second detection frame is larger than or equal to a preset intersection ratio threshold value, and the categories represented by the first detection frame and the second detection frame are different;
and the second determining submodule is used for determining the target category of each object in the input image according to the at least one first detection frame and the target category of the at least one object.
11. The apparatus of claim 8, wherein the first determination module comprises:
and a third determining sub-module, configured to determine whether a predetermined difference exists between the first output result and the second output result according to the number of the first detection frames and the number of the second detection frames.
12. The apparatus of claim 10 or 11, further comprising:
and the second determining module is used for determining a sample image for adjusting parameters of the lightweight deep learning model according to the target category of each object in the input image and the input image.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 6.
CN202210154477.XA 2022-02-18 2022-02-18 Method, device, electronic equipment and storage medium for determining object category Active CN114529768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210154477.XA CN114529768B (en) 2022-02-18 2022-02-18 Method, device, electronic equipment and storage medium for determining object category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210154477.XA CN114529768B (en) 2022-02-18 2022-02-18 Method, device, electronic equipment and storage medium for determining object category

Publications (2)

Publication Number Publication Date
CN114529768A (en) 2022-05-24
CN114529768B (en) 2023-07-21

Family

ID=81624800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210154477.XA Active CN114529768B (en) 2022-02-18 2022-02-18 Method, device, electronic equipment and storage medium for determining object category

Country Status (1)

Country Link
CN (1) CN114529768B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255525B1 (en) * 2017-04-25 2019-04-09 Uber Technologies, Inc. FPGA device for image classification
CN112906478A (en) * 2021-01-22 2021-06-04 北京百度网讯科技有限公司 Target object identification method, device, equipment and storage medium
CN113591566A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Training method and device of image recognition model, electronic equipment and storage medium
CN113869364A (en) * 2021-08-26 2021-12-31 北京旷视科技有限公司 Image processing method, image processing apparatus, electronic device, and medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016220174A (en) * 2015-05-26 2016-12-22 株式会社東芝 Home appliance control method and home appliance controller
CN112183166B (en) * 2019-07-04 2024-07-02 北京地平线机器人技术研发有限公司 Method and device for determining training samples and electronic equipment
CN110674856A (en) * 2019-09-12 2020-01-10 阿里巴巴集团控股有限公司 Method and device for machine learning
US11885907B2 (en) * 2019-11-21 2024-01-30 Nvidia Corporation Deep neural network for detecting obstacle instances using radar sensors in autonomous machine applications
CN111461182B (en) * 2020-03-18 2023-04-18 北京小米松果电子有限公司 Image processing method, image processing apparatus, and storage medium
CN111753911A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Method and apparatus for fusing models
CN112529335B (en) * 2020-12-25 2021-12-31 广州文远知行科技有限公司 Model detection method, device, equipment and storage medium
CN113240696B (en) * 2021-05-20 2022-02-08 推想医疗科技股份有限公司 Image processing method and device, model training method and device, and electronic equipment
CN113065614B (en) * 2021-06-01 2021-08-31 北京百度网讯科技有限公司 Training method of classification model and method for classifying target object
CN113379718B (en) * 2021-06-28 2024-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium
CN113688760A (en) * 2021-08-31 2021-11-23 广州文远知行科技有限公司 Automatic driving data identification method and device, computer equipment and storage medium
CN113920307A (en) * 2021-09-29 2022-01-11 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and image detection method
CN113989300A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Lane line segmentation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114529768A (en) 2022-05-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant