CN114119964A - Network training method and device, and target detection method and device - Google Patents

Network training method and device, and target detection method and device

Info

Publication number
CN114119964A
CN114119964A
Authority
CN
China
Prior art keywords
target detection
target
picture
loss function
picture sample
Prior art date: 2021-11-29
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111435769.2A
Other languages
Chinese (zh)
Inventor
吴嫣然
林培文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-11-29
Filing date: 2021-11-29
Publication date: 2022-03-01
Application filed by Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority to CN202111435769.2A
Publication of CN114119964A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a network training method and device, and a target detection method and device, wherein the method includes the following steps: obtaining a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample; inputting the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network; determining a target loss function value based on the first target detection box and the second target detection box; and adjusting network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network. Through the consistency constraint between corresponding first and second target detection boxes, the target detection network can, to a certain extent, achieve more robust detection performance on similar pictures.

Description

Network training method and device, and target detection method and device
Technical Field
The present disclosure relates to the technical field of target detection, and in particular, to a method and an apparatus for network training, and a method and an apparatus for target detection.
Background
With the continuous development of target detection technology, target detection is widely applied in various fields, such as automatic driving and intelligent security. To detect a given target well, a large number of images of the target need to be prepared in advance, and a target detection network is trained with these images to obtain a target detection network for that target.
When the target detection network is applied in a real scene, it may output different prediction results for adjacent frame images, and the predictions for adjacent frames may even differ greatly, which leads to insufficient prediction stability and poor detection performance.
Disclosure of Invention
The embodiments of the present disclosure provide at least a network training method and device, and a target detection method and device, so as to improve detection performance.
In a first aspect, an embodiment of the present disclosure provides a network training method, the method including:
obtaining a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, where the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
inputting the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
determining a target loss function value based on the first target detection box and the second target detection box;
and adjusting network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
With this network training method, during training, target detection is performed not only on the first picture sample but also on the second picture sample obtained by enhancement processing of the first picture sample; the target loss function value is then determined from the resulting first and second target detection boxes so as to train the target detection network for consistency. The second picture sample obtained through enhancement processing shares features with the first picture sample to a certain extent, and the consistency constraint between corresponding first and second target detection boxes enables the target detection network, to a certain extent, to achieve more robust detection performance on similar pictures, such as adjacent video frames.
In one possible embodiment, the determining the target loss function value based on the first target detection box and the second target detection box includes:
obtaining a first sub-target loss function value based on a difference operation between the first target detection box and the second target detection box; and
obtaining a second sub-target loss function value based on a difference operation between the first target detection box and a first target annotation box of the first picture sample and a difference operation between the second target detection box and a second target annotation box of the second picture sample;
determining the target loss function value based on the first sub-target loss function value and the second sub-target loss function value.
Here, the target loss function value is determined jointly by a first sub-target loss function value, obtained from a difference operation between the first and second target detection boxes, and a second sub-target loss function value, obtained from difference operations between detection boxes and annotation boxes. The former constrains the proximity between the two detection boxes, while the latter constrains the proximity between each detection box and its annotation box; the closer these are, the better the network performs.
In a possible implementation, in a case where one first picture sample corresponds to a plurality of second picture samples and the similarity between the plurality of second picture samples is higher than a second preset threshold, the determining the target loss function value based on the first sub-target loss function value and the second sub-target loss function value includes:
obtaining a third sub-target loss function value based on difference operations between the second target detection boxes respectively corresponding to the plurality of second picture samples;
determining the target loss function value based on the first sub-target loss function value, the second sub-target loss function value, and the third sub-target loss function value.
Here, the proximity between the detection boxes of the multiple second picture samples can also be constrained, which further improves the training performance of the network.
In one possible embodiment, the second target annotation box is determined from the first target annotation box.
In one possible embodiment, the determining the target loss function value based on the first target detection box and the second target detection box includes:
determining an intersection-over-union ratio between the first target detection box and the second target detection box;
and, in response to the intersection-over-union ratio being greater than a preset ratio, determining the target loss function value based on the first target detection box and the second target detection box.
Here, whether the intersection-over-union ratio between the two target detection boxes exceeds the preset ratio can be used to decide whether the two boxes correspond to the same target, and the related target loss function value is computed only on the premise that the two boxes belong to the same target, which improves the accuracy of the network.
In a possible embodiment, the enhancement processing is performed on the first picture sample in at least one of the following ways:
performing picture processing on the first picture sample to obtain the second picture sample;
shifting the pixel value of each pixel in the first picture sample to a pixel position at a preset offset to obtain an updated pixel value for each pixel, and determining the second picture sample based on the updated pixel values;
obtaining a patch picture whose size is smaller than a preset threshold, and pasting the patch picture onto the first picture sample at a position other than the picture position indicated by the first target annotation box, to obtain the second picture sample;
and superimposing random noise on the first picture sample to obtain the second picture sample.
In a second aspect, an embodiment of the present disclosure further provides a target detection method, the method including:
acquiring a picture to be detected;
and performing target detection on the picture to be detected by using a target detection network trained with the network training method according to the first aspect or any of its implementations, to obtain a target detection result corresponding to the picture to be detected.
In a third aspect, an embodiment of the present disclosure further provides a network training apparatus, the apparatus including:
an obtaining module, configured to obtain a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, where the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
a detection module, configured to input the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
a determination module, configured to determine a target loss function value based on the first target detection box and the second target detection box;
and a training module, configured to adjust network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
In a fourth aspect, an embodiment of the present disclosure further provides a target detection apparatus, the apparatus including:
an obtaining module, configured to acquire a picture to be detected;
and a detection module, configured to perform target detection on the picture to be detected by using a target detection network trained with the network training method according to the first aspect or any of its implementations, to obtain a target detection result corresponding to the picture to be detected.
In a fifth aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the network training method according to the first aspect or any of its implementations, or the steps of the target detection method according to the second aspect.
In a sixth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored; when executed by a processor, the computer program performs the steps of the network training method according to the first aspect or any of its implementations, or the steps of the target detection method according to the second aspect.
For the effects of the above apparatuses, electronic device, and computer-readable storage medium, reference is made to the description of the corresponding methods, which is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings here are incorporated in and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without creative effort.
Fig. 1 illustrates a flow chart of a method of network training provided by an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of consistency training in the network training method provided by an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a method of target detection provided by an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating an apparatus for network training provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an apparatus for target detection provided by an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that when a target detection network is applied in a real scene, it may output different prediction results for adjacent frame images, and the predictions for adjacent frames may even differ greatly, which leads to insufficient prediction stability and poor detection performance.
To handle target detection on time-series pictures, a video-based target detection method has been proposed in the related art. In such a method, a memory module is usually introduced to model the picture information within the same time series, so that information from historical frames can be used when detecting targets in the current frame.
However, this method depends on video-level target detection labels: a large amount of video annotation information is needed during training, so existing image-based target detection labels cannot be reused, which wastes considerable human resources. In addition, because a new memory module is introduced to store temporal information, the computation of the model increases greatly, and the change to the model structure makes deployment of the model more difficult.
Based on the above research, the present disclosure provides a network training method and apparatus, and a target detection method and apparatus based on consistency constraint, so as to improve detection performance.
To facilitate understanding of the present embodiment, first, a method for network training disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the method for network training provided in the embodiments of the present disclosure is generally an electronic device with certain computing capability, and the electronic device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the method of network training may be implemented by a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of the network training method provided by an embodiment of the present disclosure is shown. The method includes steps S101 to S104, where:
S101: obtaining a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, where the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
S102: inputting the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
S103: determining a target loss function value based on the first target detection box and the second target detection box;
S104: adjusting network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
To facilitate understanding of the network training method provided by the embodiments of the present disclosure, its application scenarios are first briefly described. The method is mainly applicable to fields related to target detection; for example, it can be applied to pedestrian detection in video surveillance or to vehicle detection in automatic driving.
In the related art, image-based target detection cannot take into account the information shared by similar, temporally adjacent pictures, so its detection stability is low, while video-based target detection requires a large amount of annotation information, so its detection efficiency is low.
To address these problems, the embodiments of the present disclosure provide a network training method that trains the target detection network under consistency constraints, improving detection efficiency while also improving stability.
The first picture samples obtained here differ with the application scenario. Once a large number of first picture samples suited to a specific application scenario have been acquired, enhancement processing can be performed on each first picture sample to obtain a second picture sample with high similarity to it.
The enhancement processing in the embodiments of the present disclosure mainly refers to modifying a picture by a small margin. For example, the second picture sample may be obtained by adjusting picture properties such as brightness, contrast, saturation, and hue. As another example, the pixels of the picture may be shifted slightly: the pixel value of each pixel in the first picture sample is moved to a pixel position at a preset offset to obtain updated pixel values, and the second picture sample is determined from these updated values, so that it remains very similar to the first picture sample. As another example, a patch may be added: a patch picture whose size is smaller than a preset threshold is obtained and pasted onto the first picture sample at a position other than the picture position indicated by the pre-annotated first target annotation box, following data-enhancement ideas such as cutout and mosaic, to obtain a similar second picture sample. As yet another example, random noise may be added: extra noise is superimposed without changing the appearance of the picture, so that the network becomes more robust to different but similar inputs.
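As an illustration, a minimal Python sketch of these four enhancement modes follows. The concrete magnitudes (brightness factor, pixel offset, patch size, noise scale) and the NumPy formulation are assumptions for illustration only, not the implementation claimed by the disclosure.

```python
import numpy as np

def enhance(img, boxes, mode, rng=np.random.default_rng()):
    """Derive a second picture sample from a first picture sample.

    img   -- H x W x C uint8 array (the first picture sample)
    boxes -- first target annotation boxes as [x1, y1, x2, y2]
    mode  -- one of the four enhancement modes from the text
    All magnitudes below are assumed example values.
    """
    out = img.astype(np.float32)
    if mode == "picture_processing":       # adjust brightness/contrast-like properties
        out = out * rng.uniform(0.9, 1.1) + rng.uniform(-10.0, 10.0)
    elif mode == "pixel_shift":            # move each pixel by a preset offset
        dy, dx = 2, 2
        out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    elif mode == "patch":                  # paste a small patch outside all annotation boxes
        ph, pw = 16, 16                    # patch smaller than a preset threshold
        for _ in range(20):                # rejection-sample a location off the boxes
            y = int(rng.integers(0, img.shape[0] - ph))
            x = int(rng.integers(0, img.shape[1] - pw))
            if all(x + pw <= b[0] or x >= b[2] or y + ph <= b[1] or y >= b[3]
                   for b in boxes):
                out[y:y + ph, x:x + pw] = rng.integers(0, 256, (ph, pw, img.shape[2]))
                break
    elif mode == "noise":                  # superimpose random noise
        out = out + rng.normal(0.0, 3.0, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)
```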
Whichever enhancement mode is used, the similarity between the processed second picture sample and the original first picture sample is relatively high, for example higher than a first preset threshold (e.g., 0.8). Because the second picture sample and the first picture sample are highly similar, their detection results should in theory be consistent, and through the consistency constraint this provides strong data support for training the target detection network.
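The disclosure does not specify how this similarity is measured; as a placeholder sketch, the cosine similarity of flattened pixel values is used below, with the 0.8 threshold taken from the example above.

```python
import numpy as np

def picture_similarity(a, b):
    """Cosine similarity of flattened pixel values (one assumed measure)."""
    x = a.astype(np.float32).ravel()
    y = b.astype(np.float32).ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

# Keep a generated second picture sample only if it stays above the
# first preset threshold from the example (0.8).
# assert picture_similarity(first_sample, second_sample) > 0.8
```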
To facilitate network training, the first picture sample in the embodiments of the present disclosure may be pre-annotated with a first target annotation box, which indicates the image region where a target object is located in the first picture sample. After the enhancement processing, a second target annotation box for the second picture sample can then be obtained, directly or indirectly, from the annotation information of the first picture sample; the second target annotation box indicates the image region where the target object is located in the second picture sample.
The way the second target annotation box is determined differs slightly with the enhancement mode. For example, when data enhancement is performed by superimposing random noise, the second target annotation box of the second picture sample can be taken directly from the first target annotation box; when data enhancement is performed by pixel shifting, the second target annotation box can be obtained by translating the first target annotation box accordingly.
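A minimal sketch of this annotation-box derivation follows, assuming axis-aligned [x1, y1, x2, y2] boxes and the same preset offset used for the pixel shift; the mode names match the sketch above.

```python
def second_annotation_boxes(first_boxes, mode, dx=2, dy=2):
    """Derive second target annotation boxes from first target annotation boxes."""
    if mode in ("picture_processing", "noise", "patch"):
        # Appearance-only changes: boxes carry over unchanged.
        return [list(b) for b in first_boxes]
    if mode == "pixel_shift":
        # Geometric shift: translate each box by the same preset offset.
        return [[x1 + dx, y1 + dy, x2 + dx, y2 + dy]
                for x1, y1, x2, y2 in first_boxes]
    raise ValueError(f"unknown enhancement mode: {mode}")
```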
To train a target detection network with better stability, the embodiments of the present disclosure may input the first picture sample and the second picture sample into the target detection network to be trained at the same time, obtaining a first target detection box for the first picture sample and a second target detection box for the second picture sample.
The target detection network is trained on the correspondence between a picture sample and the target detection box of the target object in that sample; the first and second target detection boxes output by the network can subsequently be compared with the pre-annotated target annotation boxes to train the network.
During training, the target loss function value may be determined based on the first target detection box and the second target detection box. If the network iteration stop condition has not been reached, the network parameter values of the target detection network to be trained are adjusted based on the target loss function value, and the first and second picture samples are then input into the adjusted network for the next round of training, until the stop condition is reached and the trained target detection network is obtained.
The network iteration stop condition may be, for example, that the number of iterations reaches a preset count (e.g., 100), that all picture samples have been traversed once, that the target loss function value is smaller than a preset threshold, or some other condition.
Taking the condition that the target loss function value is smaller than the preset threshold as an example: if the value is not smaller than the threshold after the first round of training, the loss is back-propagated through the network and the network parameter values are adjusted; a second round of training is then performed and a new target loss function value is determined. If the new value is smaller than the threshold, training ends; otherwise a third round of training is performed, and this process repeats until the target loss function value falls below the threshold, at which point training ends and the adjusted network parameter values are obtained.
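A minimal PyTorch-style sketch of this training loop follows; the detector interface, the loss threshold, and the maximum iteration count are assumed example values, and `loss_fn` stands for the target loss function described below.

```python
import torch

def train(detector, loader, optimizer, loss_fn,
          loss_threshold=0.05, max_epochs=100):
    """Train until the target loss function value falls below the preset
    threshold or a preset number of passes over the samples is reached."""
    for _ in range(max_epochs):
        for first, second, first_gt, second_gt in loader:
            pred_first = detector(first)      # first target detection boxes
            pred_second = detector(second)    # second target detection boxes
            loss = loss_fn(pred_first, pred_second, first_gt, second_gt)
            optimizer.zero_grad()
            loss.backward()                   # back-propagate the target loss
            optimizer.step()                  # adjust network parameter values
            if loss.item() < loss_threshold:  # network iteration stop condition
                return detector
    return detector
```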
In this way, once a picture to be detected is acquired and input into the trained target detection network, the corresponding target detection result can be obtained quickly. The target detection result indicates information such as the position and size of the target object in the picture to be detected; in a specific application, it can be presented in the picture as a target detection box.
To make the network adapt better to different but similar picture inputs and predict more stable target detection boxes, the embodiments of the present disclosure constrain, on one hand, the outputs for the two picture samples against each other and, on the other hand, the output for each picture sample against its annotations. That is, the target loss function value may be determined both by a first sub-target loss function value, obtained from a difference operation between the first and second target detection boxes, and by a second sub-target loss function value, obtained from a difference operation between the first target detection box and the first target annotation box of the first picture sample and a difference operation between the second target detection box and the second target annotation box of the second picture sample.
The first sub-target loss function value enforces consistency between the prediction results of the two picture samples, and the second sub-target loss function value enforces consistency between the prediction result of each picture sample and its annotation; together, these two consistency constraints greatly improve the adaptability of the network.
In the embodiments of the present disclosure, the target loss function value may be determined as a weighted sum of the two sub-target loss function values above.
In practical applications, one first picture sample may correspond to several enhanced second picture samples, and the similarity between these second picture samples is relatively high, for example higher than a second preset threshold (e.g., 0.9), which supports subsequent network training. In this case, a corresponding second target detection box is obtained for each second picture sample.
To further improve the stability of the network, a consistency constraint may also be imposed among the prediction results of the plurality of second picture samples: a third sub-target loss function value is obtained from difference operations between the second target detection boxes corresponding to the plurality of second picture samples, and the target loss function value is then determined jointly from the three sub-target loss function values.
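The sketch below shows one way the three sub-target loss function values could be combined into the target loss function value; the smooth-L1 difference operation and the weights are assumptions, and the detection boxes are assumed already paired across samples.

```python
import itertools
import torch.nn.functional as F

def target_loss(pred_first, preds_second, gt_first, gts_second,
                w1=1.0, w2=1.0, w3=0.5):
    """Weighted sum of the three sub-target loss function values.

    pred_first   -- (N, 4) boxes for the first picture sample
    preds_second -- list of (N, 4) boxes, one per second picture sample
    gt_first / gts_second -- the matching annotation boxes
    """
    zero = pred_first.new_zeros(())
    # First sub-loss: consistency between first and second detection boxes.
    l1 = sum((F.smooth_l1_loss(pred_first, p) for p in preds_second), zero)
    # Second sub-loss: each detection box against its annotation box.
    l2 = F.smooth_l1_loss(pred_first, gt_first) + sum(
        (F.smooth_l1_loss(p, g) for p, g in zip(preds_second, gts_second)), zero)
    # Third sub-loss: consistency among the second detection boxes themselves.
    l3 = sum((F.smooth_l1_loss(a, b)
              for a, b in itertools.combinations(preds_second, 2)), zero)
    return w1 * l1 + w2 * l2 + w3 * l3
```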
Considering that in practical applications a picture sample may contain several target objects, the groups of detection boxes in the first and second picture samples that belong to the same target object may first be determined through an intersection-over-union operation before training proceeds, specifically as follows:
step one, determining the intersection-over-union ratio between a first target detection box and its corresponding second target detection box;
step two, in response to the intersection-over-union ratio being greater than the preset ratio, determining the target loss function value based on the first target detection box and the second target detection box.
The Intersection over Union (IoU) describes the degree of overlap between two detection boxes. As with set comparison in mathematics, it equals the number of elements in the intersection of two sets divided by the number of elements in their union. The two detection boxes here can be viewed as two sets of pixels, so their IoU equals the area of their overlapping region divided by the area of their union.
When the IoU between the two target detection boxes corresponding to the two picture samples is greater than a preset ratio (e.g., 0.7), the two boxes can be regarded as belonging to the same target; their predicted positions are then constrained to be as close as possible, which achieves a better consistency-constraint effect and further improves the stability of the network.
In a specific application, for the two target detection boxes in a detection-box group, the corner coordinates of each box can be used to locate it in the picture sample; the area of the overlapping region of the two boxes is then computed from these coordinates and divided by the area of their union to obtain the IoU. When there are several target objects, the corner coordinates of each target detection box can likewise be used to determine the pairing, and thus the detection-box group corresponding to each target object.
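A sketch of the IoU computation and of grouping boxes that belong to the same target follows; the 0.7 gate is the example preset ratio above, while the greedy matching strategy is an assumption.

```python
def iou(a, b):
    """IoU of two corner-coordinate boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def same_target_groups(first_boxes, second_boxes, preset_ratio=0.7):
    """Greedily pair first/second detection boxes whose IoU exceeds the
    preset ratio; only such pairs enter the consistency loss."""
    pairs, used = [], set()
    for i, a in enumerate(first_boxes):
        best_iou, best_j = preset_ratio, None
        for j, b in enumerate(second_boxes):
            v = iou(a, b)
            if j not in used and v > best_iou:
                best_iou, best_j = v, j
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs
```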
To facilitate a further understanding of the network training method provided by the embodiments of the present disclosure, a further description is given with reference to fig. 2.
As shown in fig. 2, a pair of picture samples (a first picture sample and a second picture sample) are each input into the target detection network, yielding the first target detection box and the second target detection box output by the network; multiple rounds of training are then performed on the target detection network under the consistency constraints described above, until the trained target detection network is obtained.
Based on the network training method provided by the above embodiments, an embodiment of the present disclosure further provides a target detection method. As shown in fig. 3, the method includes the following steps:
S301: acquiring a picture to be detected;
S302: performing target detection on the picture to be detected by using the target detection network trained with the above network training method, to obtain a target detection result corresponding to the picture to be detected.
The picture to be detected can be obtained from different application scenarios; target detection is achieved simply by feeding the picture into the trained target detection network, which is simple and efficient. The detection process itself is as described above and is not repeated here.
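As a usage sketch at inference time (the checkpoint path, image path, and loading calls are hypothetical):

```python
import torch
from torchvision.io import read_image

# Hypothetical checkpoint saved as a full module during training.
detector = torch.load("trained_target_detector.pt")
detector.eval()

image = read_image("to_detect.jpg").float() / 255.0   # C x H x W in [0, 1]
with torch.no_grad():
    boxes = detector(image.unsqueeze(0))              # predicted detection boxes
print(boxes)  # each row gives the position and size of one detected target
```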
It will be understood by those skilled in the art that, in the above methods, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific order of execution should be determined by the functions of the steps and their possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a device corresponding to the method, and since the principle of solving the problem of the device in the embodiment of the present disclosure is similar to that of the method in the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 4, a schematic diagram of a network training apparatus provided by an embodiment of the present disclosure is shown. The apparatus includes: an obtaining module 401, a detection module 402, a determining module 403 and a training module 404, where:
the obtaining module 401 is configured to obtain a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, where the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
the detection module 402 is configured to input the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
the determining module 403 is configured to determine a target loss function value based on the first target detection box and the second target detection box;
and the training module 404 is configured to adjust network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
With this network training apparatus, during training, target detection is performed not only on the first picture sample but also on the second picture sample obtained by enhancement processing of the first picture sample; the target loss function value is then determined from the resulting first and second target detection boxes so as to train the target detection network for consistency. The second picture sample obtained through enhancement processing shares features with the first picture sample to a certain extent, and the consistency constraint between corresponding first and second target detection boxes enables the target detection network, to a certain extent, to achieve more robust detection performance on similar pictures, such as adjacent video frames.
In one possible implementation, the determining module 403 is configured to determine the target loss function value based on the first target detection box and the second target detection box by:
obtaining a first sub-target loss function value based on a difference operation between the first target detection box and the second target detection box; and
obtaining a second sub-target loss function value based on a difference operation between the first target detection box and a first target annotation box of the first picture sample and a difference operation between the second target detection box and a second target annotation box of the second picture sample;
determining the target loss function value based on the first sub-target loss function value and the second sub-target loss function value.
In a possible implementation, in a case where one first picture sample corresponds to a plurality of second picture samples and the similarity between the plurality of second picture samples is higher than a second preset threshold, the determining module 403 is configured to determine the target loss function value based on the first sub-target loss function value and the second sub-target loss function value by:
obtaining a third sub-target loss function value based on difference operations between the second target detection boxes respectively corresponding to the plurality of second picture samples;
and determining the target loss function value based on the first sub-target loss function value, the second sub-target loss function value and the third sub-target loss function value.
In one possible embodiment, the second target annotation box is determined from the first target annotation box.
In one possible implementation, the determining module 403 is configured to determine the target loss function value based on the first target detection box and the second target detection box by:
determining the intersection-over-union ratio between the first target detection box and the corresponding second target detection box;
and, in response to the intersection-over-union ratio being greater than the preset ratio, determining the target loss function value based on the first target detection box and the second target detection box.
In a possible implementation, the obtaining module 401 is configured to perform the enhancement processing on the first picture sample in at least one of the following ways:
performing picture processing on the first picture sample to obtain the second picture sample;
shifting the pixel value of each pixel in the first picture sample to a pixel position at a preset offset to obtain an updated pixel value for each pixel, and determining the second picture sample based on the updated pixel values;
obtaining a patch picture whose size is smaller than a preset threshold, and pasting the patch picture onto the first picture sample at a position other than the picture position indicated by the first target annotation box, to obtain the second picture sample;
and superimposing random noise on the first picture sample to obtain the second picture sample.
Referring to fig. 5, a schematic diagram of a target detection apparatus provided by an embodiment of the present disclosure is shown. The apparatus includes: an obtaining module 501 and a detection module 502, where:
the obtaining module 501 is configured to acquire a picture to be detected;
and the detection module 502 is configured to perform target detection on the picture to be detected by using the target detection network trained with the above network training method, to obtain a target detection result corresponding to the picture to be detected.
For the processing flow of each module in these apparatuses and the interaction flows between the modules, reference may be made to the related descriptions in the above method embodiments, which are not detailed here.
An embodiment of the present disclosure further provides an electronic device. As shown in fig. 6, a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, the electronic device includes: a processor 601, a memory 602, and a bus 603. The memory 602 stores machine-readable instructions executable by the processor 601 (for example, execution instructions corresponding to the obtaining module 401, the detection module 402, the determining module 403 and the training module 404 of the apparatus in fig. 4, or to the obtaining module 501 and the detection module 502 of the apparatus in fig. 5). When the electronic device runs, the processor 601 and the memory 602 communicate through the bus 603, and the machine-readable instructions are executed by the processor 601 to perform the network training method shown in fig. 1 or the target detection method shown in fig. 3.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the method described in the foregoing method embodiments, which may be referred to specifically for the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A method of network training, the method comprising:
obtaining a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, wherein the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
inputting the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
determining a target loss function value based on the first target detection box and the second target detection box;
and adjusting network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
2. The method of claim 1, wherein determining the target loss function value based on the first target detection box and the second target detection box comprises:
obtaining a first sub-target loss function value based on a difference operation between the first target detection box and the second target detection box; and
obtaining a second sub-target loss function value based on a difference operation between the first target detection box and a first target annotation box of the first picture sample and a difference operation between the second target detection box and a second target annotation box of the second picture sample;
determining the target loss function value based on the first sub-target loss function value and the second sub-target loss function value.
3. The method of claim 2, wherein, in a case where one first picture sample corresponds to a plurality of second picture samples and the similarity between the plurality of second picture samples is higher than a second preset threshold, determining the target loss function value based on the first sub-target loss function value and the second sub-target loss function value comprises:
obtaining a third sub-target loss function value based on difference operations between the second target detection boxes respectively corresponding to the plurality of second picture samples;
determining the target loss function value based on the first sub-target loss function value, the second sub-target loss function value, and the third sub-target loss function value.
4. The method of claim 2 or 3, wherein the second target annotation box is determined from the first target annotation box.
5. The method of any one of claims 1 to 4, wherein determining the target loss function value based on the first target detection box and the second target detection box comprises:
determining an intersection-over-union ratio between the first target detection box and the second target detection box;
and, in response to the intersection-over-union ratio being greater than a preset ratio, determining the target loss function value based on the first target detection box and the second target detection box.
6. The method according to any one of claims 1 to 5, wherein the enhancement processing is performed on the first picture sample in at least one of the following ways:
performing picture processing on the first picture sample to obtain the second picture sample;
shifting the pixel value of each pixel in the first picture sample to a pixel position at a preset offset to obtain an updated pixel value for each pixel, and determining the second picture sample based on the updated pixel values;
obtaining a patch picture whose size is smaller than a preset threshold, and pasting the patch picture onto the first picture sample at a position other than the picture position indicated by the first target annotation box, to obtain the second picture sample;
and superimposing random noise on the first picture sample to obtain the second picture sample.
7. A method of target detection, the method comprising:
acquiring a picture to be detected;
and performing target detection on the picture to be detected by using a target detection network trained with the network training method according to any one of claims 1 to 6, to obtain a target detection result corresponding to the picture to be detected.
8. An apparatus for network training, the apparatus comprising:
an obtaining module, configured to obtain a plurality of first picture samples and a second picture sample obtained by performing enhancement processing on each first picture sample, wherein the similarity between the second picture sample and the first picture sample is higher than a first preset threshold;
a detection module, configured to input the first picture sample and the second picture sample into a target detection network to be trained, to obtain a first target detection box for the first picture sample and a second target detection box for the second picture sample, both output by the target detection network;
a determining module, configured to determine a target loss function value based on the first target detection box and the second target detection box;
and a training module, configured to adjust network parameter values of the target detection network to be trained based on the target loss function value, to obtain a trained target detection network.
9. An apparatus for target detection, the apparatus comprising:
an obtaining module, configured to acquire a picture to be detected;
and a detection module, configured to perform target detection on the picture to be detected by using a target detection network trained with the network training method according to any one of claims 1 to 6, to obtain a target detection result corresponding to the picture to be detected.
10. An electronic device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the network training method according to any one of claims 1 to 6 or the steps of the target detection method according to claim 7.
11. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the network training method according to any one of claims 1 to 6 or the steps of the target detection method according to claim 7.
CN202111435769.2A 2021-11-29 2021-11-29 Network training method and device, and target detection method and device Pending CN114119964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111435769.2A CN114119964A (en) 2021-11-29 2021-11-29 Network training method and device, and target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111435769.2A CN114119964A (en) 2021-11-29 2021-11-29 Network training method and device, and target detection method and device

Publications (1)

Publication Number Publication Date
CN114119964A 2022-03-01

Family

ID=80371567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111435769.2A Pending CN114119964A (en) 2021-11-29 2021-11-29 Network training method and device, and target detection method and device

Country Status (1)

Country Link
CN (1) CN114119964A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210209392A1 (en) * 2019-02-01 2021-07-08 Beijing Sensetime Technology Development Co., Ltd. Image Processing Method and Device, and Storage Medium
US20210279511A1 (en) * 2020-03-05 2021-09-09 Google Llc Training neural networks using consistency measures
WO2021139279A1 (en) * 2020-07-30 2021-07-15 平安科技(深圳)有限公司 Data processing method and apparatus based on classification model, and electronic device and medium
CN112926673A (en) * 2021-03-17 2021-06-08 清华大学深圳国际研究生院 Semi-supervised target detection method based on consistency constraint

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jisoo Jeong et al., "Consistency-based semi-supervised learning for object detection", NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 31 December 2019 (2019-12-31) *
伍佳; 梅天灿: "Application of convolutional neural networks considering regional information in image semantic segmentation" (顾及区域信息的卷积神经网络在影像语义分割中的应用), Science Technology and Engineering (科学技术与工程), no. 21, 28 July 2018 (2018-07-28) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114838496A (en) * 2022-04-22 2022-08-02 江苏风神空调集团股份有限公司 Air conditioner silencer performance detection method based on artificial intelligence
CN114838496B (en) * 2022-04-22 2024-02-23 北京百车宝科技有限公司 Air conditioner muffler performance detection method based on artificial intelligence
CN115375987A (en) * 2022-08-05 2022-11-22 北京百度网讯科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN115375987B (en) * 2022-08-05 2023-09-05 北京百度网讯科技有限公司 Data labeling method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20210312214A1 (en) Image recognition method, apparatus and non-transitory computer readable storage medium
CN110933497A (en) Video image data frame insertion processing method and related equipment
CN104584079A (en) Device and method for augmented reality applications
CN114119964A (en) Network training method and device, and target detection method and device
Shokri et al. Salient object detection in video using deep non-local neural networks
CN109116129B (en) Terminal detection method, detection device, system and storage medium
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN103198311A (en) Method and apparatus for recognizing a character based on a photographed image
CN108960012B (en) Feature point detection method and device and electronic equipment
CN109783680B (en) Image pushing method, image acquisition device and image processing system
CN110991310A (en) Portrait detection method, portrait detection device, electronic equipment and computer readable medium
JP5832656B2 (en) Method and apparatus for facilitating detection of text in an image
CN111833285B (en) Image processing method, image processing device and terminal equipment
CN115205305A (en) Instance segmentation model training method, instance segmentation method and device
CN110689014B (en) Method and device for detecting region of interest, electronic equipment and readable storage medium
CN113688839B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN111428740A (en) Detection method and device for network-shot photo, computer equipment and storage medium
CN107357422A (en) Video camera projection interaction touch control method, device and computer-readable recording medium
KR102467036B1 (en) Method and system for setting dynamic image threshold for detecting two-dimensional identification code
CN113887518A (en) Behavior detection method and device, electronic equipment and storage medium
CN113391779A (en) Parameter adjusting method, device and equipment for paper-like screen
CN113743219B (en) Moving object detection method and device, electronic equipment and storage medium
CN112149463A (en) Image processing method and device
CN114596580B (en) Multi-human-body target identification method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination