CN116884017A - Object detection method and device for large visual model, electronic equipment and storage medium - Google Patents

Object detection method and device for large visual model, electronic equipment and storage medium

Info

Publication number
CN116884017A
CN116884017A
Authority
CN
China
Prior art keywords
image
detected
feature
target
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310958995.1A
Other languages
Chinese (zh)
Inventor
冉祥
张宇
陈小川
刘欣冉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Micro Chain Daoi Technology Co ltd
Original Assignee
Beijing Micro Chain Daoi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Micro Chain Daoi Technology Co ltd filed Critical Beijing Micro Chain Daoi Technology Co ltd
Priority to CN202310958995.1A priority Critical patent/CN116884017A/en
Publication of CN116884017A publication Critical patent/CN116884017A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides an object detection method and apparatus for a large visual model, an electronic device, and a storage medium, applied in the field of image processing. The method acquires an image to be detected and a guide image; inputs the image to be detected into a trained neural network; generates a second text feature from the guide image; and extracts the target to be detected from the image to be detected according to the first text feature or the second text feature. When the target is extracted according to the second text feature, the correspondence between the second text feature and the target is added to the detection data set, thereby reducing the detection cost and detection time of the object detection model and improving detection efficiency.

Description

Object detection method and device for large visual model, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular to an object detection method and apparatus for a large visual model, an electronic device, and a storage medium.
Background
Object detection uses theories and methods from image processing, pattern recognition, and related fields to detect the target objects present in an image, determine their semantic categories, and mark their positions in the image.
In the prior art, object detection mainly comprises the following steps: 1. divide the image into different regions; 2. feed the regions to a neural network, which classifies them into categories; 3. after each region has been assigned a category, combine all regions to obtain the original image annotated with the detected objects. The most important step is determining the classification of each region. However, existing object detection algorithms can only learn the classes already present in the detection data set; when new classes must be added, new bounding boxes must be annotated and a new category catalogue trained, which requires expensive training resources and tedious manual annotation. As a result, existing object detection models have high detection cost, long detection time, and low detection efficiency.
Disclosure of Invention
In view of the shortcomings of the prior art, the present application provides an object detection method and apparatus for a large visual model, an electronic device, and a storage medium, applied in the field of image processing. A guide image matching the image to be detected is acquired, a second text feature is generated from the guide image, and the image to be detected is input into a trained neural network. The neural network includes a detection data set containing correspondences between first text features and first image features, so the target to be detected can be extracted from the image to be detected according to the first text feature or the second text feature. When the target is extracted according to the second text feature, the new correspondence is stored in the detection data set, so that later detections of targets of the same type save detection time and improve detection efficiency. By extracting the target through the trained neural network and the guide image, the application avoids the prior-art step of manually annotating each newly added class, saving detection time and improving detection efficiency.
In a first aspect, the present application provides an object detection method for a large visual model, the method comprising the following steps:
acquiring an image to be detected and a guide image of the same kind as the target to be detected in the image to be detected;
inputting the image to be detected into a trained neural network and extracting the target to be detected from the image to be detected, wherein the neural network includes a detection data set, and the detection data set includes a correspondence between a first text feature and a first image feature;
wherein the step of inputting the image to be detected into the trained neural network and extracting the target to be detected comprises:
generating a second text feature from the guide image;
extracting the target to be detected from the image to be detected according to the first text feature or the second text feature; and
when the target to be detected is extracted according to the second text feature, adding the correspondence between the second text feature and the target to be detected to the detection data set.
With the above object detection method for a large visual model, the image to be detected and a guide image of the same kind as the target to be detected are acquired, so that the target can be extracted from the image to be detected according to the guide image, eliminating the tedious manual annotation of new classes. The image to be detected is input into a trained neural network that includes a detection data set containing correspondences between first text features and first image features; this network performs the object detection. A second text feature is generated from the guide image, so that the guide image corresponds to the second text feature. The target is then extracted according to the first text feature or the second text feature. Specifically, after the image to be detected is input, the network cannot confirm and output an exact target from the existing detection data set, so it outputs several candidate images that may contain the target. Similarity values are then computed between the second image features of these candidates and the first and second text features, and the image corresponding to the second image feature with the largest similarity value is taken as the target to be detected. Matching text features to image features in this way removes the manual annotation process and improves detection efficiency.
When the target to be detected is extracted via the second text feature, the correspondence between the second text feature and the target is added to the detection data set. The new class of target and its second text feature thus become experience for the neural network to learn from, so that targets of the same type can be extracted more quickly in later detections. The method therefore reduces the cost and time of object detection and improves detection efficiency without manual annotation of new classes.
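The data-set update described above can be sketched as a mapping from text features to target image features. This is an illustrative assumption only: the patent does not specify a storage format, and the tuple-keyed dictionary and feature values below are made up for demonstration.

```python
# Illustrative sketch: the detection data set is modeled as a dictionary
# mapping a text feature (a tuple of numbers) to the image feature of a
# target of that class. All feature values are toy data.
detection_set = {
    (1.0, 0.0): (0.9, 0.1),   # an existing first-text/first-image pair
}

def add_correspondence(dataset, text_feature, target_feature):
    # After a target has been extracted via the second text feature
    # (from the guide image), store the new pair so the same class can
    # be recognised directly next time.
    dataset[tuple(text_feature)] = tuple(target_feature)
    return dataset

# A newly detected class is added as a second-text-feature/target pair.
add_correspondence(detection_set, (0.0, 1.0), (0.2, 0.8))
```

On the next detection of this class, the stored pair is found directly, which is the "experience" the method reuses to skip the guide-image step.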
Preferably, in the object detection method for a large visual model provided by the present application, the step of acquiring the image to be detected and a guide image of the same kind as the target to be detected comprises:
acquiring the image to be detected and inputting a third text feature according to the image to be detected;
obtaining the result of extracting the target to be detected from the image to be detected according to the third text feature; and
when the extraction result does not match the expected result, acquiring a guide image of the same kind as the target to be detected.
With the above method, since the detection data set is generally large, the user usually cannot tell whether the image to be detected belongs to a newly added class. To shorten the detection time, it is therefore determined in advance whether an image of the same kind as the target already exists in the detection data set. Specifically, the user inputs a third text describing the image to be detected, a text editor encodes this third text into a third text feature, and the image to be detected and the third text feature are input into the neural network. If the third text feature matches a first text feature already in the detection data set, the image to be detected belongs to an existing class, and the network can extract the target directly from the third text feature. If the third text feature cannot be matched to any first text feature, there are two possibilities: either the image to be detected belongs to a new class, or the user's third text describes the target inaccurately; in both cases the network cannot accurately extract the target.
Preferably, in the object detection method for a large visual model provided by the present application, after the step of extracting the target to be detected according to the first text feature or the second text feature, the method further comprises:
when the target to be detected is extracted according to the first text feature, marking the correspondence between the third text feature and the target to be detected; and
adding the correspondence between the third text feature and the target to be detected to the detection data set.
When the target is extracted according to the first text feature, the image to be detected is not a newly added class; rather, the user's third text described the target inaccurately. The target can therefore be associated with the third text feature, and the correspondence between them added to the detection data set. This expands the annotation boxes of the target and improves the detection accuracy of the neural network.
Preferably, the step of extracting the target to be detected from the image to be detected according to the first text feature or the second text feature comprises:
cropping the image to be detected to obtain a plurality of cropped-region images;
acquiring second image features of the plurality of cropped-region images; and
calculating similarity values between the second image features and the first and second text features, and extracting the cropped-region image corresponding to the second image feature with the largest similarity value as the target to be detected.
With this method, the image to be detected is cropped into a plurality of cropped-region images of different sizes and aspect ratios, among which are crops containing the target to be detected. Because this network cannot extract the target as precisely as a mature, dedicated detection network, second image features are obtained for all crops and their similarity to the first and second text features is computed; the larger the similarity value, the more accurately a crop contains the target. The crop corresponding to the second image feature with the largest similarity value is then extracted as the target, improving the accuracy of target extraction.
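The crop-selection step can be sketched as follows. The patent does not specify the similarity measure; cosine similarity is assumed here for illustration, and the 3-dimensional feature vectors are toy data.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two feature vectors; larger means more alike.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_target(crop_features, text_features):
    # Score every cropped region's (second) image feature against every
    # available text feature (first and second alike); the crop with the
    # single largest similarity value is taken as the target.
    best_idx, best_score = -1, float("-inf")
    for i, img_feat in enumerate(crop_features):
        for txt_feat in text_features:
            score = cosine_similarity(img_feat, txt_feat)
            if score > best_score:
                best_idx, best_score = i, score
    return best_idx

# Toy 3-dimensional features: crop 1 aligns best with the text feature.
crops = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
texts = [[1.0, 0.0, 0.0]]  # e.g. the second text feature from the guide image
target_index = select_target(crops, texts)  # -> 1
```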
Preferably, in the object detection method for a large visual model provided by the present application, the step of cropping the image to be detected to obtain a plurality of cropped-region images comprises:
acquiring size information and aspect-ratio information of initial detection boxes according to the image to be detected;
generating the initial detection boxes from the size information and the aspect-ratio information; and
cropping the image to be detected according to the initial detection boxes to obtain the plurality of cropped-region images.
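One way to realise these steps is an anchor-style sweep: build a box for each (size, aspect ratio) pair and slide it over the image. The patent does not state how the boxes are placed; the stride-based grid below is an assumption for illustration.

```python
def generate_crops(img_w, img_h, sizes, aspect_ratios, stride):
    # Build an initial detection box for every (size, aspect ratio)
    # combination and slide it over the image at the given stride,
    # keeping only boxes (x, y, w, h) that lie fully inside the image.
    crops = []
    for size in sizes:
        for ar in aspect_ratios:
            w = int(size * ar ** 0.5)   # wider boxes for ar > 1
            h = int(size / ar ** 0.5)
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    crops.append((x, y, w, h))
    return crops

# Toy 64x64 image, one size, two aspect ratios.
boxes = generate_crops(64, 64, sizes=[32], aspect_ratios=[1.0, 2.0], stride=16)
```

Each resulting rectangle is then cropped from the image to produce the cropped-region images scored in the similarity step.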
Preferably, in the object detection method for a large visual model, the step of inputting the image to be detected into the trained neural network and extracting the target to be detected further comprises:
acquiring base-class images from the detection data set;
inputting the base-class images into a text editor, a prediction neural network, and a pre-trained neural network to obtain, respectively, a first text feature, a first image feature, and third image features;
calculating similarity values between the third image features and the first image feature, and selecting the third image feature with the largest similarity value;
calculating a cross-entropy function between the selected third image feature and the first text feature, and training the pre-trained neural network according to the cross-entropy function to obtain the trained neural network; and
inputting the image to be detected into the trained neural network.
Preferably, in the object detection method for a large visual model provided by the present application, the step of calculating the cross-entropy function of the third image feature with the largest similarity value and the first text feature, and training the pre-trained neural network according to the cross-entropy function to obtain the trained neural network, comprises:
calculating a loss function between the third image feature with the largest similarity value and the first image feature; and
calculating the cross-entropy function between the third image feature with the largest similarity value and the first text feature, and training the pre-trained neural network according to the cross-entropy function and the loss function to obtain the trained neural network.
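The cross-entropy term used in training can be sketched as softmax cross entropy over similarity scores. The exact loss formulation is not given in the patent; this is the standard definition, with toy similarity values, and index 0 assumed to be the correct class.

```python
import math

def cross_entropy(similarities, target_idx):
    # Softmax cross entropy over similarity scores: the loss is small
    # when the correct class already receives most of the probability
    # mass, and large otherwise.
    m = max(similarities)                      # subtract max for stability
    exps = [math.exp(s - m) for s in similarities]
    prob_correct = exps[target_idx] / sum(exps)
    return -math.log(prob_correct)

# Similarities between the selected third image feature and each first
# text feature; index 0 is assumed to be the correct class.
sims = [3.0, 0.5, 0.1]
loss = cross_entropy(sims, 0)
```

Minimising this loss pushes the pre-trained network's image features toward the text feature of the correct class, which is the alignment the training step relies on.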
In a second aspect, the present application provides an object detection apparatus for a large visual model, the apparatus comprising:
an acquisition module, for acquiring an image to be detected and a guide image of the same kind as the target to be detected in the image to be detected;
an input module, for inputting the image to be detected into a trained neural network and extracting the target to be detected, wherein the neural network includes a detection data set containing a correspondence between a first text feature and a first image feature;
the input module comprising a generation module and an extraction module:
the generation module, for generating a second text feature from the guide image;
the extraction module, for extracting the target to be detected from the image to be detected according to the first text feature or the second text feature; and
an adding module, for adding the correspondence between the second text feature and the target to be detected to the detection data set when the target is extracted according to the second text feature.
In a third aspect, the application provides an electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method as provided in the first aspect above.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as provided in the first aspect above.
Beneficial effects: with the object detection method and apparatus, electronic device, and storage medium for a large visual model provided by the present application, the image to be detected and a guide image of the same kind as the target to be detected are acquired, so that the target can be extracted from the image to be detected according to the guide image, eliminating the tedious manual annotation of new classes. The image to be detected is input into a trained neural network that includes a detection data set containing correspondences between first text features and first image features; this network performs the object detection. A second text feature is generated from the guide image, so that the guide image corresponds to the second text feature. The target is then extracted according to the first text feature or the second text feature: after the image to be detected is input, the network cannot confirm an exact target from the existing detection data set, so it outputs several candidate images, similarity values are computed between the candidates' second image features and the first and second text features, and the candidate corresponding to the largest similarity value is taken as the target. Matching text features to image features in this way removes the manual annotation process and improves detection efficiency.
When the target is extracted via the second text feature, the correspondence between the second text feature and the target is added to the detection data set as experience for the neural network to learn from, so that targets of the same type can be extracted more quickly later. The method thus reduces the detection cost and detection time of the object detection model and improves detection efficiency without manual annotation of new classes.
Drawings
Fig. 1 is a flowchart of the object detection method for a large visual model provided by the present application.
Fig. 2 is a flowchart of step A2 in the object detection method for a large visual model provided by the present application.
Fig. 3 is a block diagram of the object detection apparatus for a large visual model provided by the present application.
Fig. 4 is a schematic structural diagram of the electronic device provided by the present application.
Description of the reference numerals: 201. acquisition module; 202. input module; 203. generation module; 204. extraction module; 205. adding module; 301. processor; 302. memory; 303. communication bus; 3. electronic device.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The following disclosure provides many different embodiments or examples for accomplishing the objectives of the present application and solving the problems of the prior art. In existing object detection methods, when a new class is added, tedious manual annotation of the new class is usually required before the target to be detected can be accurately extracted. To solve this problem, the present application provides an object detection method and apparatus, electronic device, and storage medium for a large visual model, as follows:
Referring to fig. 1 and fig. 2, the object detection method for a large visual model according to an embodiment of the present application is applied in the field of image processing. The image to be detected is input into a trained neural network, and a guide image of the same kind as the target to be detected is acquired at the same time. The target is extracted according to the second text feature of the guide image or a first text feature included in the neural network: the image whose image feature is most similar to the first or second text feature is found and output as the target to be detected. When the target is extracted according to the second text feature, the correspondence between the second text feature and the target is added to the detection data set for storage, facilitating subsequent use, reducing detection time, and improving detection efficiency.
The object detection method for the large visual model comprises the following steps:
A1: acquiring an image to be detected and a guide image of the same kind as the target to be detected in the image to be detected;
A2: inputting the image to be detected into a trained neural network and extracting the target to be detected, wherein the neural network includes a detection data set containing a correspondence between a first text feature and a first image feature;
wherein the step of inputting the image to be detected into the trained neural network and extracting the target comprises:
A21: generating a second text feature from the guide image;
A22: extracting the target to be detected from the image to be detected according to the first text feature or the second text feature;
A3: when the target to be detected is extracted according to the second text feature, adding the correspondence between the second text feature and the target to the detection data set.
In step A1, the image to be detected is an image input by the user; it contains the target to be detected together with background unrelated to the target. The target to be detected is the part the user expects to segment out of the image. The guide image is an image, input by the user according to the target, that is of the same kind as the target: it contains only the target and no unrelated background. For example, if the expected target is a white cat, the guide image to be input is an image of a white cat, whose angle and size may be the same as or different from those of the target in the image to be detected.
In practice, because the detection data set is huge, when the image to be detected is acquired the user cannot judge whether it belongs to a newly added class or to a class already in the detection data set. If the class already exists, the user only needs to input a third text describing the target, and the object detection network can accurately extract a target matching the expectation. If the image belongs to a new class, or the user's third text describes the target inaccurately so that the network cannot output the target correctly, a guide image can be acquired so that the network can extract the target according to the second text feature of the guide image or the first text feature. Therefore, to save extraction time and avoid unnecessarily resorting to the guide image, the image to be detected is first tested to see whether the target can be extracted directly from the input third text; only when this fails is a guide image acquired, saving detection time and improving detection efficiency. Accordingly, in some embodiments, the step of acquiring the image to be detected and a guide image of the same kind as the target comprises:
acquiring the image to be detected and inputting a third text feature according to the image to be detected;
obtaining the result of extracting the target to be detected from the image to be detected according to the third text feature; and
when the extraction result does not match the expected result, acquiring a guide image of the same kind as the target to be detected.
In the step of acquiring the image to be detected and the third text feature input according to it, the user inputs, according to the image to be detected, a third text that semantically describes the target to be detected, and the third text is input into a text editor to obtain the third text feature. The third text feature is in fact a numerical array converted from the semantics of the third text, and text segments with similar meanings have similar numerical-array representations. Therefore, if the image to be detected belongs to a class existing in the detection data set, there is a first text feature in the data set similar to the third text feature. The detection data set stores pairs of first text features and first image features (for example, the first text feature representing "white cat" and the first image feature representing "white cat" form one text-image pair, stored in the detection data set through manual labeling), so the target to be detected can be output directly from the image to be detected according to the first image feature corresponding to that first text feature, which greatly saves the time for extracting the target to be detected.
If the result extracted by the neural network according to the third text feature does not match the expected target to be detected (the user can judge with the naked eye whether the output is the expected target), two situations are possible: first, the image to be detected belongs to a new class; second, the third text describing the target, input by the user, is inaccurate. Either situation causes the image extracted by the neural network to be inconsistent with the expected target to be detected.
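The judgment described above — first comparing the third text feature against the first text features already stored in the detection data set, and falling back to a guiding image only when no stored class matches — can be sketched as follows. All feature vectors, class names and the similarity threshold here are hypothetical illustrations, not values from the application; real features would come from the text editor.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two numerical-array features."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical detection data set: class name -> (first text feature, first image feature).
detection_data_set = {
    "white cat": ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    "grape":     ([0.1, 0.9, 0.2], [0.2, 0.9, 0.1]),
}

def find_known_class(third_text_feature, threshold=0.9):
    """Return the name of a stored class whose first text feature is similar
    to the third text feature, or None (new class -> acquire a guiding image)."""
    best_name, best_sim = None, threshold
    for name, (text_feat, _image_feat) in detection_data_set.items():
        sim = cosine_similarity(third_text_feature, text_feat)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

# A feature close to the stored "white cat" text feature hits the data set;
# an unrelated feature falls through, signalling that a guiding image is needed.
assert find_known_class([0.88, 0.12, 0.01]) == "white cat"
assert find_known_class([0.0, 0.1, 0.95]) is None
```

Since semantically similar texts map to similar arrays, a simple similarity threshold is enough to decide between the direct third-text path and the guiding-image path.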
In step A2, the neural network for object detection is obtained by training with an existing prediction neural network that has the object detection function. However, the model of such an existing prediction neural network is usually huge and is not suitable for integration into small and medium-sized detection systems, so a powerful existing prediction neural network (such as the CLIP model, i.e. the Contrastive Language-Image Pre-training model) can be used to train a pre-training neural network, so that the pre-training neural network acquires an object detection function similar to CLIP but with a smaller model size suitable for most systems. Further, in some specific embodiments, the step of inputting the image to be detected into the trained neural network and extracting the target to be detected from the image to be detected further includes:
acquiring a basic class image to be detected in a detection data set;
inputting the base class image to be detected into a text editor, a prediction neural network and a pre-training neural network respectively to correspondingly obtain a first text feature, a first image feature and a third image feature;
calculating the similarity value of the third image feature and the first image feature, and obtaining the third image feature with the maximum similarity value;
calculating a cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function to obtain a trained neural network;
and inputting the image to be detected into the neural network after training.
In the step of acquiring the base class image to be detected in the detection data set, the detection data set includes first text features and first image features with a correspondence between them, consistent with the image-text pair relation in the CLIP model. The base class image to be detected includes a base class target to be detected and a background unrelated to it, and the base class target matches a first text feature-first image feature pair in the detection data set. To obtain the base class image to be detected, it is only necessary to obtain from the detection data set the image class corresponding to an existing first text feature-first image feature pair and, according to that class, select an image containing the class as the base class image to be detected. For example, a group of first text feature-first image feature pairs whose semantic text and image both express "grape" is randomly selected from the detection data set, and an image including a grape is selected as the base class image to be detected. Of course, multiple groups of first text feature-first image feature pairs can be selected to obtain multiple base class images to be detected, increasing the number of samples and thus better facilitating the training of the pre-training neural network.
The text editor is an existing tool capable of converting semantic text into a numerical array; the prediction neural network can be the CLIP model or another neural network model with an object detection function; the pre-training neural network is the neural network that needs to be trained to acquire the object detection function of the prediction neural network. The base class image to be detected is input into the text editor, the prediction neural network and the pre-training neural network respectively to correspondingly obtain the first text feature, the first image feature and the third image feature. The similarity of the first image feature and the third image feature is then calculated, the third image feature with the maximum similarity value is obtained, the cross entropy function of that third image feature and the first text feature is calculated, and the pre-training neural network is trained according to the cross entropy function. Training the pre-training neural network continuously in this way makes the third image feature obtained through it ever more similar to the first image feature, so that the pre-training neural network acquires the object detection function of the prediction neural network.
The actual meaning of this process is as follows. The base class image to be detected serves as a training sample. First, the accurate first text feature and the first image feature corresponding to it are obtained through the text editor and the standard prediction neural network (that is, the base class target of the base class image to be detected is output). The base class image to be detected is then input into the pre-training neural network; since the untrained network does not yet have the function of accurately extracting the base class target, its output result (the third image feature) differs considerably from the first image feature. To continuously optimize the output of the pre-training neural network, the similarity value of each third image feature and the first image feature is calculated, and the third image feature with the maximum similarity value is found; this third image feature is the one most similar to the first image feature. In practice, however, a perfectly accurate third image feature cannot be obtained, and a difference between the third image feature and the first image feature always remains. To address this, the cross entropy function of the third image feature with the maximum similarity value and the first text feature is further calculated, and the pre-training neural network is trained according to the cross entropy function, so that the third image feature it outputs approaches the first image feature ever more closely and the network thereby acquires an accurate object detection function.
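The two training signals just described — keeping the third image feature most similar to the first image feature, then applying a cross entropy term against the first text feature — can be sketched numerically as follows. The feature vectors, the distractor class and the two-class softmax framing are hypothetical simplifications for illustration; a real implementation would use the actual outputs of the text editor, the prediction network and the pre-training network.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical features for one base class image ("grape"):
first_text_feature = [0.1, 0.9, 0.2]    # from the text editor
first_image_feature = [0.2, 0.9, 0.1]   # from the prediction network (teacher)
# Candidate third image features from the untrained pre-training network:
third_image_features = [[0.7, 0.2, 0.5], [0.3, 0.8, 0.2], [0.1, 0.2, 0.9]]

# Step 1: keep the third image feature most similar to the first image feature.
sims = [dot(f, first_image_feature) for f in third_image_features]
best = third_image_features[sims.index(max(sims))]

# Step 2: cross entropy between the kept feature and the first text feature,
# framing the correct class as the target of a softmax over text-feature
# similarities -- the quantity the pre-training network is trained to minimise.
logits = [dot(best, first_text_feature),   # "grape" (correct class)
          dot(best, [0.9, 0.1, 0.0])]      # a hypothetical distractor class
cross_entropy = -math.log(softmax(logits)[0])

assert best == [0.3, 0.8, 0.2]
assert cross_entropy < math.log(2)  # better than a random guess over two classes
```

Driving this cross entropy down over many base class samples pulls the pre-training network's third image features toward the teacher's first image features.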
Further, in some specific embodiments, training the pre-training neural network according to the cross entropy function alone is insufficient to meet the requirements of object detection. To address this, the step of calculating the cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function to obtain the trained neural network, includes:
calculating a loss function of the third image feature with the maximum similarity value and the first image feature;
and calculating a cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function and the loss function to obtain the trained neural network.
In practical application, in order to better train the pre-training neural network, the loss function of the third image feature with the maximum similarity value and the first image feature can first be calculated; this loss function quantifies the difference between them. At the same time, the cross entropy function of that third image feature and the first text feature is calculated, and the pre-training neural network is trained according to both the cross entropy function and the loss function to obtain the trained neural network, thereby improving the detection accuracy of the neural network.
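How the loss function and the cross entropy function might be combined can be sketched as follows. The MSE form of the feature loss, the weight `alpha` and all numeric values are assumptions for illustration, since the application does not fix a particular loss form or combination rule.

```python
def mse_loss(a, b):
    """Mean squared error between two feature vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical values (none of these numbers come from the application):
third_image_feature = [0.3, 0.8, 0.2]   # best-matching pre-training-network output
first_image_feature = [0.2, 0.9, 0.1]   # prediction-network (teacher) output
cross_entropy = 0.497                   # text-side term from the previous step

# One plausible combination: a weighted sum of the text-side cross entropy
# and the image-side feature loss.
alpha = 0.5
feature_loss = mse_loss(third_image_feature, first_image_feature)
total_loss = cross_entropy + alpha * feature_loss

assert abs(feature_loss - 0.01) < 1e-9
assert total_loss > cross_entropy  # the feature term tightens the objective
```

Minimising the combined objective penalises both a wrong class assignment (cross entropy) and a feature-level mismatch with the teacher (loss function).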
Wherein, in step A21, the second text feature is obtained by inputting the guiding image into the text editor.
In step a22, the first text feature represents text features of all base class images to be detected in the detection database, the second text feature is text feature of the guiding image, and the step of extracting the target to be detected from the image to be detected according to the first text feature or the second text feature includes:
cutting an image to be detected to obtain a plurality of cutting area images;
acquiring a second image feature of the plurality of cropped region images;
and calculating similarity values of the second image features, the first text features and the second text features, and extracting a clipping region image corresponding to the second image feature with the largest similarity value as a target to be detected.
In practical application, the target to be detected is only a part of the image to be detected. Therefore, in order to extract it better, the image to be detected can be cropped to obtain a plurality of cropped region images, among which are images containing the target to be detected. The plurality of cropped region images are then input into the trained neural network to obtain the second image feature of each cropped region image, and the similarity values of each second image feature with the first text feature and with the second text feature are calculated: each second image feature is dot-multiplied with the first text feature and with the second text feature respectively, and the dot product is the similarity value. All dot products are compared together, and the cropped region image corresponding to the second image feature with the maximum similarity value is extracted as the target to be detected, which saves detection time and improves detection efficiency.
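The dot-product selection over cropped region images can be sketched as follows. The three crop features and the two text features are hypothetical numbers standing in for the outputs of the trained neural network and the text editor.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical second image features of three cropped region images, plus the
# first text feature (stored class) and second text feature (guiding image).
crop_features = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3], [0.2, 0.2, 0.9]]
first_text_feature = [0.1, 0.9, 0.2]
second_text_feature = [0.0, 0.7, 0.4]

# Dot-multiply every crop feature with both text features and keep the crop
# achieving the single largest product over either comparison.
best_index, best_sim = -1, float("-inf")
for i, crop in enumerate(crop_features):
    for text in (first_text_feature, second_text_feature):
        sim = dot(crop, text)
        if sim > best_sim:
            best_index, best_sim = i, sim

assert best_index == 1  # the second crop is extracted as the target to be detected
```

Because both text features compete in the same comparison, the crop matching either the stored class or the guiding image wins, without the user needing to know in advance which path applies.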
Further, in some specific embodiments, the step of clipping the image to be detected to obtain a plurality of clipping region images includes:
acquiring size information and length-width ratio information of an initial detection frame according to an image to be detected;
generating an initial detection frame according to the size information and the aspect ratio information;
and cutting the image to be detected according to the initial detection frame, and obtaining a plurality of cutting area images.
In practical application, the size information and the aspect ratio information of the initial detection frame are set by the user. In general, the size information is smaller than the size of the image to be detected, and the aspect ratio information may be 2:1, 1:1, 1:2, and so on; it is understood that in practical application the specific values are not limited to these, and the user sets them according to the actual situation. In order to extract the region of interest and obtain the cropped region images, a plurality of (for example, 3) initial detection frames are generally used to traverse all possible pixel positions on the image to be detected. If this approach is adopted, however, the number of initial detection frames generated on one picture is enormous: a single image to be detected of size 224 x 224 generates 224 x 224 x 3 initial detection frames and therefore 224 x 224 x 3 cropped region images, which is very unfavorable for the efficiency of object detection. To address this, the image to be detected can first be input into a backbone neural network to extract a feature image of the image to be detected; for example, an image of size 224 x 224 input into the backbone and downsampled 5 times yields a 7 x 7 feature image, and the feature image is then cropped by ROI (region of interest) cropping with the initial detection frames. For this purpose, the size information and the aspect ratio information of the initial detection frame need to be set relative to the feature image, with the size of the feature image larger than the size of the initial detection frame. In order to ensure that the initial detection frames cover the feature image as much as possible, a center point can be selected on the feature image, and 3 different pieces of size information and 3 different pieces of aspect ratio information are set with that point as the center of the initial detection frames, so that 3 x 3 = 9 initial detection frames, and therefore 9 cropped region images, can be obtained. It will be appreciated that the data in the above example is for reference only; in practical applications the actual data is set by the user according to the specific situation. This method ensures that the extracted cropped region images cover the image to be detected as much as possible while reducing the number of cropped region images generated, thereby improving the detection efficiency of object detection.
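The anchor arrangement in the example above (a 224 x 224 image downsampled 5 times to a 7 x 7 feature image, with 3 sizes x 3 aspect ratios = 9 initial detection frames around a center point) can be sketched as follows; the particular size and ratio values are illustrative assumptions.

```python
def generate_anchors(center, sizes, ratios):
    """Return (cx, cy, w, h) initial detection frames for every
    size/aspect-ratio combination around one center point."""
    boxes = []
    for s in sizes:
        for r in ratios:              # r = width / height
            w = s * r ** 0.5          # preserve the area s*s for every ratio
            h = s / r ** 0.5
            boxes.append((center[0], center[1], w, h))
    return boxes

# 5 stride-2 downsamplings take a 224 x 224 image to a 7 x 7 feature image.
assert 224 // 2 ** 5 == 7

center = (3.5, 3.5)                   # center of the 7 x 7 feature image
sizes = (2.0, 4.0, 6.0)               # 3 pieces of size information (illustrative)
ratios = (0.5, 1.0, 2.0)              # aspect ratios 1:2, 1:1, 2:1

anchors = generate_anchors(center, sizes, ratios)
assert len(anchors) == 9              # 3 x 3 = 9 initial detection frames
# Every frame keeps the requested area regardless of its aspect ratio.
assert all(abs(w * h - sizes[i // 3] ** 2) < 1e-9
           for i, (_, _, w, h) in enumerate(anchors))
```

Compared with sliding 3 frames over every pixel of the full-resolution image (224 x 224 x 3 crops), 9 frames on the feature image cut the candidate count by several orders of magnitude.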
According to the object detection method of the large visual model, the image to be detected and a guiding image of the same kind as the target to be detected in it are acquired, so that the target to be detected can be conveniently extracted from the image to be detected according to the guiding image, omitting the tedious step of manually labeling new classes. The image to be detected is input into the trained neural network, which includes a detection data set containing the correspondence between first text features and first image features, and the neural network performs object detection on the image to be detected. A second text feature is generated according to the guiding image, so that the guiding image corresponds to the second text feature. The target to be detected is extracted from the image to be detected according to the first text feature or the second text feature. Specifically, after the image to be detected is input into the trained neural network, the network may be unable to confirm and output an accurate target from the existing detection data set, and therefore outputs a plurality of images that may contain the target to be detected; similarity calculation is then performed between the second image features of these images and the first text feature and second text feature, and the image corresponding to the second image feature with the maximum similarity value is the target to be detected. By associating text features with image features in this way, the manual labeling process is saved and detection efficiency is improved.
When the target to be detected is extracted according to the second text feature, the correspondence between the second text feature and the target to be detected is added to the detection data set; the new class of target and its second text feature thus become experience for the neural network to learn from, so that the same class of target can be extracted more quickly in later detections. The object detection method therefore reduces the detection cost and detection time of the object detection model and improves detection efficiency without manually labeling new classes.
Referring to fig. 3, the object detection device for a large visual model provided by the present application includes:
the acquisition module 201: the method comprises the steps of acquiring an image to be detected and a guiding image of the same kind as a target to be detected in the image to be detected;
the input module 202: the method comprises the steps that an image to be detected is input into a neural network after training is finished, a target to be detected is extracted from the image to be detected, the neural network comprises a detection data set, and the detection data set comprises a corresponding relation between a first text feature and a first image feature;
the input module 202 includes a generating module 203 and an extracting module 204:
the generation module 203: generating a second text feature from the guide image;
extraction module 204: the method comprises the steps of extracting a target to be detected from an image to be detected according to a first text feature or a second text feature;
the adding module 205: and adding the corresponding relation between the second text feature and the object to be detected into the detection data set when the object to be detected is extracted according to the second text feature.
The acquisition module 201 may be a module in the object detection model for receiving user input. In practical application, the image to be detected and the guiding image of the same kind as the target to be detected are input manually by the user, and the acquisition module 201 acquires them. After acquiring them, the acquisition module 201 transfers the image to be detected to the input module 202, which inputs it into the trained neural network for operations such as image feature extraction; at the same time, the acquisition module 201 transfers the guiding image to the generation module 203. The generation module 203 is a text editor that generates the second text feature from the guiding image, representing the content of the guiding image in array form. The extraction module 204 then extracts the target to be detected from the image to be detected according to the first text feature or the second text feature, and the information on whether the target was obtained through first text feature extraction or second text feature extraction is transferred to the adding module 205. If the adding module 205 receives the information that the target to be detected was extracted according to the second text feature, it adds the correspondence between the second text feature and the target to be detected into the detection data set.
In practical application, because the detection data set is huge, when the acquisition module 201 acquires the image to be detected the user cannot judge whether it belongs to a newly added class or an existing class in the detection data set. If the image to be detected belongs to an existing class, then for object detection the input module 202 only needs to input the third text of the target to be detected into the neural network, and the neural network can accurately extract a target that matches the expectation. If the image to be detected belongs to a new class, or if the third text description input by the user is inaccurate so that the neural network cannot accurately output the target, the guiding image can be acquired so that the neural network can accurately extract the target according to the second text feature of the guiding image or the first text feature. Therefore, when the acquisition module 201 acquires the image to be detected, in order to save extraction time and avoid resorting to the guiding image unnecessarily, the image to be detected is first judged: only when the target cannot be extracted directly by inputting the third text does the acquisition module 201 further acquire the guiding image and extract the target through it, saving detection time and improving detection efficiency.
In practical application, to obtain the base class image to be detected, it is only necessary to obtain from the detection data set the image class corresponding to an existing first text feature-first image feature pair and, according to that class, select an image containing the class as the base class image to be detected. For example, a group of first text feature-first image feature pairs whose semantic text and image both express "grape" is randomly selected from the detection data set, and an image including a grape is selected as the base class image to be detected. Of course, multiple groups of first text feature-first image feature pairs can be selected to obtain multiple base class images to be detected, increasing the number of samples and facilitating the training of the pre-training neural network.
In practical application, in order to obtain the trained neural network, the base class image to be detected can be used as a training sample. The accurate first text feature and the first image feature corresponding to it are first obtained through the text editor and the standard prediction neural network (that is, the accurate base class target of the base class image to be detected is output). The base class image to be detected is then input into the pre-training neural network; since the untrained network does not yet have the function of accurately extracting the base class target, its output result (the third image feature) differs considerably from the first image feature. To continuously optimize the output of the pre-training neural network, the similarity value of each third image feature and the first image feature is calculated and the third image feature with the maximum similarity value is found; this is the third image feature most similar to the first image feature. In practice, however, a perfectly accurate third image feature cannot be obtained, and a difference between the third image feature and the first image feature always remains, so the cross entropy function of the third image feature with the maximum similarity value and the first text feature is further calculated, and the pre-training neural network is trained according to the cross entropy function so that its output approaches the first image feature ever more closely and the network acquires an accurate object detection function.
In practical application, in order to better train the pre-training neural network, the loss function of the third image feature with the maximum similarity value and the first image feature can first be calculated; this loss function quantifies the difference between them. At the same time, the cross entropy function of that third image feature and the first text feature is calculated, and the pre-training neural network is trained according to both the cross entropy function and the loss function to obtain the trained neural network, thereby improving the detection accuracy of the neural network.
In practical application, the target to be detected is only a part of the image to be detected. Therefore, in order to extract it better, the image to be detected can be cropped to obtain a plurality of cropped region images, among which are images containing the target to be detected. The cropped region images are input into the trained neural network to obtain the second image feature of each cropped region image; the similarity values of each second image feature with the first text feature and with the second text feature are then calculated by dot-multiplying each second image feature with the first text feature and with the second text feature respectively, the dot product being the similarity value; and the cropped region image corresponding to the second image feature with the maximum similarity value is extracted as the target to be detected, saving detection time and improving detection efficiency.
In practical application, in order to extract the region of interest and obtain the cropped region images, a plurality of (for example, 3) initial detection frames are generally used to traverse all possible pixel positions on the image to be detected. If this approach is adopted, however, the number of initial detection frames generated on one picture is enormous: a single image to be detected of size 224 x 224 generates 224 x 224 x 3 initial detection frames and therefore 224 x 224 x 3 cropped region images, which is very unfavorable for the efficiency of object detection. To address this, the image to be detected can first be input into a backbone neural network to extract a feature image; for example, an image of size 224 x 224 input into the backbone and downsampled 5 times yields a 7 x 7 feature image, which is then cropped by ROI (region of interest) cropping with the initial detection frames. For this purpose, the size information and the aspect ratio information of the initial detection frame need to be set relative to the feature image, with the size of the feature image larger than the size of the initial detection frame. In order to ensure that the initial detection frames cover the feature image as much as possible, a center point can be selected on the feature image, and 3 different pieces of size information and 3 different pieces of aspect ratio information are set with that point as the center, so that 3 x 3 = 9 initial detection frames, and therefore 9 cropped region images, can be obtained.
It will be appreciated that the data in the above examples is for reference only; in practical applications the actual data is set by the user according to the specific situation. This method ensures that the extracted cropped region images cover the image to be detected as much as possible while reducing the number of cropped region images generated, thereby improving the detection efficiency of object detection.
According to the object detection device of the large visual model, the image to be detected and a guiding image of the same kind as the target to be detected in it are acquired, so that the target to be detected can be conveniently extracted from the image to be detected according to the guiding image, omitting the tedious step of manually labeling new classes. The image to be detected is input into the trained neural network, which includes a detection data set containing the correspondence between first text features and first image features, and the neural network performs object detection on the image to be detected. A second text feature is generated according to the guiding image, so that the guiding image corresponds to the second text feature. The target to be detected is extracted from the image to be detected according to the first text feature or the second text feature. Specifically, after the image to be detected is input into the trained neural network, the network may be unable to confirm and output an accurate target from the existing detection data set, and therefore outputs a plurality of images that may contain the target to be detected; similarity calculation is then performed between the second image features of these images and the first text feature and second text feature, and the image corresponding to the second image feature with the maximum similarity value is the target to be detected. By associating text features with image features in this way, the manual labeling process is saved and detection efficiency is improved.
When the target to be detected is extracted according to the second text feature, the correspondence between the second text feature and the target to be detected is added to the detection data set; the new class of target and its second text feature thus become experience for the neural network to learn from, so that the same class of target can be extracted more quickly in later detections. The object detection device therefore reduces the detection cost and detection time of the object detection model and improves detection efficiency without manually labeling new classes.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The present application provides an electronic device 3, including: a processor 301 and a memory 302, the processor 301 and the memory 302 being interconnected and communicating with each other through a communication bus 303 and/or another form of connection mechanism (not shown). The memory 302 stores computer readable instructions executable by the processor 301; when the electronic device runs, the processor 301 executes the computer readable instructions to perform the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: acquiring an image to be detected and a guide image of the same kind as a target to be detected in the image to be detected; inputting the image to be detected into a trained neural network and extracting the target to be detected from the image to be detected, wherein the neural network comprises a detection data set, and the detection data set comprises a correspondence between a first text feature and a first image feature; the step of inputting the image to be detected into the trained neural network and extracting the target to be detected from the image to be detected comprises: generating a second text feature according to the guide image; extracting the target to be detected from the image to be detected according to the first text feature or the second text feature; and when the target to be detected is extracted according to the second text feature, adding the correspondence between the second text feature and the target to be detected to the detection data set.
An embodiment of the present application provides a computer readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, it performs the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: acquiring an image to be detected and a guide image of the same kind as a target to be detected in the image to be detected; inputting the image to be detected into a trained neural network and extracting the target to be detected from the image to be detected, wherein the neural network comprises a detection data set, and the detection data set comprises a correspondence between a first text feature and a first image feature; the step of inputting the image to be detected into the trained neural network and extracting the target to be detected from the image to be detected comprises: generating a second text feature according to the guide image; extracting the target to be detected from the image to be detected according to the first text feature or the second text feature; and when the target to be detected is extracted according to the second text feature, adding the correspondence between the second text feature and the target to be detected to the detection data set.
The computer readable storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), magnetic memory, flash memory, magnetic disk, or optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
Further, the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. An object detection method for a large visual model, characterized in that the method comprises the steps of:
acquiring an image to be detected and a guiding image of the same kind as a target to be detected in the image to be detected;
inputting the image to be detected into a neural network after training is completed, and extracting a target to be detected from the image to be detected, wherein the neural network comprises a detection data set, and the detection data set comprises a corresponding relation between a first text feature and a first image feature;
the step of inputting the image to be detected into the neural network after training and extracting the target to be detected from the image to be detected comprises the following steps:
generating a second text feature according to the guide image;
extracting a target to be detected from the image to be detected according to the first text feature or the second text feature;
and when the target to be detected is extracted according to the second text feature, adding the corresponding relation between the second text feature and the target to be detected into the detection data set.
2. The method for detecting an object of a large visual model according to claim 1, wherein the step of acquiring an image to be detected and a guide image of the same kind as an object to be detected in the image to be detected includes:
acquiring the image to be detected and inputting a third text feature according to the image to be detected;
acquiring an extraction result of extracting the target to be detected from the image to be detected according to the third text feature;
and when the extraction result is not matched with the expected result, acquiring a guide image of the same kind as the target to be detected in the image to be detected.
3. The object detection method for a large visual model according to claim 2, wherein, after the step of extracting a target to be detected from the image to be detected according to the first text feature or the second text feature, the method further comprises:
when the target to be detected is extracted according to the first text feature, marking the corresponding relation between the third text feature and the target to be detected;
and adding the corresponding relation between the third text feature and the target to be detected to the detection data set.
4. The object detection method for a large visual model according to claim 1, wherein the step of extracting a target to be detected from the image to be detected according to the first text feature or the second text feature comprises:
cropping the image to be detected to obtain a plurality of cropped region images;
acquiring second image features of the plurality of cropped region images;
and calculating similarity values of the second image features with the first text feature and the second text feature, and extracting the cropped region image corresponding to the second image feature with the maximum similarity value as the target to be detected.
5. The object detection method for a large visual model according to claim 4, wherein the step of cropping the image to be detected to obtain a plurality of cropped region images comprises:
acquiring size information and aspect ratio information of an initial detection frame according to the image to be detected;
generating the initial detection frame according to the size information and the aspect ratio information;
and cropping the image to be detected according to the initial detection frame to obtain the plurality of cropped region images.
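The region-generation step of claim 5 — building an initial detection frame from size and aspect-ratio information and using it to cut the image into candidate regions — could be sketched as follows. This is a minimal sketch assuming axis-aligned boxes slid over the image with a fixed stride; the stride parameter and the width/height parameterisation are assumptions not specified in the claim.

```python
import math

def generate_crop_boxes(img_w, img_h, sizes, aspect_ratios, stride):
    """Generate initial detection frames from size and aspect-ratio info,
    then slide each frame over the image to produce cropping regions.
    Each (size, ratio) pair yields a box with w*h == size*size and w/h == ratio."""
    boxes = []
    for size in sizes:
        for ratio in aspect_ratios:
            bw = size * math.sqrt(ratio)
            bh = size / math.sqrt(ratio)
            y = 0.0
            while y + bh <= img_h:
                x = 0.0
                while x + bw <= img_w:
                    boxes.append((x, y, x + bw, y + bh))
                    x += stride
                y += stride
    return boxes
```

Each returned box `(x1, y1, x2, y2)` would then be used to crop one candidate region image from the image to be detected.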
6. The method for detecting an object of a large visual model according to claim 1, wherein the step of inputting the image to be detected into a trained neural network and extracting the object to be detected from the image to be detected further comprises:
acquiring a base class image to be detected in the detection data set;
inputting the base class image to be detected into a text encoder, a prediction neural network and a pre-training neural network respectively, and correspondingly obtaining the first text feature, the first image feature and a third image feature;
calculating the similarity value of the third image feature and the first image feature, and obtaining the third image feature with the maximum similarity value;
calculating a cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function to obtain the trained neural network;
and inputting the image to be detected into the neural network after training.
7. The object detection method for a large visual model according to claim 6, wherein the step of calculating a cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function to obtain the trained neural network comprises:
calculating a loss function of the third image feature with the maximum similarity value and the first image feature;
and calculating a cross entropy function of the third image feature with the maximum similarity value and the first text feature, and training the pre-training neural network according to the cross entropy function and the loss function to obtain the trained neural network.
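The training objective of claims 6–7 combines a cross-entropy term (relating the third image feature to the first text feature) with a feature loss term (relating the third image feature to the first image feature). A minimal sketch follows, assuming the cross entropy is taken over similarity logits and the feature loss is mean squared error; both of these concrete choices, and the loss weight, are assumptions not stated in the claims.

```python
import math

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy of similarity logits against the index of the
    matching first text feature."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return -math.log(exps[target_idx] / sum(exps))

def feature_loss(third_feat, first_feat):
    """Mean squared error between the third image feature and the first
    image feature (the 'loss function' of claim 7, assumed to be MSE)."""
    return sum((a - b) ** 2 for a, b in zip(third_feat, first_feat)) / len(third_feat)

def total_loss(sim_logits, target_idx, third_feat, first_feat, weight=1.0):
    """Combined objective used to fine-tune the pre-training neural network."""
    return cross_entropy(sim_logits, target_idx) + weight * feature_loss(third_feat, first_feat)
```

In training, `total_loss` would be minimized over the base class images so that the pre-training neural network becomes the trained neural network used at detection time.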
8. An object detection device for a large visual model, characterized in that the device comprises:
an acquisition module, configured to acquire an image to be detected and a guide image of the same kind as a target to be detected in the image to be detected;
an input module, configured to input the image to be detected into a trained neural network and extract the target to be detected from the image to be detected, wherein the neural network comprises a detection data set, and the detection data set comprises a correspondence between a first text feature and a first image feature;
wherein the input module comprises a generation module and an extraction module:
the generation module, configured to generate a second text feature according to the guide image;
the extraction module, configured to extract the target to be detected from the image to be detected according to the first text feature or the second text feature;
and an adding module, configured to add the correspondence between the second text feature and the target to be detected to the detection data set when the target to be detected is extracted according to the second text feature.
9. An electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method of any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the method according to any of claims 1-7.
CN202310958995.1A 2023-08-01 2023-08-01 Object detection method and device for large visual model, electronic equipment and storage medium Pending CN116884017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310958995.1A CN116884017A (en) 2023-08-01 2023-08-01 Object detection method and device for large visual model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310958995.1A CN116884017A (en) 2023-08-01 2023-08-01 Object detection method and device for large visual model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116884017A true CN116884017A (en) 2023-10-13

Family

ID=88258609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310958995.1A Pending CN116884017A (en) 2023-08-01 2023-08-01 Object detection method and device for large visual model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116884017A (en)

Similar Documents

Publication Publication Date Title
CN110363102B (en) Object identification processing method and device for PDF (Portable document Format) file
CN110705405B (en) Target labeling method and device
CN108073910B (en) Method and device for generating human face features
CN109034069B (en) Method and apparatus for generating information
CN110135530B (en) Method and system for converting Chinese character font in image, computer device and medium
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
CN115063875B (en) Model training method, image processing method and device and electronic equipment
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN105117740A (en) Font identification method and device
CN110727816A (en) Method and device for determining interest point category
CN112926621A (en) Data labeling method and device, electronic equipment and storage medium
CN110532449B (en) Method, device, equipment and storage medium for processing service document
CN111860389A (en) Data processing method, electronic device and computer readable medium
CN113936195A (en) Sensitive image recognition model training method and device and electronic equipment
CN112232354A (en) Character recognition method, device, equipment and storage medium
CN105246149B (en) Geographical position identification method and device
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN110728316A (en) Classroom behavior detection method, system, device and storage medium
CN116110066A (en) Information extraction method, device and equipment of bill text and storage medium
CN116884017A (en) Object detection method and device for large visual model, electronic equipment and storage medium
CN115565193A (en) Questionnaire information input method and device, electronic equipment and storage medium
CN112232390B (en) High-pixel large image identification method and system
CN114387600A (en) Text feature recognition method and device, computer equipment and storage medium
CN114266308A (en) Detection model training method and device, and image detection method and device

Legal Events

Date Code Title Description
PB01 Publication