CN117876711A - Image target detection method, device, equipment and medium based on image processing - Google Patents


Info

Publication number
CN117876711A
CN117876711A (application CN202410275453.9A)
Authority
CN
China
Prior art keywords
image
feature
target
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410275453.9A
Other languages
Chinese (zh)
Inventor
Dong Fang
Shen Aoran
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinrui Tongchuang Beijing Technology Co ltd
Original Assignee
Jinrui Tongchuang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinrui Tongchuang Beijing Technology Co ltd filed Critical Jinrui Tongchuang Beijing Technology Co ltd
Priority to CN202410275453.9A priority Critical patent/CN117876711A/en
Publication of CN117876711A publication Critical patent/CN117876711A/en
Pending legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an image target detection method, apparatus, device, and medium based on image processing, relating to the technical field of image processing. The method comprises the following steps: collecting an original video, performing an image feature recognition task, and taking the image generated by that task as a first contour image; acquiring multi-scale features from the first contour image, dynamically aggregating the multi-scale features to generate a standard-scale feature map, generating a corresponding prediction probability map according to the standard-scale feature map, calculating a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map; generating a feature map sequence according to the two-dimensional spatial relationships of all target objects in the second contour image; and determining a target object in the original video through the feature map sequence. The scheme improves the accuracy of target detection through multi-level detection steps.

Description

Image target detection method, device, equipment and medium based on image processing
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image target detection method, apparatus, computer device, and medium based on image processing.
Background
Existing image-based target detection methods are generally weak in applicability and often fail when the target object is occluded by dust, steam, water droplets, or other substances. Meanwhile, images extracted from video frequently suffer from camera jitter, inaccurate focus, unfavorable shooting angles, insufficient sharpness, or distortion and tilting of the target object in the image. In these cases, the accuracy of detecting targets in the image is generally low, errors are large, and the requirements of the actual service cannot be met.
Therefore, a more intelligent target detection method is needed for improving the accuracy of image detection.
Disclosure of Invention
In view of the above, the embodiment of the invention provides an image target detection method based on image processing, so as to solve the technical problem of low target detection accuracy in the prior art. The method comprises the following steps:
acquiring an original video, acquiring a plurality of images from the original video at a preset time interval to form an original image set, inputting the original image set into a lightweight target detection model to perform an image feature recognition task, and taking the image generated by the image feature recognition task as a first contour image;
acquiring multi-scale features from the first contour image, dynamically aggregating the multi-scale features to generate a standard-scale feature map, generating a corresponding prediction probability map according to the standard-scale feature map, calculating a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map;
generating a feature map sequence in the second contour image according to the two-dimensional spatial relationship of all target objects in the second contour image;
and determining a target object in the original video through the feature map sequence.
The embodiment of the invention also provides an image target detection device based on image processing, which is used for solving the technical problem of low target detection accuracy in the prior art. The device comprises:
the first contour extraction module is used for acquiring an original video, acquiring a plurality of images from the original video at a preset time interval to form an original image set, inputting the original image set into a lightweight target detection model to perform an image feature recognition task, and taking the image generated by the image feature recognition task as a first contour image;
the second contour extraction module is used for acquiring multi-scale features from the first contour image, generating a standard scale feature map after dynamically aggregating the multi-scale features, generating a corresponding prediction probability map according to the standard scale feature map, calculating a feature threshold of the standard scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map;
the feature map sequence acquisition module is used for generating a feature map sequence in the second contour image according to the two-dimensional spatial relationship of all the target objects in the second contour image;
and the target object generation module is used for determining a target object in the original video through the feature map sequence.
The embodiment of the invention also provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the above image target detection methods based on image processing when executing the computer program, so as to solve the technical problem of low target detection accuracy in the prior art.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program for executing any of the above image target detection methods based on image processing, so as to solve the technical problem of low target detection accuracy in the prior art.
Compared with the prior art, the beneficial effects achievable by at least one of the technical schemes adopted in the embodiments of this description include at least the following:
the first contour image is extracted through a lightweight target detection model, which is more efficient and can first detect the approximate range of the target; the target range can then be further refined within that approximate range by means of the prediction probability map and the prediction threshold map corresponding to the first contour image; and the contour of the target object is extracted through its two-dimensional spatial relationships, which enhances the ability to detect targets with large-curvature bending and large-angle rotation. Together, these image processing steps improve the detection precision of the target object.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image object detection method based on image processing according to an embodiment of the present invention;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present invention;
fig. 3 is a block diagram of an image object detection apparatus based on image processing according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In an embodiment of the present invention, there is provided an image target detection method based on image processing, as shown in fig. 1, the method including:
step S101: acquiring an original video, acquiring a plurality of images from the original video at a preset time interval to form an original image set, inputting the original image set into a lightweight target detection model to perform an image feature recognition task, and taking the image generated by the image feature recognition task as a first contour image;
step S102: acquiring multi-scale features from the first contour image, dynamically aggregating the multi-scale features to generate a standard-scale feature map, generating a corresponding prediction probability map according to the standard-scale feature map, calculating a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map;
step S103: generating a feature map sequence in the second contour image according to the two-dimensional spatial relationship of all target objects in the second contour image;
step S104: and determining a target object in the original video through the feature map sequence.
In specific implementation, in order to efficiently process the large number of images in a video and to compensate for image blurring from various causes, the following steps are used to acquire a plurality of images from the original video at a preset time interval to form an original image set, input the original image set into a lightweight target detection model to perform an image feature recognition task, and take the image generated by that task as a first contour image:
dividing the original image set into a plurality of batches, randomly selecting several images from each batch to be scaled, cropped, and stitched to generate an enhanced image for that batch, thereby obtaining a plurality of enhanced images (equal in number to the batches), and taking the enhanced images as the data set of a neural network model; before each training of the neural network model, calculating the anchor frame size of the data set and taking it as a hyperparameter of the neural network model; slicing the enhanced image in the backbone network of the neural network to generate sub-feature maps, convolving the sub-feature maps to generate a plurality of object feature maps, and generating a plurality of predicted target frames from the object feature maps; in the post-processing stage of target detection, screening the predicted target frames with a Gaussian-weighted non-maximum suppression algorithm and, after removing redundant target frames, generating the screened target frames; and matching the anchor frame with the screened target frames, calculating a loss value between the anchor frame and the bounding box of each screened target frame through a loss function, and selecting the screened target frame with the minimum loss value as the first contour image.
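The Gaussian-weighted non-maximum suppression step above is described only at this level of detail; a minimal sketch, assuming the common soft-NMS formulation in which an overlapping box's score is decayed by a Gaussian of its IoU with the kept box rather than being zeroed outright, could look as follows (the `sigma` and `score_thresh` values are illustrative, not from the patent):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def gaussian_soft_nms(boxes, scores, sigma=0.5, score_thresh=0.1):
    """Gaussian-weighted NMS: decay overlapping scores by exp(-IoU^2 / sigma)
    instead of discarding boxes outright, then drop boxes whose decayed
    score falls below score_thresh."""
    scores = scores.astype(float).copy()
    keep, idx = [], np.arange(len(scores))
    while idx.size > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size == 0:
            break
        scores[idx] *= np.exp(-iou(boxes[best], boxes[idx]) ** 2 / sigma)
        idx = idx[scores[idx] > score_thresh]
    return keep
```

With these illustrative defaults, a frame heavily overlapping a higher-scoring frame survives with a reduced score unless it decays below the threshold, which matches the "remove redundant target frames" behaviour described above.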
In some embodiments, to enhance the image data at the input of the neural network model, the images are first scaled to the required size; after the color space is adjusted, four images are selected from the same batch and stitched together with random scaling, random cropping, and random arrangement, which improves detection of small targets. Meanwhile, at the input, the anchor frame size for the data set is computed adaptively: before each training run, the most suitable anchor frame size for the data set (used as a hyperparameter of the neural network) is calculated automatically, improving detection performance.
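The adaptive anchor computation is likewise not specified in the text; YOLO-family detectors typically cluster the ground-truth box sizes of the data set before training, so a sketch under that assumption (plain Euclidean k-means here; production code often clusters by 1 - IoU instead) might be:

```python
import numpy as np

def adaptive_anchors(wh, k=3, iters=20, seed=0):
    """Cluster ground-truth box (width, height) pairs with k-means to pick
    k anchor sizes for a data set. A stand-in for the adaptive anchor step;
    the distance metric and k are assumptions, not taken from the patent."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to its nearest anchor centre
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    # sort anchors by area for a stable small-to-large ordering
    return centers[np.argsort(centers.prod(axis=1))]
```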
Specifically, from the image set of the original video, which contains many large images, an image range possibly containing a detection target is extracted for the next stage of target detection. Because the number of images collected from video is usually huge, a lightweight target detection model is used for rapid bulk detection, saving time and improving efficiency.
In specific implementation, in order to detect targets in regions of different sizes and shapes, as well as irregular regions that are distorted or tilted, and thereby improve detection accuracy, the following steps obtain multi-scale features from the first contour image, dynamically aggregate them into a standard-scale feature map, generate a corresponding prediction probability map according to the standard-scale feature map, calculate a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generate a second contour image according to the prediction probability map and the prediction threshold map:
the first contour image is subjected to a backbone network model and a feature pyramid model to obtain multi-scale features; dynamically polymerizing the multi-scale features to generate a standard scale feature map; inputting the standard scale feature map into a convolution layer for prediction probability calculation, generating a probability value corresponding to each pixel point in the standard scale feature map, and generating a prediction probability map according to the probability value corresponding to each pixel point; using a head network to predict a characteristic threshold value of the standard scale characteristic map, and generating a corresponding prediction threshold value map according to the characteristic threshold value; binarizing the prediction probability map and the prediction threshold map to obtain a segmentation map, and obtaining a contour surrounding curve of the target object according to the segmentation map; and merging the contour surrounding curve of the target object into the first contour image to generate a second contour image.
Specifically, the second contour image is the same size as the first contour image. By extracting the second contour image from the first contour image, the larger contour range is further narrowed for further accurate target detection.
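The text does not give the formula that combines the prediction probability map and the prediction threshold map; the pixel-wise scheme below follows the differentiable-binarization idea, where a steep sigmoid of (P - T) approximates a per-pixel threshold test. The steepness `k` and the cutoff are assumptions:

```python
import numpy as np

def approximate_binarize(prob_map, thresh_map, k=50.0, cutoff=0.5):
    """Combine a prediction probability map P and a per-pixel prediction
    threshold map T into a binary segmentation map via the soft step
    1 / (1 + exp(-k * (P - T))), then cut at `cutoff`. The exact
    binarization used by the patent is not spelled out."""
    soft = 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
    return (soft >= cutoff).astype(np.uint8)
```

The resulting segmentation map would then be traced with a contour-following routine to obtain the contour surrounding curve that is merged into the first contour image.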
In specific implementation, in order to aggregate the features dynamically, the multi-scale features are dynamically aggregated into a standard-scale feature map through the following steps:
the multi-scale features are concatenated to generate a combined feature sequence; the attention-mechanism weight of each feature in the combined feature sequence is calculated by a spatial attention model; and the weights are multiplied with the multi-scale features, which are then aggregated to generate the standard-scale feature map.
Specifically, shallow features, i.e., large-scale feature maps, capture fine detail and small target objects, while deep features, i.e., small-scale feature maps, capture large target objects and global information. Connecting the multi-scale features therefore fuses feature maps of different scales, and exploiting all of them strengthens target detection accuracy. The aggregation method of this embodiment computes the importance of features at different scales and positions and aggregates them dynamically.
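As a concrete illustration of this dynamic aggregation, the sketch below resizes every feature level to a standard scale, scores each level per pixel, and takes the softmax-weighted sum; the per-level scalar weights `w_att` stand in for the learned spatial-attention parameters, which the patent does not detail:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_multiscale(features, w_att):
    """Dynamically aggregate multi-scale feature maps into one standard-scale
    map: resize every level to the first level's size (nearest neighbour),
    score each level per pixel, and take the softmax-weighted sum."""
    h, w = features[0].shape
    resized = []
    for f in features:
        # nearest-neighbour resize to the standard scale
        yi = np.arange(h) * f.shape[0] // h
        xi = np.arange(w) * f.shape[1] // w
        resized.append(f[np.ix_(yi, xi)])
    stack = np.stack(resized)                       # (levels, h, w)
    scores = stack * np.asarray(w_att)[:, None, None]
    attn = softmax(scores, axis=0)                  # per-pixel weight per level
    return (attn * stack).sum(axis=0)
```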
In specific implementation, in order to model the two-dimensional spatial relationships of the target objects through a self-attention model, the following steps generate the feature map sequence in the second contour image according to the two-dimensional spatial relationships of all the target objects in the second contour image:
the second contour image is input into a convolutional neural network to extract visual features, and a target feature map is generated from the visual features, the target feature map having the same dimensions as the second contour image; the target feature map is converted into a feature map sequence; determining a target object in the original video through the feature map sequence then comprises the following steps: each element of the feature map sequence is encoded sequentially through a convolution layer, a depthwise convolution layer, and another convolution layer to generate its encoding, the encodings of all elements forming an encoded feature vector of preset length; and the encoded feature vector is input into a probability distribution function, with the object of largest output probability determined to be the target object in the original video.
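The convolution / depthwise convolution / convolution encoding of each sequence element can be sketched in plain NumPy as below; all weights are placeholders, since the text does not specify kernel sizes or channel counts (a 3x3 depthwise layer between two 1x1 pointwise layers is assumed):

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: x is (h, w, c_in), w is (c_in, c_out)."""
    return x @ w

def depthwise3x3(x, k):
    """Depthwise 3x3 convolution with zero padding: k is (3, 3, c), one
    3x3 kernel per channel, no cross-channel mixing."""
    h, w, c = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, c))
    for i in range(3):
        for j in range(3):
            out += pad[i:i + h, j:j + w, :] * k[i, j, :]
    return out

def encode_element(x, w_in, k_dw, w_out):
    """Conv -> depthwise conv -> conv encoding of one feature-map element,
    mirroring the three-layer scheme in the text; weights are placeholders."""
    return conv1x1(depthwise3x3(conv1x1(x, w_in), k_dw), w_out)
```

The depthwise layer mixes spatial context within each channel cheaply, and the two pointwise layers mix channels, which is the usual motivation for this sandwich.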
In particular, in order to convert the target feature map into a feature map sequence for output to the self-attention decoder, the conversion is achieved through the following steps:
the sine code and cosine code of each pixel in the target feature map are calculated; the sine and cosine codes are weighted and superimposed to generate an adaptive two-dimensional position code for each pixel; and the pixels of the target feature map are converted into a feature map sequence according to the adaptive two-dimensional position codes of all pixels.
Specifically, when the second contour image is input into the convolutional neural network and visual features are extracted, a downsampling operation can be performed at the same time to reduce the size of the feature map and lighten the computational load of the subsequent modeling among the pixels of the target feature map.
Specifically, because spatial position relationships exist among the pixels of the target feature map, the weighted superposition of sine and cosine codes preserves these spatial relationships as position codes so that they can be extracted and exploited.
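A sketch of the weighted sine-cosine superposition and the flattening into a sequence, with the mixing weights `alpha` and `beta` assumed to be scalars (learned values in an adaptive variant; the patent leaves the exact weighting unspecified):

```python
import numpy as np

def adaptive_2d_position_encoding(h, w, dim, alpha=0.5, beta=0.5):
    """Weighted superposition of sine and cosine codes for each pixel of an
    h x w feature map, giving an (h, w, 2*dim) position code: half the
    channels encode the row index, half the column index."""
    freqs = 1.0 / (10000 ** (np.arange(dim) / dim))        # (dim,)
    ys = np.arange(h)[:, None] * freqs[None, :]             # (h, dim)
    xs = np.arange(w)[:, None] * freqs[None, :]             # (w, dim)
    # weighted sum of sine and cosine components per axis
    row = alpha * np.sin(ys) + beta * np.cos(ys)
    col = alpha * np.sin(xs) + beta * np.cos(xs)
    return np.concatenate(
        [np.broadcast_to(row[:, None, :], (h, w, dim)),
         np.broadcast_to(col[None, :, :], (h, w, dim))], axis=2)

def to_sequence(feature_map, pe):
    """Flatten an (h, w, c) feature map into an (h*w, c + pe_dim) sequence
    by concatenating each pixel's features with its position code."""
    h, w, c = feature_map.shape
    return np.concatenate([feature_map, pe], axis=2).reshape(h * w, -1)
```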
In specific implementation, in order to improve the accuracy of target identification, the identified target object is verified by the following steps:
if the target object is a fixed data set, a standard data set distribution is constructed, and the correctness of the fixed data set is verified through the standard data set distribution, wherein the standard data set distribution comprises the arrangement rule corresponding to the fixed data set; if the target object is a word, a word dictionary is constructed, and the correctness of the target object is verified by matching the word against the word dictionary.
Specifically, in order to further improve the accuracy of target detection, multiple verification methods may be selected to verify the detection result according to the actual situation. If an incorrect detection result occurs, the accuracy of the detection model can be improved by detecting again, updating the hyper-parameters of the detection model or updating the verification set.
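A minimal sketch of the two verification paths, with a regular expression standing in for the "arrangement rule" of the standard data set distribution (the patent does not define the rule's concrete form, so both the regex and the dictionary here are illustrative):

```python
import re

def verify_target(label, word_dict=None, allowed_patterns=None):
    """Post-detection verification: check a recognized word against a
    dictionary, or a fixed data string against its expected arrangement
    rule, here expressed as one of several full-match regex patterns."""
    if word_dict is not None:
        return label in word_dict
    if allowed_patterns is not None:
        return any(re.fullmatch(p, label) for p in allowed_patterns)
    return False
```

A failed check would trigger re-detection or a hyperparameter update, as described above.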
In this embodiment, a computer device is provided, as shown in fig. 2, including a memory 201, a processor 202, and a computer program stored on the memory and executable on the processor, where the processor implements any of the above image target detection methods based on image processing when executing the computer program.
In particular, the computer device may be a computer terminal, a server or similar computing means.
In the present embodiment, there is provided a computer-readable storage medium storing a computer program that performs any of the above-described image object detection methods based on image processing.
In particular, computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable storage media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Based on the same inventive concept, the embodiment of the invention also provides an image target detection device based on image processing, as described in the following embodiment. Since the principle of solving the problem of the image target detection device based on the image processing is similar to that of the image target detection method based on the image processing, the implementation of the image target detection device based on the image processing can be referred to the implementation of the image target detection method based on the image processing, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of an image object detection apparatus based on image processing according to an embodiment of the present invention, as shown in fig. 3, including: the first contour extraction module 301, the second contour extraction module 302, the feature map sequence generation module 303, and the target object generation module 304 are described below.
The first contour extraction module 301 is configured to collect an original video, acquire a plurality of images from the original video at a preset time interval to form an original image set, input the original image set into a lightweight target detection model to perform an image feature recognition task, and use the image generated by the image feature recognition task as a first contour image;
the second contour extraction module 302 is configured to obtain multi-scale features from the first contour image, dynamically aggregate the multi-scale features to generate a standard-scale feature map, generate a corresponding prediction probability map according to the standard-scale feature map, calculate a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generate a second contour image according to the prediction probability map and the prediction threshold map;
a feature map sequence generating module 303, configured to generate a feature map sequence in the second contour image according to the two-dimensional spatial relationships of all the target objects in the second contour image;
the target object generating module 304 is configured to determine a target object in the original video through the feature map sequence.
In one embodiment, the first contour extraction module includes:
the image enhancement processing unit is used for dividing the original image set into a plurality of batches, randomly selecting several images from each batch to be scaled, cropped, and stitched so as to generate the enhanced image corresponding to each batch, thereby obtaining a plurality of enhanced images, which are taken as the data set of the neural network model, wherein the number of enhanced images is the same as the number of batches;
the anchor frame size calculation unit is used for calculating the anchor frame size of the data set before training the neural network model each time, and taking the anchor frame size as the super parameter of the neural network model;
the prediction target frame generation unit is used for generating a sub-feature image by slicing the enhanced image in a backbone network of the neural network, generating a plurality of object feature images by convolution operation on the generated sub-feature image, and generating a plurality of prediction target frames according to the plurality of object feature images;
the target frame screening unit is used for screening a plurality of predicted target frames by adopting a Gaussian weighted non-maximum suppression algorithm in the post-processing process of target detection, and generating a screened target frame after removing redundant target frames;
and the first contour image generation unit is used for matching the anchor frame with the screened target frames, calculating the loss value between the anchor frame and the boundary frame of each screened target frame through a loss function, and selecting the screened target frame with the minimum loss value as the first contour image.
In one embodiment, the second contour extraction module includes:
the scale feature acquisition unit is used for acquiring multi-scale features of the first contour image through the backbone network model and the feature pyramid model;
the dynamic aggregation unit is used for dynamically aggregating the multi-scale features to generate a standard scale feature map;
the prediction probability feature map generation unit is used for inputting the standard scale feature map into the convolution layer to perform prediction probability calculation, generating probability values corresponding to each pixel point in the standard scale feature map, and generating a prediction probability map according to the probability values corresponding to each pixel point;
the prediction threshold value diagram generating unit is used for predicting the characteristic threshold value of the standard scale characteristic diagram by using the head network and generating a corresponding prediction threshold value diagram according to the characteristic threshold value;
the surrounding curve acquisition unit is used for carrying out binarization processing on the prediction probability map and the prediction threshold map to obtain a segmentation map, and obtaining a contour surrounding curve of the target object according to the segmentation map;
and the second contour image generating unit is used for merging the contour surrounding curve of the target object into the first contour image to generate a second contour image.
In one embodiment, a dynamic aggregation unit is used for connecting the multi-scale features together to generate a combined feature sequence; and calculating the attention mechanism weight of each feature in the combined feature sequence through the spatial attention model, multiplying the attention mechanism weight by the multi-scale features, and then aggregating the multiplied attention mechanism weight and the multi-scale features to generate a standard scale feature map.
In one embodiment, the feature map sequence generation module includes:
the target feature map generating unit is used for inputting the second contour image into the convolutional neural network, extracting visual features and generating a target feature map according to the visual features, wherein the dimension of the target feature map is the same as the dimension of the second contour image;
the feature map sequence conversion unit is used for converting the target feature map into a feature map sequence;
the coding feature vector generation unit is used for encoding each element in the feature map sequence sequentially through a convolution layer, a depthwise convolution layer, and another convolution layer to generate the encoding of each element, the encodings of all elements in the feature map sequence forming an encoded feature vector of preset length;
and the target object determining unit is used for inputting the coding feature vector into the probability distribution function and determining the object with the largest output probability as the target object in the original video.
In one embodiment, the feature map sequence conversion unit is used for calculating the sine code and cosine code of each pixel in the target feature map; weighting and superimposing the sine and cosine codes to generate an adaptive two-dimensional position code for each pixel; and converting the pixels of the target feature map into a feature map sequence according to the adaptive two-dimensional position codes of all pixels.
In one embodiment, the apparatus further comprises:
and the data verification module is used for constructing standard data set distribution and a word dictionary for verifying the correctness of the target object.
In one embodiment, a data verification module includes:
the data set verification unit is used for constructing a standard data set distribution if the target object is a fixed data set, and verifying the correctness of the fixed data set through the standard data set distribution, wherein the standard data set distribution comprises an arrangement rule corresponding to the fixed data set;
and the word verification unit is used for constructing a word dictionary if the target object is a word, and verifying the correctness of the target object by matching the word against the word dictionary.
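The two verification branches can be sketched in a few lines. Representing the arrangement rule as a regular expression and the dictionary as a set are assumptions for illustration; the example pattern and words are hypothetical.

```python
import re

def verify_fixed_dataset(value, pattern):
    """Fixed data sets are checked against an arrangement rule,
    expressed here (as an assumption) as a regular expression."""
    return re.fullmatch(pattern, value) is not None

def verify_word(word, dictionary):
    """Words are checked by dictionary lookup; a miss signals that
    target detection should be performed again."""
    return word in dictionary
```

For example, a rule like `[A-Z]{2}-\d{4}` accepts `"AB-1234"` but rejects `"AB1234"`, and an out-of-dictionary string such as `"st0p"` would trigger re-detection.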
The embodiment of the invention realizes the following technical effects:
extracting a first contour image through a light target detection model, that is, extracting from the original video (which contains a large number of images and many targets) the image regions likely to contain a detection target; the light target detection model is faster and more efficient, and can first detect the approximate range of the target; generating a nearly binary segmentation map from the prediction probability map and the prediction threshold map corresponding to the first contour image, which improves the accuracy of target object identification and refines the target range within the approximate range; extracting the contour of the target object through its two-dimensional spatial relationship, modeling the 2D spatial relationship of the target object, and introducing convolution layers into the neural network, which enhances the capture of global and local features and the recognition of targets with large-curvature bending and large-angle rotation, so that bent, rotated or partially occluded targets can still be accurately recognized, improving practicability in real environments; and constructing a standard data set distribution and a word dictionary to verify the correctness of the target object, so that target detection can be performed again when an error is detected, effectively improving the accuracy of target detection.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; optionally, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in a different order; alternatively, they may be fabricated separately as individual integrated circuit modules, or a plurality of the modules or steps may be fabricated as a single integrated circuit module. Thus, the embodiments of the invention are not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the embodiments of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An image target detection method based on image processing, comprising:
acquiring an original video, acquiring a plurality of images from the original video according to a preset time interval to form an original image set, inputting the original image set into a light target detection model to perform an image feature recognition task, and taking an image generated by the image feature recognition task as a first contour image;
acquiring multi-scale features from the first contour image, dynamically aggregating the multi-scale features to generate a standard scale feature image, generating a corresponding prediction probability image according to the standard scale feature image, calculating a feature threshold of the standard scale feature image to generate a prediction threshold image, and generating a second contour image according to the prediction probability image and the prediction threshold image;
generating a feature map sequence in the second contour image according to the two-dimensional spatial relationship of all target objects in the second contour image;
and determining a target object in the original video through the feature map sequence.
2. The image processing-based image object detection method according to claim 1, wherein acquiring a plurality of images from the original video at a preset time interval to form an original image set, inputting the original image set into a light object detection model for image feature recognition task, and taking an image generated by the image feature recognition task as a first contour image, comprising:
dividing the original image set into a plurality of batches, randomly selecting a plurality of images from each batch for scaling, cropping and splicing to generate an enhanced image corresponding to each batch, thereby obtaining a plurality of enhanced images, and taking the enhanced images as a data set of a neural network model, wherein the number of the enhanced images is the same as the number of batches;
before each training of the neural network model, calculating the anchor frame size of the data set, and taking the anchor frame size as a hyperparameter of the neural network model;
slicing the enhanced image in a backbone network of the neural network to generate a sub-feature map, convolving the generated sub-feature map to generate a plurality of object feature maps, and generating a plurality of prediction target frames according to the plurality of object feature maps;
in the post-processing process of target detection, screening the plurality of predicted target frames by a Gaussian-weighted non-maximum suppression algorithm, and removing redundant target frames to generate screened target frames;
and matching the anchor frame with the screened target frames, calculating a loss value between the anchor frame and the bounding box of each screened target frame through a loss function, and selecting the screened target frame with the minimum loss value as the first contour image.
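The Gaussian-weighted non-maximum suppression of claim 2 corresponds to the soft-NMS idea: instead of deleting overlapping boxes outright, their scores are decayed by a Gaussian of the overlap. A minimal sketch, assuming `sigma` and `score_thresh` values that are illustrative only:

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def gaussian_soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian-weighted NMS: keep the highest-scoring box, decay the
    scores of the remaining boxes by exp(-iou^2 / sigma), and discard
    boxes whose score falls below score_thresh as redundant."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        best = boxes.pop(m)
        keep.append((best, scores.pop(m)))
        decayed = [(b, s * math.exp(-iou(best, b) ** 2 / sigma))
                   for b, s in zip(boxes, scores)]
        boxes = [b for b, s in decayed if s >= score_thresh]
        scores = [s for b, s in decayed if s >= score_thresh]
    return keep
```

A fully overlapping duplicate box is thus strongly down-weighted rather than removed, while a distant box keeps its original score.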
3. The image processing-based image object detection method according to claim 1, wherein acquiring multi-scale features from the first contour image, dynamically aggregating the multi-scale features to generate a standard-scale feature map, generating a corresponding prediction probability map according to the standard-scale feature map, calculating a feature threshold of the standard-scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map, comprising:
obtaining multi-scale features of the first contour image through a backbone network model and a feature pyramid model;
dynamically aggregating the multi-scale features to generate the standard scale feature map;
inputting the standard scale feature map into a convolution layer for prediction probability calculation, generating a probability value corresponding to each pixel point in the standard scale feature map, and generating the prediction probability map according to the probability value corresponding to each pixel point;
predicting a feature threshold value of the standard scale feature map by using a head network, and generating a corresponding prediction threshold map according to the feature threshold value;
binarizing the prediction probability map and the prediction threshold map to obtain a segmentation map, and obtaining a contour surrounding curve of a target object according to the segmentation map;
and merging the contour surrounding curve of the target object into the first contour image to generate a second contour image.
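Combining a prediction probability map with a per-pixel prediction threshold map, as in claim 3, matches the differentiable binarization used in DB-style detectors. A minimal sketch of that combination (the steepness factor `k` is an assumed value, not from the patent):

```python
import math

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))):
    pixels whose predicted probability P exceeds the predicted
    per-pixel threshold T are pushed toward 1, others toward 0."""
    return [[1.0 / (1.0 + math.exp(-k * (p - t)))
             for p, t in zip(prow, trow)]
            for prow, trow in zip(prob_map, thresh_map)]
```

Because the map is smooth rather than a hard step, it can sit inside the training loss, while at inference it behaves almost like a hard binary segmentation map from which contour surrounding curves are traced.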
4. The image processing-based image object detection method according to claim 3, wherein dynamically aggregating the multi-scale features to generate a standard-scale feature map comprises:
connecting the multi-scale features together to generate a combined feature sequence;
and calculating an attention mechanism weight for each feature in the combined feature sequence through a spatial attention model, multiplying each attention weight by the corresponding multi-scale feature, and aggregating the weighted features together to generate the standard scale feature map.
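The attention-weighted aggregation of claim 4 can be sketched as a softmax over per-scale scores followed by a weighted sum. This is a simplified per-scale (rather than per-pixel) illustration under that assumption; the scores would come from the spatial attention model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def aggregate_multiscale(features, scores):
    """Weight each scale's feature vector by its softmaxed attention
    score and sum them into a single standard-scale vector.
    features: one vector per scale (all the same length)."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]
```

With equal scores the result is the plain average of the scales, which is a convenient sanity check.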
5. The image processing-based image object detection method according to claim 1, wherein generating a feature map sequence in the second contour image from a two-dimensional spatial relationship of all target objects in the second contour image includes:
inputting the second contour image into a convolutional neural network, extracting visual features, and generating a target feature map according to the visual features, wherein the dimension of the target feature map is the same as the dimension of the second contour image;
converting the target feature map into a feature map sequence;
determining a target object in the original video through the feature map sequence, wherein the method comprises the following steps of:
coding each element in the feature map sequence sequentially through a convolution layer, a depthwise convolution layer and a convolution layer to generate a code for each element, wherein the codes of all the elements in the feature map sequence form a coding feature vector of a preset length;
and inputting the coding feature vector into a probability distribution function, and determining an object with the largest output probability as a target object in the original video.
6. The image processing-based image object detection method according to claim 5, wherein converting the object feature map into a feature map sequence includes:
respectively calculating a sine code and a cosine code for each pixel point in the target feature map;
superposing the sine code and the cosine code with weights to generate an adaptive two-dimensional position code corresponding to each pixel point;
and converting the pixel points of the target feature map into a feature map sequence according to the adaptive two-dimensional position codes of all the pixel points.
7. The image processing-based image object detection method according to any one of claims 1 to 6, further comprising:
if the target object is a fixed data set, constructing standard data set distribution, and verifying the correctness of the fixed data set through the standard data set distribution, wherein the standard data set distribution comprises an arrangement rule corresponding to the fixed data set;
if the target object is a word, a word dictionary is constructed, and the correctness of the target object is verified by matching the word with the word dictionary.
8. An image object detection apparatus based on image processing, comprising:
the first contour extraction module is used for acquiring an original video, acquiring a plurality of images from the original video according to a preset time interval to form an original image set, inputting the original image set into the light target detection model to perform an image feature recognition task, and taking an image generated by the image feature recognition task as a first contour image;
the second contour extraction module is used for acquiring multi-scale features from the first contour image, generating a standard scale feature map after dynamic aggregation of the multi-scale features, generating a corresponding prediction probability map according to the standard scale feature map, calculating a feature threshold of the standard scale feature map to generate a prediction threshold map, and generating a second contour image according to the prediction probability map and the prediction threshold map;
the feature map sequence generation module is used for generating a feature map sequence in the second contour image according to the two-dimensional spatial relationship of all target objects in the second contour image;
and the target object generation module is used for determining a target object in the original video through the feature map sequence.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image object detection method based on image processing according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that performs the image object detection method based on image processing according to any one of claims 1 to 7.
CN202410275453.9A 2024-03-12 2024-03-12 Image target detection method, device, equipment and medium based on image processing Pending CN117876711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410275453.9A CN117876711A (en) 2024-03-12 2024-03-12 Image target detection method, device, equipment and medium based on image processing


Publications (1)

Publication Number Publication Date
CN117876711A (en) 2024-04-12

Family

ID=90595137



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027538A (en) * 2019-08-23 2020-04-17 上海撬动网络科技有限公司 Container detection method based on instance segmentation model
CN111027372A (en) * 2019-10-10 2020-04-17 山东工业职业学院 Pedestrian target detection and identification method based on monocular vision and deep learning
US20210158549A1 (en) * 2019-11-25 2021-05-27 Insurance Services Office, Inc. Computer Vision Systems and Methods for Unsupervised Learning for Progressively Aligning Noisy Contours
CN115761393A (en) * 2022-10-18 2023-03-07 北京航空航天大学 Anchor-free target tracking method based on template online learning
CN116310780A (en) * 2022-11-21 2023-06-23 北京理工大学 Optical remote sensing image ship target detection method in any direction based on contour modeling



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination