CN116630292A - Target detection method, target detection device, electronic device, and storage medium - Google Patents

Target detection method, target detection device, electronic device, and storage medium

Info

Publication number
CN116630292A
Authority
CN
China
Prior art keywords
image
feature
target
features
vector
Legal status
Pending
Application number
CN202310674299.8A
Other languages
Chinese (zh)
Inventor
瞿晓阳
王健宗
吴建汉
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310674299.8A
Publication of CN116630292A

Classifications

    • G06T 7/0012 Biomedical image inspection
    • G06T 3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06T 7/11 Region-based segmentation
    • G06V 10/40 Extraction of image or video features
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06T 2200/32 Indexing scheme for image data processing or generation, in general, involving image mosaicing

Abstract

The application provides a target detection method, a target detection device, an electronic device, and a storage medium, belonging to the field of digital healthcare. The method comprises the following steps: segmenting a target image to obtain local images; mapping all the local images to a preset vector space to obtain an image embedding vector for each local image; performing feature extraction on the image embedding vectors to obtain target image features; performing feature sampling on the target image features to obtain first image features, second image features, and third image features, where the feature resolutions of the first image features, the second image features, the target image features, and the third image features increase in that order; performing feature fusion on the first image features, the second image features, the target image features, and the third image features to obtain fused image features; and performing position detection and category detection on the target image based on the fused image features to obtain the position data and object category of the target object, thereby improving the accuracy of target detection.

Description

Target detection method, target detection device, electronic device, and storage medium
Technical Field
The present application relates to the field of digital medical technology, and in particular, to a target detection method, a target detection device, an electronic apparatus, and a storage medium.
Background
Target detection is a technique for detecting and identifying targets in image sequences that contain them. It is a prerequisite for many high-level visual processing and analysis tasks, with applications including intelligent video surveillance, robot navigation, and lesion detection in medical images.
In practical applications, the complexity of the scene in which the target is located, together with poor imaging quality, occlusion, illumination changes, scale variation, and the like, increases the difficulty of the detection task. In medical imaging in particular, conventional RGB imaging has obvious shortcomings and needs to be supplemented with image information from other modalities, such as ultrasound imaging and multispectral imaging.
Current target detection systems are often built on hand-designed features and hand-designed model architectures. Designing such features typically requires practitioners with rich knowledge and experience, and the design process for the model architecture is complex, which often limits the detection accuracy of the system. How to improve target detection accuracy has therefore become an urgent technical problem to be solved.
Disclosure of Invention
The embodiments of the application mainly aim to provide a target detection method, a target detection device, an electronic device, and a storage medium, so as to improve the detection accuracy of the position and category of a target object.
To achieve the above object, a first aspect of an embodiment of the present application provides a target detection method, including:
acquiring a target image;
dividing the target image to obtain at least two partial images;
mapping all the partial images to a preset vector space to obtain an image embedding vector of each partial image;
extracting features of the image embedded vector to obtain target image features;
performing feature sampling on the target image features to obtain first image features, second image features and third image features, wherein the feature resolutions of the first image features, the second image features, the target image features and the third image features increase in that order;
performing feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
performing object position detection on the target image based on the fused image features to obtain position data of a target object, wherein the position data are used for representing the position of the target object in the target image;
and performing object category detection on the target image based on the fused image features to obtain the object category of the target object.
In some embodiments, mapping all the partial images to a preset vector space to obtain an image embedded vector of each partial image includes:
performing position coding processing on the local image to obtain a position coding feature corresponding to the local image;
flattening the partial image to obtain a preliminary embedding vector;
performing feature addition on the preliminary embedded vector and the position coding feature to obtain an intermediate embedded vector;
and mapping the intermediate embedded vector to the vector space to obtain the image embedded vector.
In some embodiments, the feature extraction of the image embedding vector to obtain a target image feature includes:
inputting the image embedding vector into a preset feature extraction network, wherein the feature extraction network comprises a first standardization layer, an attention layer, a second standardization layer and an MLP layer;
splicing the image embedded vectors to obtain a fusion embedded vector;
performing first normalization processing on the fusion embedded vector based on the first normalization layer to obtain a preliminary image coding feature;
performing attention calculation on the preliminary image coding features based on the attention layer to obtain intermediate image coding features;
performing second normalization processing on the intermediate image coding features based on the second normalization layer to obtain target image coding features;
and performing feature mapping on the target image coding features based on the MLP layer to obtain the target image features.
In some embodiments, the performing attention calculation on the preliminary image coding feature based on the attention layer to obtain an intermediate image coding feature includes:
affine transformation is carried out on the preliminary image coding features based on the attention layer, so that an image key vector, an image value vector and an image query vector are obtained;
performing attention calculation based on the image key vector, the image value vector and the image query vector to obtain key image coding features;
and splicing the image embedded vector and the key image coding feature to obtain the intermediate image coding feature.
In some embodiments, the feature sampling the target image feature to obtain a first image feature, a second image feature, and a third image feature includes:
performing first downsampling processing on the target image features to obtain the first image features;
performing second downsampling processing on the target image features to obtain second image features;
and carrying out up-sampling processing on the target image feature to obtain the third image feature.
In some embodiments, the performing object position detection on the target image based on the fused image features to obtain position data of the target object includes:
acquiring a plurality of initial anchor frames and initial position data of each initial anchor frame based on the fused image characteristics;
performing offset calculation on the initial anchor frame to obtain offset data of the initial anchor frame;
performing position adjustment on the initial anchor frame based on the offset data to obtain intermediate anchor frames and intermediate position data of each intermediate anchor frame;
and screening the intermediate position data to obtain the position data of the target object.
In some embodiments, the performing object class detection on the target image based on the fused image features to obtain an object class of the target object includes:
performing class scoring on a target object of the target image based on a preset function and the fused image features to obtain target scoring data, wherein the target scoring data is used for representing the probability that the object class of the target object belongs to each preset candidate object class;
and screening the object category from the candidate object categories based on the target scoring data.
To achieve the above object, a second aspect of an embodiment of the present application provides an object detection apparatus, including:
the image acquisition module is used for acquiring a target image;
the image segmentation module is used for carrying out segmentation processing on the target image to obtain at least two partial images;
the image mapping module is used for mapping all the local images to a preset vector space to obtain an image embedded vector of each local image;
the feature extraction module is used for extracting features of the image embedded vector to obtain target image features;
the feature sampling module is used for carrying out feature sampling on the target image features to obtain first image features, second image features and third image features, wherein feature resolutions of the first image features, the second image features, the target image features and the third image features are sequentially increased;
the feature fusion module is used for carrying out feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
the position detection module is used for detecting the object position of the target image based on the fusion image characteristics to obtain position data of the target object, wherein the position data are used for representing the position of the target object in the target image;
and the class detection module is used for detecting the object class of the target image based on the fusion image characteristics to obtain the object class of the target object.
To achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
The application provides a target detection method, a target detection device, an electronic device, and a storage medium. A target image is acquired and segmented into at least two partial images, and all the partial images are mapped to a preset vector space to obtain an image embedding vector for each partial image. In this way a plurality of image embedding vectors containing both position information and partial image information are obtained, which improves the feature quality of the image embedding vectors. Further, feature extraction is performed on the image embedding vectors to obtain target image features; feature sampling is performed on the target image features to obtain first, second, and third image features, where the feature resolutions of the first image features, the second image features, the target image features, and the third image features increase in that order; and feature fusion is performed on the first image features, the second image features, the target image features, and the third image features to obtain fused image features. This realizes multi-scale sampling of the target image features, yields image features at different scales, and improves the feature richness of the image features. Finally, object position detection is performed on the target image based on the fused image features to obtain position data of the target object, the position data representing the position of the target object in the target image, and object category detection is performed on the target image based on the fused image features to obtain the object category of the target object. The detection accuracy of the position and category of the target object is thereby improved, so that lesion regions and lesion categories in medical images can be detected effectively, the disease information corresponding to a medical image can be identified in a targeted manner, and the auxiliary identification of diseases can be improved.
Drawings
FIG. 1 is a flowchart of a target detection method provided by an embodiment of the present application;
FIG. 2 is a flowchart of step S103 in FIG. 1;
FIG. 3 is a flowchart of step S104 in FIG. 1;
FIG. 4 is a flowchart of step S304 in FIG. 3;
FIG. 5 is a flowchart of step S105 in FIG. 1;
FIG. 6 is a flowchart of step S107 in FIG. 1;
FIG. 7 is a flowchart of step S108 in FIG. 1;
FIG. 8 is a schematic diagram of a target detection apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional modules are divided in the device schematic and a logical order is shown in the flowchart, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that shown in the flowchart. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily intended to describe a particular sequence or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
First, several terms involved in the present application are explained:
Artificial intelligence (AI): a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics, concerned with processing, understanding, and applying human languages (e.g., Chinese, English). Natural language processing includes syntactic analysis, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and related fields, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research on language computation, and the like.
Information extraction (IE): a text processing technique that extracts fact information of specified types, such as entities, relations, and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units, such as sentences, paragraphs, and chapters, and text information is composed of specific units such as words, phrases, sentences, and paragraphs, or combinations of such units. Extracting noun phrases, person names, place names, and the like from text data is text information extraction, and the information extracted by text information extraction techniques can be of various types.
Computed tomography (CT): uses precisely collimated X-ray beams, gamma rays, ultrasonic waves, and the like, together with detectors of extremely high sensitivity, to scan cross sections of a part of the human body one by one. It has the advantages of fast scanning and clear images and can be used to examine a variety of diseases. CT can be classified according to the type of radiation used: X-ray CT (X-CT), gamma-ray CT (γ-CT), and so on.
Object detection: the task of object detection is to find all objects of interest in an image and determine their categories and locations; it is one of the core problems in computer vision. The core problems of object detection fall into four classes: (1) the classification problem, i.e., which category the image (or a certain region of it) belongs to; (2) the localization problem, i.e., the target may appear anywhere in the image; (3) the size problem, i.e., targets come in various sizes; and (4) the shape problem, i.e., targets may have various shapes. Detection algorithms mainly fall into two families: the RCNN series, representative of region-based (two-stage) detection, and the YOLO series, representative of regression-based (single-stage) detection.
Softmax function: the Softmax function is a normalized exponential function.
Target detection is a technique for detecting and identifying targets in image sequences that contain them. It is a prerequisite for many high-level visual processing and analysis tasks, with applications including intelligent video surveillance, robot navigation, and lesion detection in medical images.
In practical applications, the complexity of the scene in which the target is located, together with poor imaging quality, occlusion, illumination changes, scale variation, and the like, increases the difficulty of the detection task. In medical imaging in particular, conventional RGB imaging has obvious shortcomings and needs to be supplemented with image information from other modalities, such as ultrasound imaging and multispectral imaging.
Current target detection systems are often built on hand-designed features and hand-designed model architectures. Designing such features typically requires practitioners with rich knowledge and experience, and the design process for the model architecture is complex, which often limits the detection accuracy of the system. How to improve target detection accuracy has therefore become an urgent technical problem to be solved.
Based on the above, the embodiment of the application provides a target detection method, a target detection device, electronic equipment and a storage medium, aiming at improving the accuracy of target detection.
The method and apparatus for detecting an object, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described by the following embodiments, and the method for detecting an object in the embodiments of the present application is described first.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiments of the application provide a target detection method, which relates to the technical field of digital healthcare. The target detection method provided by the embodiments of the application may be applied to a terminal, applied to a server, or implemented as software running on a terminal or server. In some embodiments, the terminal may be a smartphone, a tablet, a laptop, a desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and basic cloud computing services such as big data and artificial intelligence platforms; the software may be an application that implements the target detection method, but is not limited to the above forms.
Medical cloud refers to a medical and health service cloud platform created by combining medical technology with new technologies such as cloud computing, mobile technology, multimedia, 4G communication, big data, and the Internet of Things, realizing the sharing of medical resources and the expansion of medical coverage. Because it applies cloud computing, the medical cloud improves the efficiency of medical institutions and makes it more convenient for residents to seek medical care. Appointment registration, electronic medical records, and medical insurance in traditional hospitals are all products of the combination of cloud computing and the medical field, and the medical cloud also has the advantages of data security, information sharing, dynamic scalability, and overall planning.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a target detection method according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S108.
Step S101, obtaining a target image;
step S102, dividing the target image to obtain at least two partial images;
step S103, mapping all the partial images to a preset vector space to obtain an image embedded vector of each partial image;
step S104, extracting features of the image embedded vector to obtain target image features;
step S105, performing feature sampling on the target image features to obtain first image features, second image features and third image features, wherein the feature resolutions of the first image features, the second image features, the target image features and the third image features increase in that order;
step S106, carrying out feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
step S107, detecting the object position of the target image based on the fused image features to obtain the position data of the target object, wherein the position data is used for representing the position of the target object in the target image;
and step S108, detecting the object category of the target image based on the fused image features to obtain the object category of the target object.
In steps S101 to S108 of the embodiments of the present application, a target image is acquired and segmented into at least two partial images, and all the partial images are mapped to a preset vector space to obtain an image embedding vector for each partial image, so that a plurality of image embedding vectors containing both position information and partial image information are obtained, which improves the feature quality of the image embedding vectors. Further, feature extraction is performed on the image embedding vectors to obtain target image features, and feature sampling is performed on the target image features to obtain first, second, and third image features, where the feature resolutions of the first image features, the second image features, the target image features, and the third image features increase in that order; feature fusion is then performed on these features to obtain fused image features. This realizes multi-scale sampling of the target image features, yields image features at different scales, and improves the feature richness of the image features. Finally, object position detection is performed on the target image based on the fused image features to obtain position data of the target object, the position data representing the position of the target object in the target image, and object category detection is performed on the target image based on the fused image features to obtain the object category of the target object, so that the detection accuracy of the position and category of the target object can be improved.
In step S101 of some embodiments, the target image may be obtained through shooting by a camera, a video camera, or other shooting devices, or may be directly extracted from a preset image database, where the target image may be a three-dimensional image or a two-dimensional image, and is not limited.
In some embodiments, the target image may be obtained by computed tomography (Computed Tomography, CT), and in another embodiment, the three-dimensional image may also be obtained by magnetic resonance imaging (Magnetic Resonance Imaging, MRI).
It should be noted that, the target image may be an image under any scene, for example, the target image may be a pet image in a coco data set (such as a cat, a dog, etc.), a vehicle image (such as an airplane, a vehicle, etc.); the target image may also be a wide variety of vehicles in the intelligent driving dataset, and the like, without limitation.
In a medical application scenario, the target image is a medical image, and the type of object contained in the target image is a lesion, i.e., a part of an organism where a pathological change has occurred. Medical images are images of internal tissues, e.g., the stomach, abdomen, heart, knee, or brain, acquired in a non-invasive manner for medicine or medical research, such as CT (computed tomography), MRI (magnetic resonance imaging), US (ultrasound), and X-ray images, electroencephalograms, and images generated by medical instruments through optical photography.
In step S102 of some embodiments, when the target image is segmented, it may be divided into regions of equal area according to the size of the target image, so as to obtain at least two partial images of the same size. For example, according to the image size of the target image, the target image is divided into a plurality of pixel blocks of size 16×16, and each pixel block is used as a partial image; when segmenting the target image, an image processing tool such as OpenCV or a residual network may be used, without limitation. For example, if the image size of the target image is 224×224, the target image is divided into partial images of size 16×16, and the image size of each partial image is 16×16×3.
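For illustration only (the application does not specify an implementation), the 16×16 splitting described above can be sketched with standard tensor operations. The sketch below assumes a PyTorch tensor in (C, H, W) layout whose sides are divisible by the patch size; all function and variable names are hypothetical.

```python
import torch

def split_into_patches(image: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    # image: (C, H, W); H and W are assumed to be multiples of patch_size
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/ps, W/ps, ps, ps) -> (num_patches, C, ps, ps), one entry per partial image
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)

image = torch.randn(3, 224, 224)            # a 224x224 RGB target image
partial_images = split_into_patches(image)  # 196 partial images of size 3x16x16
print(partial_images.shape)                 # torch.Size([196, 16, 16]) per channel block: (196, 3, 16, 16)
```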
Referring to fig. 2, in some embodiments, step S103 may include, but is not limited to, steps S201 to S204:
step S201, performing position coding processing on the local image to obtain a position coding feature corresponding to the local image;
step S202, flattening the partial image to obtain a preliminary embedding vector;
step S203, carrying out feature addition on the preliminary embedded vector and the position coding feature to obtain an intermediate embedded vector;
In step S204, the intermediate embedded vector is mapped to a vector space to obtain an image embedded vector.
In step S201 of some embodiments, when the partial image is position-coded, either absolute position coding or relative position coding may be used. Specifically, for absolute position coding, an absolute position code for each pixel feature of the partial image is generated by a sine-cosine function, each pixel feature is position-marked according to this code, and the absolute position code serves as the position tag of the pixel feature. For relative position coding, distance values between every two pixel features of the partial image are computed (the distance may be, for example, the Euclidean distance or the Manhattan distance), and the pixel features are numbered according to the magnitude of these distance values; the resulting numbers can represent the feature order of the pixel features. In this way, the position order of each pixel feature in the partial image can be conveniently determined and used as a basis for position adjustment of the preliminary anchor frames generated in the subsequent target detection process, making the subsequently generated position data more accurate.
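As one possible reading of the absolute position coding described above, a sine-cosine code can be generated per position index, in the style commonly used in the Transformer literature. The exact formula is not fixed by the application, so the following is a hedged sketch under that assumption, with hypothetical dimension choices.

```python
import math
import torch

def sincos_position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    # One dim-dimensional code per position; even channels use sine, odd channels cosine.
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                          # (dim/2,)
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pos_codes = sincos_position_encoding(num_positions=196, dim=768)  # one code per partial image
```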
In step S202 of some embodiments, a partial image may be flattened by using a preset flattening layer or the like, and partial image information in the partial image may be extracted to obtain a one-dimensional preliminary embedding vector.
In step S203 of some embodiments, when the feature addition is performed on the preliminary embedded vector and the position-coding feature, the vector addition may be directly performed on the preliminary embedded vector and the position-coding feature, to obtain an intermediate embedded vector. For example, the preliminary embedding vector is E1, the position-coding feature is E2, and the intermediate embedding vector is e1+e2.
In step S204 of some embodiments, the intermediate embedded vector may be mapped to the vector space based on a multi-layer perceptron to obtain the image embedded vector, where the vector space is a preset high-dimensional vector space; that is, the multi-layer perceptron maps the intermediate embedded vector from a low-dimensional space to a high-dimensional space, so as to obtain an image embedded vector of higher feature quality.
In some other embodiments, a convolution layer with a convolution scale of 16×16 and a step size of 16 may be used to convolve the intermediate embedded vector, and map the intermediate embedded vector to a vector space to obtain the image embedded vector.
Through the steps S201 to S204, the local image can be subjected to position coding and feature extraction to obtain a plurality of image embedded vectors containing position information and local image information, so that feature quality of the image embedded vectors can be improved, global attention calculation can be performed based on the image embedded vectors, global image information of a target image can be obtained, and target detection can be performed based on important content of the obtained global image information, which is beneficial to improving accuracy of target detection.
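Steps S201 to S204 can be read as a standard patch-embedding pipeline: flatten each partial image, add its position coding feature, and project the result into the preset vector space. The sketch below makes that assumption and uses a single linear layer in place of the multi-layer perceptron; all names and sizes are hypothetical, and it reuses the tensors from the sketches above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        flat_dim = patch_size * patch_size * in_channels   # 768 for a 16x16x3 partial image
        self.project = nn.Linear(flat_dim, embed_dim)      # mapping into the vector space

    def forward(self, patches: torch.Tensor, pos_codes: torch.Tensor) -> torch.Tensor:
        # patches: (N, C, ps, ps); pos_codes: (N, flat_dim)
        preliminary = patches.flatten(start_dim=1)         # flattening -> preliminary embedding vectors
        intermediate = preliminary + pos_codes             # feature addition -> intermediate embedding vectors
        return self.project(intermediate)                  # image embedding vectors, (N, embed_dim)

embed = PatchEmbedding()
image_embeddings = embed(partial_images, pos_codes)        # (196, 768)
```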
Referring to fig. 3, in some embodiments, step S104 may include, but is not limited to, steps S301 to S306:
step S301, inputting an image embedding vector into a preset feature extraction network, wherein the feature extraction network comprises a first standardization layer, an attention layer, a second standardization layer and an MLP layer;
step S302, splicing the image embedded vectors to obtain a fusion embedded vector;
step S303, performing first normalization processing on the fusion embedded vector based on a first normalization layer to obtain a preliminary image coding feature;
step S304, performing attention calculation on the preliminary image coding features based on the attention layer to obtain intermediate image coding features;
step S305, performing second normalization processing on the intermediate image coding feature based on the second normalization layer to obtain a target image coding feature;
and step S306, performing feature mapping on the target image coding features based on the MLP layer to obtain target image features.
In step S301 of some embodiments, the image embedding vector may be input to a preset feature extraction network based on a preset computer program, wherein the feature extraction network includes a first normalization layer, an attention layer, a second normalization layer, and an MLP layer.
In step S302 of some embodiments, according to the position coding feature of each partial image, vector stitching is performed on the image embedding vectors in sequence, so as to obtain a fused embedding vector.
In step S303 of some embodiments, when the fusion embedded vector is subjected to the first normalization processing based on the first normalization layer, the fusion embedded vector is normalized, so that the mean and the variance of the fusion embedded vector meet a preset normalization condition, thereby obtaining the preliminary image coding feature, where the preset normalization condition may be that the mean of the fusion embedded vector is 0 and the variance is 1.
In step S304 of some embodiments, when performing attention computation on the preliminary image coding feature based on the attention layer, affine transformation is performed on the preliminary image coding feature to obtain an image key vector, an image value vector and an image query vector, and then attention computation is performed based on the image key vector, the image value vector and the image query vector to obtain the key image coding feature. And finally, splicing the image embedded vector and the key image coding feature to obtain the intermediate image coding feature.
In step S305 of some embodiments, when the second normalization processing is performed on the intermediate image coding feature based on the second normalization layer, the activation processing may be performed on the intermediate image coding feature, so as to implement normalization of the intermediate image coding feature. Specifically, the intermediate image coding feature is moved to an action area of an activation function of the second normalization layer, and then the intermediate image coding feature is subjected to normalization processing through the activation function, so that the target image coding feature is obtained, wherein the activation function can be a Relu function and the like.
In step S306 of some embodiments, feature mapping is performed on the target image coding features based on the MLP layer, the target image coding features are mapped to a low-dimensional vector space, more image feature information is obtained from the target image coding features, and then vector addition or vector connection is performed on the mapped target image coding features and key image coding features, so as to obtain target image features. Wherein the feature stride (feature_stride) of the target image feature is 16.
Through the steps S301 to S306, the global attention operation on the target image can be conveniently realized, the image feature information of the target image can be conveniently extracted, and the extracted image feature information is subjected to importance analysis to obtain the target image feature containing the important image feature information, so that the target image feature can be used in the subsequent target detection process, and the accuracy of target detection is improved.
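Read together, the first normalization layer, attention layer, second normalization layer and MLP layer of steps S301 to S306, with the splices above treated as residual connections, resemble a Transformer encoder block. The following sketch shows one plausible arrangement; it is not necessarily the exact network intended by the application, and the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                        # first normalization layer
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # attention layer
        self.norm2 = nn.LayerNorm(dim)                                        # second normalization layer
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))             # MLP layer

    def forward(self, fused_embedding: torch.Tensor) -> torch.Tensor:
        # fused_embedding: (batch, num_patches, dim), the spliced image embedding vectors
        x = self.norm1(fused_embedding)              # preliminary image coding features
        attn_out, _ = self.attn(x, x, x)             # key image coding features
        x = fused_embedding + attn_out               # splice with the image embedding vectors
        y = self.norm2(x)                            # target image coding features
        return x + self.mlp(y)                       # target image features

block = FeatureExtractionBlock()
target_features = block(image_embeddings.unsqueeze(0))   # (1, 196, 768)
```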
Referring to fig. 4, in some embodiments, step S304 may include, but is not limited to, steps S401 to S403:
step S401, affine transformation is carried out on the primary image coding features based on the attention layer, and an image key vector, an image value vector and an image query vector are obtained;
step S402, performing attention calculation based on the image key vector, the image value vector and the image query vector to obtain key image coding features;
and step S403, splicing the image embedded vector and the key image coding feature to obtain an intermediate image coding feature.
In step S401 of some embodiments, when affine transformation is performed on the preliminary image coding feature M based on the attention layer, preset affine parameters W_K, W_Q and W_V are acquired, and affine transformation is performed on the preliminary image coding feature based on these affine parameters to obtain an image key vector, an image value vector and an image query vector, where the image key vector can be expressed as K = M × W_K, the image value vector can be expressed as V = M × W_V, and the image query vector can be expressed as Q = M × W_Q.
In step S402 of some embodiments, attention calculation is performed on the image key vector, the image value vector and the image query vector by the softmax function to obtain a key image coding feature Z, and the attention calculation process may be expressed as shown in formula (1):
Z = softmax(Q·K^T / √d)·V    (1)
where d is the feature dimension of the preliminary image coding feature M, and K^T is the result of the transpose operation on the image key vector K.
The process of attention calculation can strengthen the mapping of important feature information in the primary image coding features, reduce the mapping of secondary feature information in the primary image coding features and improve the accuracy of image content of key image coding features.
In step S403 of some embodiments, when the image embedded vector and the key image coding feature are spliced, vector addition or vector connection may be performed on the image embedded vector and the key image coding feature to obtain an intermediate image coding feature.
Further, in order to reduce gradient loss between features, when splicing the image embedding vector and the key image coding features, the image embedding vector and the key image coding features can be spliced in a residual connection mode, so that accuracy of feature splicing is improved, and influence of feature information loss caused by gradient loss on a detection process is reduced.
Through the steps S401 to S403, the global attention operation on the target image can be conveniently realized, so that comprehensive and rich image content information can be extracted.
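For concreteness, the affine transformations K = M × W_K, V = M × W_V, Q = M × W_Q and formula (1) can be written out directly as a single-head attention routine. This is an illustrative sketch with hypothetical dimensions; the splice of step S403 is shown here as vector addition, one of the two options mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # affine parameters W_K, W_Q, W_V
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, m: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # m: preliminary image coding features, (num_patches, dim)
        k, q, v = self.w_k(m), self.w_q(m), self.w_v(m)    # image key / query / value vectors
        d = m.shape[-1]
        scores = q @ k.transpose(-2, -1) / d ** 0.5        # Q K^T / sqrt(d)
        z = F.softmax(scores, dim=-1) @ v                  # key image coding features, formula (1)
        return embedding + z                               # splice with the image embedding vectors

attn = SingleHeadAttention()
z = attn(torch.randn(196, 768), torch.randn(196, 768))    # toy inputs, for illustration only
```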
Referring to fig. 5, in some embodiments, step S105 may include, but is not limited to, steps S501 to S503:
step S501, performing first downsampling processing on target image features to obtain first image features;
step S502, performing second downsampling processing on the target image features to obtain second image features;
step S503, performing upsampling processing on the target image feature to obtain a third image feature.
In step S501 of some embodiments, a first downsampling process may be performed on the target image feature by using a pooling operation or the like, to reduce the feature dimension of the target image feature, and convert the target image feature with a feature stride of 16 into a first image feature with a feature stride of 4.
In step S502 of some embodiments, a second downsampling process may be performed on the target image features by using a pooling operation or the like, to reduce the feature dimension of the target image features and convert the target image features with a feature stride of 16 into second image features with a feature stride of 8.
In step S503 of some embodiments, the target image feature may be up-sampled by deconvolution, anti-pooling, convolution transpose, or interpolation, or the like, and feature-amplified, and the target image feature with a feature stride of 16 is converted into a third image feature with a feature stride of 32.
It should be noted that, through the above multi-scale sampling operation, the feature resolutions of the first image feature, the second image feature, the target image feature, and the third image feature are sequentially increased.
Because the multi-scale feature sampling obtains features of different scales directly from the target image features, the process of extracting hierarchical features and converting them into non-hierarchical features is dispensed with, and image features at different scales can be extracted from images of different resolutions more conveniently.
It should be noted that the target detection system constructed based on the target detection method of the embodiments of the present application includes an image preprocessing network, a feature extraction network, a feature sampling network, and an object detection network. The image preprocessing network implements the image processing procedure of steps S101 to S103, the feature extraction network implements the feature extraction procedure of step S104, and the feature sampling network implements the multi-scale sampling procedure of step S105, where the feature sampling network is a non-hierarchical network; the object detection network implements the object detection procedure of steps S106 to S108. The target detection method of the embodiments of the application not only exploits the advantage of global attention sampling but also makes the target detection process simpler, more practical, and more transferable, so that the target detection system can be used in more application scenarios. That is, the target detection method of the embodiments of the application allows the target detection system to be constructed based on a non-hierarchical vision Transformer model; compared with target detection systems in the related art, the design of this system is simpler, and an optimal detection effect can be achieved with minimal modification of the model architecture while the basic model architecture is retained.
Through the steps S501 to S503, the multi-scale sampling of the target image features can be realized, the image features under different scales are obtained, the feature richness of the image features is improved, and the target detection based on more image feature information in the subsequent target detection process is facilitated, so that the accuracy of the target detection is improved.
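Steps S501 to S503 can be sketched as a simple, non-hierarchical feature pyramid built from the single stride-16 feature map, in the spirit of plain-backbone detector designs. The pairing below (deconvolution for the finer strides 4 and 8, max pooling for the coarser stride 32) is a common construction and is an assumption here, since the description leaves the exact pairing of operations and strides open; all module names are hypothetical.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Builds stride-4/8/16/32 feature maps from a single stride-16 map."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.to_stride4 = nn.Sequential(                                          # stride 16 -> stride 4
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2))
        self.to_stride8 = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)   # stride 16 -> stride 8
        self.to_stride32 = nn.MaxPool2d(kernel_size=2, stride=2)                  # stride 16 -> stride 32

    def forward(self, feat16: torch.Tensor):
        # feat16: (batch, dim, H/16, W/16), the target image features reshaped into a 2-D map
        return self.to_stride4(feat16), self.to_stride8(feat16), feat16, self.to_stride32(feat16)

pyramid = SimpleFeaturePyramid()
feat16 = target_features.transpose(1, 2).reshape(1, 768, 14, 14)  # 196 tokens -> 14x14 map
f4, f8, f16, f32 = pyramid(feat16)
```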
In step S106 of some embodiments, when the first image feature, the second image feature, the target image feature, and the third image feature are fused, vector addition may be performed on them to obtain the fused image feature, or vector concatenation may be performed on them to obtain the fused image feature. The fused image feature contains image information of the target image at different scales and can therefore comprehensively reflect the image content of the target image; when target detection is performed based on the fused image feature, the accuracy of target detection can be effectively improved.
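The application allows either vector addition or vector concatenation for the fusion step. Features at different scales must first be made compatible; resampling them to a common resolution, as below, is one simple implementation assumption rather than something prescribed by the description.

```python
import torch
import torch.nn.functional as F

def fuse_features(f4, f8, f16, f32, mode: str = "add") -> torch.Tensor:
    # Resample every map to the stride-16 resolution before combining (an assumption).
    target_size = f16.shape[-2:]
    maps = [F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            for f in (f4, f8, f16, f32)]
    if mode == "add":
        return torch.stack(maps, dim=0).sum(dim=0)   # fused image feature by vector addition
    return torch.cat(maps, dim=1)                    # fused image feature by concatenation

fused = fuse_features(f4, f8, f16, f32)              # reuses the pyramid outputs above
```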
Referring to fig. 6, in some embodiments, step S107 includes, but is not limited to, steps S601 to S604:
step S601, acquiring a plurality of initial anchor frames and initial position data of each initial anchor frame based on the fused image characteristics;
step S602, calculating offset of the initial anchor frame to obtain offset data of the initial anchor frame;
step S603, performing position adjustment on the initial anchor frame based on the offset data to obtain intermediate anchor frames and intermediate position data of each intermediate anchor frame;
step S604, screening the intermediate position data to obtain the position data of the target object.
In step S601 of some embodiments, the tensor size of the initial anchor frames is set according to the feature size of the fused image feature, and the fused image feature is divided into a plurality of grid areas based on initial anchor frames of this fixed tensor size, with each grid area used as an initial anchor frame. The initial anchor frames are rectangular, and the position elements of each initial anchor frame are acquired according to a preset grid coordinate system; the position elements of an initial anchor frame comprise (x1, y1, w1, h1), where (x1, y1) are the coordinates of the center point of the initial anchor frame determined according to the preset grid coordinate system, w1 is the width of the initial anchor frame, and h1 is its height. These four position elements form the initial position data of each initial anchor frame.
In step S602 of some embodiments, linear regression is performed on each initial anchor frame by using YOLO algorithm to predict whether each initial anchor frame has a target object and the offset condition of the initial anchor frame, so as to obtain offset data and initial confidence coefficient c1 of each initial anchor frame, where c1 is the probability that the initial anchor frame contains the target object.
In step S603 of some embodiments, the initial anchor frames are position-adjusted based on the offset data, where the position adjustment includes adjusting the center coordinates of each initial anchor frame according to the offset data and adjusting its width and height according to the offset data, so as to obtain intermediate anchor frames and the intermediate position data of each intermediate anchor frame. The intermediate position data of each intermediate anchor frame comprise (x2, y2, w2, h2, c2), where (x2, y2) are the coordinates of the center point of the intermediate anchor frame determined according to the preset grid coordinate system, w2 is the width of the intermediate anchor frame, h2 is its height, and c2 is an intermediate confidence representing the probability that the intermediate anchor frame contains the target object.
In step S604 of some embodiments, according to a preset threshold, an intermediate anchor frame with an intermediate confidence level exceeding the preset threshold is taken as a target anchor frame containing the target object, and intermediate position data of the target anchor frame is taken as position data of the target object, where the position data is used to represent the position of the target object in the target image.
The steps S601 to S604 can detect the target object in the target image more conveniently, and determine the position of the target object in the target image, so that the accuracy of detecting the position of the target object can be improved.
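Steps S601 to S604 can be illustrated with a small routine that applies predicted offsets to the initial anchor frames and keeps those whose confidence exceeds a threshold. The offset parameterization below (additive shift of the center, multiplicative scaling of width and height) is one common YOLO-style choice and is an assumption; the description does not fix the exact formula.

```python
import torch

def adjust_and_filter_anchors(anchors: torch.Tensor, offsets: torch.Tensor,
                              confidences: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # anchors, offsets: (N, 4) as (x, y, w, h); confidences: (N,)
    x = anchors[:, 0] + offsets[:, 0]                 # adjust the center coordinates
    y = anchors[:, 1] + offsets[:, 1]
    w = anchors[:, 2] * torch.exp(offsets[:, 2])      # adjust width and height
    h = anchors[:, 3] * torch.exp(offsets[:, 3])
    intermediate = torch.stack([x, y, w, h, confidences], dim=1)   # intermediate position data
    return intermediate[confidences > threshold]                   # position data of target objects

anchors = torch.tensor([[8.0, 8.0, 16.0, 16.0], [24.0, 8.0, 16.0, 16.0]])
offsets = torch.tensor([[0.5, -0.3, 0.1, 0.2], [0.0, 0.0, 0.0, 0.0]])
conf = torch.tensor([0.91, 0.12])
print(adjust_and_filter_anchors(anchors, offsets, conf))  # keeps only the first anchor
```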
Referring to fig. 7, in some embodiments, step S108 may include, but is not limited to, steps S701 to S702:
step S701, performing class scoring on a target object of a target image based on a preset function and fusion image features to obtain target scoring data, wherein the target scoring data is used for representing the probability that the object class of the target object belongs to each preset candidate object class;
step S702, screening object classes from candidate object classes based on the target scoring data.
In step S701 of some embodiments, the preset function may be a softmax function or the like, without limitation. Taking a softmax function as an example, creating probability distribution of a target object on each candidate object category based on a softmax classifier and fusion image features, scoring the category of the target object is realized, and taking the probability distribution vector of each candidate object category as target scoring data of the target object on the candidate object category, namely, the target scoring data is used for representing the probability that the object category of the target object belongs to each preset candidate object category.
In step S702 of some embodiments, since the magnitude of the target scoring data clearly reflects the degree of correlation between the target object and each candidate object class (the larger the target scoring data corresponding to a candidate object class, the higher the likelihood that the target object belongs to that class), the candidate object class with the largest target scoring data is selected as the object class of the target object. The candidate object classes may comprise multi-level class labels; for example, first-level class labels may include person, animal, plant, vehicle, and the like, and second-level class labels may include cat, dog, and fish under the animal class, and automobile, bicycle, and the like under the vehicle class, without being limited thereto.
In the field of digital medicine, candidate classes include various organs and tissues, such as ear, nose, throat, intestines, skin, and the like.
Through the steps S701 to S702, the type of the target object in the target image can be identified according to the feature information contained in the fused image feature, so that the object type of the target object in the target image is determined, and the accuracy of detecting the object type of the target object is improved.
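Steps S701 to S702 amount to a softmax over per-class scores followed by selecting the highest-scoring candidate class. A minimal sketch, assuming a hypothetical linear classification head over a pooled fused image feature:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

candidate_classes = ["person", "animal", "plant", "vehicle"]   # hypothetical candidate object classes

class ClassHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        logits = self.fc(fused_feature)      # one score per candidate object class
        return F.softmax(logits, dim=-1)     # target scoring data (class probabilities)

head = ClassHead(dim=768, num_classes=len(candidate_classes))
scores = head(torch.randn(1, 768))           # a pooled fused image feature, for illustration
predicted = candidate_classes[scores.argmax(dim=-1).item()]
print(predicted, scores)
```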
The target detection method of the embodiments of the present application acquires a target image, segments the target image into at least two partial images, and maps all the partial images to a preset vector space to obtain an image embedding vector for each partial image, so that a plurality of image embedding vectors containing both position information and partial image information are obtained and the feature quality of the image embedding vectors is improved. Further, feature extraction is performed on the image embedding vectors to obtain target image features, which conveniently realizes a global attention operation on the target image, extracts the image feature information of the target image, and analyzes the importance of the extracted information to obtain target image features containing the important image feature information. Feature sampling is then performed on the target image features in a non-hierarchical manner to obtain first, second, and third image features, where the feature resolutions of the first image features, the second image features, the target image features, and the third image features increase in that order, and feature fusion is performed on these features to obtain fused image features. This realizes multi-scale sampling of the target image features, yields image features at different scales, and improves the feature richness of the image features. Finally, object position detection is performed on the target image based on the fused image features to obtain position data of the target object, the position data representing the position of the target object in the target image, and object category detection is performed on the target image based on the fused image features to obtain the object category of the target object. The detection accuracy of the position and category of the target object can thereby be improved, so that lesion regions and lesion categories in medical images can be detected effectively, the disease information corresponding to a medical image can be identified in a targeted manner, and the auxiliary identification of diseases can be improved.
Referring to fig. 8, an embodiment of the present application further provides an object detection apparatus, which may implement the above object detection method, where the apparatus includes:
an image acquisition module 801, configured to acquire a target image;
an image segmentation module 802, configured to perform segmentation processing on the target image to obtain at least two partial images;
an image mapping module 803, configured to map all the partial images to a preset vector space to obtain an image embedding vector of each partial image;
a feature extraction module 804, configured to perform feature extraction on the image embedding vectors to obtain target image features;
a feature sampling module 805, configured to perform feature sampling on the target image features to obtain a first image feature, a second image feature and a third image feature, where the feature resolutions of the first image feature, the second image feature, the target image feature and the third image feature increase in sequence;
a feature fusion module 806, configured to perform feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
a position detection module 807, configured to perform object position detection on the target image based on the fused image feature to obtain position data of the target object, where the position data are used to characterize the position of the target object in the target image;
a class detection module 808, configured to perform object class detection on the target image based on the fused image feature to obtain the object class of the target object.
The specific implementation of the target detection apparatus is substantially the same as that of the target detection method described above and will not be repeated here.
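For illustration only, a minimal PyTorch sketch of how the modules 801 to 808 might be composed is given below. The 16x16 patch size, the single transformer encoder layer, the pooling-based stand-in for feature sampling, and the linear fusion layer are placeholder assumptions, not the actual network design of the embodiments.

```python
import torch
from torch import nn


class TargetDetector(nn.Module):
    """Sketch of the apparatus of Fig. 8: each attribute mirrors one of the modules
    801-808. The layer choices are placeholders for illustration only."""

    def __init__(self, patch: int = 16, embed_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.patch = patch
        self.image_mapping = nn.LazyLinear(embed_dim)               # module 803: patch -> embedding vector
        self.feature_extraction = nn.TransformerEncoderLayer(       # module 804: attention-based extractor
            d_model=embed_dim, nhead=8, batch_first=True)
        self.feature_fusion = nn.Linear(4 * embed_dim, embed_dim)   # module 806: fuse four scales
        self.position_head = nn.Linear(embed_dim, 4)                # module 807: box regression
        self.class_head = nn.Linear(embed_dim, num_classes)         # module 808: class scores

    def forward(self, image: torch.Tensor):
        b, c, _, _ = image.shape                                    # module 801: target image
        p = self.patch
        # module 802: segmentation into non-overlapping local images (patches)
        patches = image.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.image_mapping(patches)                        # module 803: image embedding vectors
        target_feat = self.feature_extraction(tokens)               # module 804: target image features
        # module 805: crude stand-in for multi-scale feature sampling (pooling over token subsets)
        first = target_feat[:, ::4].mean(dim=1)
        second = target_feat[:, ::2].mean(dim=1)
        target = target_feat.mean(dim=1)
        third = target_feat.mean(dim=1)  # a real upsampled feature would have a higher resolution
        fused = self.feature_fusion(torch.cat([first, second, target, third], dim=-1))  # module 806
        return self.position_head(fused), self.class_head(fused)    # modules 807 and 808


detector = TargetDetector()
boxes, scores = detector(torch.randn(1, 3, 224, 224))
print(boxes.shape, scores.shape)  # torch.Size([1, 4]) torch.Size([1, 10])
```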
The embodiment of the application also provides an electronic device, which comprises a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for realizing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the above target detection method. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present disclosure are implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the target detection method of the embodiments of the present disclosure;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication and interaction between this device and other devices, and the communication may be implemented in a wired manner (e.g. USB, network cable) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the target detection method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the application provides a target detection method, a target detection apparatus, an electronic device and a computer-readable storage medium. A target image is acquired; segmentation processing is performed on the target image to obtain at least two partial images; and all the partial images are mapped to a preset vector space to obtain an image embedding vector of each partial image, so that a plurality of image embedding vectors containing position information and partial-image information are obtained and the feature quality of the image embedding vectors is improved. Further, feature extraction is performed on the image embedding vectors to obtain target image features, which facilitates a global attention operation on the target image: the image feature information of the target image is extracted, and importance analysis is performed on the extracted information to obtain target image features containing the important image feature information. The target image features are sampled in a non-hierarchical manner to obtain a first image feature, a second image feature and a third image feature, where the feature resolutions of the first image feature, the second image feature, the target image feature and the third image feature increase in sequence; feature fusion is performed on these features to obtain a fused image feature, realizing multi-scale sampling of the target image features, yielding image features at different scales and improving feature richness. Finally, object position detection is performed on the target image based on the fused image feature to obtain position data of the target object, the position data characterizing the position of the target object in the target image; and object class detection is performed on the target image based on the fused image feature to obtain the object class of the target object. This improves the detection accuracy of both the position and the class of the target object, so that lesion areas and lesion types in medical images can be detected effectively, the disease information corresponding to a medical image can be identified in a targeted manner, and computer-aided identification of diseases is improved.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1 to fig. 7 do not limit the embodiments of the application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of target detection, the method comprising:
acquiring a target image;
dividing the target image to obtain at least two partial images;
mapping all the partial images to a preset vector space to obtain an image embedding vector of each partial image;
extracting features of the image embedded vector to obtain target image features;
feature sampling is carried out on the target image features to obtain first image features, second image features and third image features, wherein feature resolutions of the first image features, the second image features, the target image features and the third image features are sequentially increased;
performing feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
performing object position detection on the target image based on the fused image feature to obtain position data of a target object, wherein the position data are used for representing the position of the target object in the target image;
and performing object class detection on the target image based on the fused image feature to obtain the object class of the target object.
2. The method according to claim 1, wherein mapping all the partial images to a preset vector space to obtain an image embedded vector of each partial image, comprises:
performing position coding processing on the local image to obtain a position coding feature corresponding to the local image;
flattening the partial image to obtain a preliminary embedding vector;
performing feature addition on the preliminary embedded vector and the position coding feature to obtain an intermediate embedded vector;
and mapping the intermediate embedded vector to the vector space to obtain the image embedded vector.
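For illustration, a minimal PyTorch sketch of the mapping recited in claim 2 is given below; the learnable position code and the tensor shapes are assumptions, since the claim does not fix the position-coding scheme or the dimensions.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Sketch of claim 2: position coding + flattening + feature addition + linear mapping.
    A learnable position code is one possible choice; the claim leaves the coding open."""

    def __init__(self, num_patches: int, patch_dim: int, embed_dim: int):
        super().__init__()
        self.position_code = nn.Parameter(torch.zeros(num_patches, patch_dim))  # position coding feature
        self.to_vector_space = nn.Linear(patch_dim, embed_dim)                  # mapping to the preset vector space

    def forward(self, local_images: torch.Tensor) -> torch.Tensor:
        # local_images: (B, N, p, p, C) -> flatten each local image into a preliminary embedding vector
        preliminary = local_images.flatten(start_dim=2)    # (B, N, p*p*C)
        intermediate = preliminary + self.position_code    # feature addition -> intermediate embedding vector
        return self.to_vector_space(intermediate)          # image embedding vector


embed = PatchEmbedding(num_patches=196, patch_dim=16 * 16 * 3, embed_dim=256)
print(embed(torch.randn(2, 196, 16, 16, 3)).shape)  # torch.Size([2, 196, 256])
```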
3. The target detection method according to claim 1, wherein the performing feature extraction on the image embedding vector to obtain the target image feature comprises:
inputting the image embedding vector into a preset feature extraction network, wherein the feature extraction network comprises a first standardization layer, an attention layer, a second standardization layer and an MLP layer;
splicing the image embedded vectors to obtain fusion embedded vectors;
performing first normalization processing on the fusion embedded vector based on the first normalization layer to obtain a preliminary image coding feature;
performing attention calculation on the preliminary image coding features based on the attention layer to obtain intermediate image coding features;
performing second normalization processing on the intermediate image coding features based on the second normalization layer to obtain target image coding features;
and performing feature mapping on the target image coding features based on the MLP layer to obtain the target image features.
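The feature extraction network of claim 3 can be sketched as follows. LayerNorm for the two normalization layers and standard multi-head self-attention for the attention layer are assumptions, as the claim names the layers but not their exact form.

```python
import torch
from torch import nn


class FeatureExtractionBlock(nn.Module):
    """Sketch of the claim-3 network: first normalization layer, attention layer,
    second normalization layer and MLP layer, applied to the fused embedding vectors."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        self.first_norm = nn.LayerNorm(embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.second_norm = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, embed_dim))

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (B, N, D), the spliced ("fusion") embedding vectors of all patches
        preliminary = self.first_norm(image_embeddings)                           # first normalization
        intermediate, _ = self.attention(preliminary, preliminary, preliminary)   # attention calculation
        target_coding = self.second_norm(intermediate)                            # second normalization
        return self.mlp(target_coding)                                            # MLP mapping -> target image features


block = FeatureExtractionBlock()
print(block(torch.randn(2, 196, 256)).shape)  # torch.Size([2, 196, 256])
```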
4. The method according to claim 3, wherein the performing attention calculation on the preliminary image coding feature based on the attention layer to obtain an intermediate image coding feature includes:
affine transformation is carried out on the preliminary image coding features based on the attention layer, so that an image key vector, an image value vector and an image query vector are obtained;
performing attention calculation based on the image key vector, the image value vector and the image query vector to obtain key image coding features;
and splicing the image embedded vector and the key image coding feature to obtain the intermediate image coding feature.
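A minimal sketch of the attention calculation of claim 4 follows. Reading the final "splicing" as channel-wise concatenation of the image embedding vector with the key image coding feature is an assumption, and the linear (affine) projections are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F


class ClaimFourAttention(nn.Module):
    """Sketch of claim 4: affine transformation into key/value/query vectors,
    scaled dot-product attention, then splicing with the image embedding vector."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.to_key = nn.Linear(embed_dim, embed_dim)     # affine transformation -> image key vector
        self.to_value = nn.Linear(embed_dim, embed_dim)   # -> image value vector
        self.to_query = nn.Linear(embed_dim, embed_dim)   # -> image query vector

    def forward(self, preliminary: torch.Tensor, image_embedding: torch.Tensor) -> torch.Tensor:
        k, v, q = self.to_key(preliminary), self.to_value(preliminary), self.to_query(preliminary)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5     # attention calculation
        key_coding = F.softmax(scores, dim=-1) @ v                # key image coding features
        return torch.cat([image_embedding, key_coding], dim=-1)   # intermediate image coding features


attn = ClaimFourAttention()
x = torch.randn(2, 196, 256)
print(attn(x, x).shape)  # torch.Size([2, 196, 512])
```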
5. The target detection method according to claim 1, wherein the performing feature sampling on the target image features to obtain a first image feature, a second image feature, and a third image feature includes:
performing first downsampling processing on the target image features to obtain the first image features;
performing second downsampling processing on the target image features to obtain second image features;
and carrying out up-sampling processing on the target image feature to obtain the third image feature.
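Claim 5 can be sketched on a 2-D feature map as follows; max pooling with strides 4 and 2 for the two downsampling steps and bilinear interpolation for the upsampling step are illustrative choices only.

```python
import torch
import torch.nn.functional as F


def sample_features(target_feature: torch.Tensor):
    """Sketch of claim 5 on a feature map of shape (B, C, H, W): two downsampling
    steps and one upsampling step applied to the same target image feature."""
    first = F.max_pool2d(target_feature, kernel_size=4)                        # 1/4 resolution
    second = F.max_pool2d(target_feature, kernel_size=2)                       # 1/2 resolution
    third = F.interpolate(target_feature, scale_factor=2, mode="bilinear",
                          align_corners=False)                                 # 2x resolution
    return first, second, third


t = torch.randn(1, 256, 32, 32)
f1, f2, f3 = sample_features(t)
print(f1.shape, f2.shape, t.shape, f3.shape)
# resolutions increase in order: first (8x8) < second (16x16) < target (32x32) < third (64x64)
```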
6. The target detection method according to claim 1, wherein the performing object position detection on the target image based on the fused image features to obtain position data of a target object includes:
acquiring a plurality of initial anchor frames and initial position data of each initial anchor frame based on the fused image characteristics;
performing offset calculation on the initial anchor frame to obtain offset data of the initial anchor frame;
performing position adjustment on the initial anchor frame based on the offset data to obtain intermediate anchor frames and intermediate position data of each intermediate anchor frame;
and screening the intermediate position data to obtain the position data of the target object.
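A minimal sketch of claim 6 is given below, using torchvision; additive offsets for the position adjustment and non-maximum suppression for the screening step are assumptions, since the claim does not specify the offset model or the screening rule.

```python
import torch
from torchvision.ops import nms


def detect_positions(initial_anchors: torch.Tensor, offsets: torch.Tensor,
                     scores: torch.Tensor, iou_threshold: float = 0.5) -> torch.Tensor:
    """Sketch of claim 6. Anchors and offsets are (N, 4) boxes in (x1, y1, x2, y2) form."""
    intermediate = initial_anchors + offsets          # position adjustment -> intermediate anchor frames
    keep = nms(intermediate, scores, iou_threshold)   # screening of the intermediate position data
    return intermediate[keep]                         # position data of the target object


anchors = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 150., 150.]])
offsets = torch.zeros_like(anchors)
scores = torch.tensor([0.9, 0.8, 0.7])
print(detect_positions(anchors, offsets, scores))
```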
7. The target detection method according to any one of claims 1 to 6, wherein the performing object class detection on the target image based on the fused image features to obtain the object class of the target object includes:
performing class scoring on the target object of the target image based on a preset function and the fused image features to obtain target scoring data, wherein the target scoring data is used for representing the probability that the target object belongs to each preset candidate object class;
and screening the object category from the candidate object categories based on the target scoring data.
8. An object detection device, the device comprising:
the image acquisition module is used for acquiring a target image;
the image segmentation module is used for carrying out segmentation processing on the target image to obtain at least two partial images;
the image mapping module is used for mapping all the local images to a preset vector space to obtain an image embedded vector of each local image;
the feature extraction module is used for extracting features of the image embedded vector to obtain target image features;
the feature sampling module is used for carrying out feature sampling on the target image features to obtain first image features, second image features and third image features, wherein feature resolutions of the first image features, the second image features, the target image features and the third image features are sequentially increased;
the feature fusion module is used for carrying out feature fusion on the first image feature, the second image feature, the target image feature and the third image feature to obtain a fused image feature;
the position detection module is used for detecting the object position of the target image based on the fusion image characteristics to obtain position data of the target object, wherein the position data are used for representing the position of the target object in the target image;
and the class detection module is used for detecting the object class of the target image based on the fusion image characteristics to obtain the object class of the target object.
9. An electronic device comprising a memory storing a computer program and a processor that when executing the computer program implements the object detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the object detection method of any one of claims 1 to 7.
CN202310674299.8A 2023-06-07 2023-06-07 Target detection method, target detection device, electronic device, and storage medium Pending CN116630292A (en)

Priority Applications (1)

CN202310674299.8A, filed 2023-06-07: Target detection method, target detection device, electronic device, and storage medium

Publications (1)

CN116630292A, published 2023-08-22 (Family ID: 87621128; Country: CN)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination