CN115082430B - Image analysis method and device and electronic equipment - Google Patents

Image analysis method and device and electronic equipment

Info

Publication number
CN115082430B
CN115082430B
Authority
CN
China
Prior art keywords
image
layer
visual
sequence
analyzed
Legal status
Active
Application number
CN202210851146.1A
Other languages
Chinese (zh)
Other versions
CN115082430A
Inventor
朱优松
陈志扬
赵朝阳
李朝闻
王金桥
唐明
Current Assignee
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202210851146.1A
Publication of CN115082430A
Application granted
Publication of CN115082430B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 9/00 Image coding
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Abstract

The invention provides an image analysis method, an image analysis device, and electronic equipment, relating to the technical field of computer vision. The image analysis method comprises: acquiring an image to be analyzed for a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, for each target object feature in the image to be analyzed that is relevant to the target visual task. The visual analysis model is used to extract the target object features from the image features of the image to be analyzed based on the target visual task, and to generate an attribute sequence describing each target object feature; the model is trained on sample images corresponding to the target visual task and on label data corresponding to those sample images. The technical scheme provided by the invention unifies different visual tasks as a sequence-description problem over objects in the image to be analyzed, so that one visual analysis model is universal across visual tasks, improving the processing efficiency of visual tasks and reducing development cost.

Description

Image analysis method and device and electronic equipment
Technical Field
The invention relates to the technical field of computer vision, in particular to an image analysis method and device and electronic equipment.
Background
Computer vision is the science of using computers to simulate the human visual system, giving computers capabilities similar to the human extraction, processing, understanding, and analysis of images and image sequences. It plays an increasingly important role in fields such as security, manufacturing, government affairs, and medical treatment.
In computer vision, unlike natural language, which can be modeled as sequence-to-sequence problems, the definitions of vision tasks usually differ greatly, and different vision tasks require different model structures to be designed and processed separately. Although unified visual analysis models have been proposed to handle multiple visual tasks, such as multi-task learning, these methods mainly train the dedicated model structures of multiple tasks together with a shared backbone network; they can only handle a limited and fixed combination of tasks, and once a new task appears, a new model structure must be additionally designed and retrained, so task processing efficiency is low and development cost is high.
Disclosure of Invention
The invention provides an image analysis method, an image analysis device, and electronic equipment, which address the defects of low processing efficiency and high development cost caused by the need, in the prior art, to process different visual tasks with models of different structures, and which realize a unified serialized representation of the output results of various visual tasks.
The invention provides an image analysis method, which comprises the following steps:
acquiring an image to be analyzed of a target visual task;
inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model;
the visual analysis model is used for extracting target object features in image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features;
the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
According to an image analysis method provided by the present invention, the inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for a target visual task in the image to be analyzed output by the visual analysis model, includes:
inputting the image to be analyzed into an image feature coding layer of the visual analysis model to obtain the image features of the image to be analyzed output by the image feature coding layer;
inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining target object features aiming at the target visual task in the image features output by the self-attention decoding layer;
and inputting the target object characteristics into a sequence generation layer of the visual analysis model, and obtaining an attribute sequence of the target object characteristics output by the sequence generation layer.
According to the image analysis method provided by the invention, the sequence generation layer outputs the attribute sequence of the target object feature in time order; the sequence generation layer comprises a sequence self-attention layer, an image mutual attention layer, and a linear layer;
the sequence self-attention layer is used for performing self-attention calculation, taking the input feature at the current moment as the query, and the input features at the current moment and all previous moments as the keys and values;
the image mutual attention layer is used for performing mutual attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values;
and the linear layer is used for converting the output of the image mutual attention layer into numerical values to obtain the attribute sequence.
According to an image analysis method provided by the present invention, the inputting the target object feature into a sequence generation layer of the visual analysis model, and obtaining a sequence of attributes of the target object feature output by the sequence generation layer, includes:
inputting the output characteristic of the image mutual attention layer at the previous moment as the input characteristic of the current moment into the sequence self-attention layer to obtain the self-attention value of the sequence self-attention layer at the current moment, wherein the input characteristic of the sequence self-attention layer at the initial moment is the characteristic of the target object;
inputting the self-attention value of the current moment into the image mutual attention layer to obtain the output characteristic of the current moment output by the image mutual attention layer;
and inputting the output characteristics of the current moment into the linear layer to obtain attribute elements of the current moment output by the linear layer, wherein the attribute elements form the attribute sequence.
According to an image analysis method provided by the present invention, the inputting the image to be analyzed into an image feature coding layer of the visual analysis model to obtain the image feature of the image to be analyzed output by the image feature coding layer includes:
inputting the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model to obtain an initial image feature of the image to be analyzed output by the residual network layer, wherein the residual network layer is used for mapping the image to be analyzed to an image feature space;
inputting the initial image features into a self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer, wherein the self-attention coding layer is used for self-attention coding.
According to an image analysis method provided by the present invention, the length of the attribute sequence is determined based on the target visual task.
The present invention also provides an image analysis apparatus comprising:
the acquisition module is used for acquiring an image to be analyzed of the target vision task;
the analysis module is used for inputting the image to be analyzed into a visual analysis model and obtaining an attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model;
the visual analysis model is used for extracting target object features in the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
According to an image analysis apparatus provided by the present invention, the analysis module includes:
the first analysis unit is used for inputting the image to be analyzed into an image feature coding layer of the visual analysis model and obtaining the image features of the image to be analyzed output by the image feature coding layer;
a second analysis unit, configured to input the image feature into a self-attention decoding layer of the visual analysis model, and obtain a target object feature for the target visual task in the image feature output by the self-attention decoding layer;
and the third analysis unit is used for inputting the target object characteristics into the sequence generation layer of the visual analysis model and obtaining the attribute sequence of the target object characteristics output by the sequence generation layer.
The present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the image analysis method as described in any one of the above methods when executing the computer program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image analysis method as described in any of the above.
According to the image analysis method, the image analysis device and the electronic equipment, the image to be analyzed is obtained and is input into the visual analysis model, so that the attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model can be obtained. The visual analysis model is obtained based on a sample image corresponding to a target visual task and label data corresponding to the sample image, the target object characteristics in the image characteristics of the image to be analyzed can be extracted based on the target visual task, and an attribute sequence describing the characteristics of the target object is generated.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an image analysis method provided by the present invention;
FIG. 2 is a schematic diagram of a visual analysis model provided by the present invention;
FIG. 3 is a second schematic flow chart of an image analysis method provided by the present invention;
FIG. 4 is a schematic diagram of the structure of a sequence generation layer provided by the present invention;
FIG. 5 is a third schematic flow chart of an image analysis method provided by the present invention;
FIG. 6 is a schematic diagram of the operation of the sequence generation layer provided by the present invention;
FIG. 7 is a schematic diagram of the working principle of the visual analysis model provided by the present invention;
FIG. 8 is a schematic diagram of the logic for the sequence generation layer provided by the present invention in performing object detection tasks;
FIG. 9 is a schematic diagram of the operating logic of the sequence generating layer provided by the present invention when performing a human body posture estimation task;
FIG. 10 is a schematic diagram of an image analysis apparatus provided in the present invention;
fig. 11 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In natural language processing, many large general-purpose language models have emerged. Language tasks can often be modeled as sequence-to-sequence (Seq2Seq) generation problems, and a large number of different problems can be handled with the same self-attention architecture, such as the Transformer model. Large-scale pre-trained language models can be adapted with a small amount of training data and applied to a large number of language tasks.
In view of these significant achievements in the field of natural language processing, a similar general model structure designed for visual tasks, capable of processing multiple visual tasks, would help to improve the processing efficiency of visual tasks and reduce the development cost of visual task processing.
However, in the field of computer vision, the definitions of visual tasks often differ greatly. Unlike natural language, which is composed of sequences, visual tasks are difficult to represent uniformly in the same form, so different model structures need to be designed and processed separately when facing different visual tasks.
In the related art, most unified large visual models used for visual tasks are pre-trained backbone network models. These can be used to extract image features, but an additional task-specific structure still needs to be designed for each visual task. Multi-task learning is a way to train the dedicated model structures of multiple tasks together with a shared backbone network model; it can only process a limited and fixed combination of tasks, and once a new task appears, a new model structure must be additionally designed and retrained.
Based on this, an embodiment of the present invention provides an image analysis method that obtains an image to be analyzed for a target visual task and inputs it into a visual analysis model. The visual analysis model extracts, based on the target visual task, the target object features from the image features of the image to be analyzed, generates an attribute sequence describing each target object feature, and outputs, for each target object feature relevant to the target visual task, the corresponding attribute sequence. The visual analysis model is trained on sample images corresponding to the target visual task and label data corresponding to the sample images. The attribute sequence provides a general expression form for visual task targets, so different visual tasks can be unified as a sequence-description task over objects in the image to be analyzed, improving the universality of the visual analysis model across visual tasks.
The image analysis method of the present invention is described below with reference to fig. 1 to 9. The image analysis method can be applied to electronic equipment such as a server, a mobile phone, a computer and the like, and can also be applied to an image analysis device arranged in the electronic equipment such as the server, the mobile phone, the computer and the like, and the image analysis device can be realized by software or combination of the software and hardware.
Fig. 1 schematically illustrates one of the flow diagrams of the image analysis method provided by the embodiment of the present invention, and referring to fig. 1, the image analysis method may include the following steps 110 to 120.
Step 110: and acquiring an image to be analyzed of the target visual task.
The target vision tasks may include, for example, target detection, human pose estimation, or image classification tasks. The image to be analyzed may be a picture or a frame image in a video. The electronic device may obtain the image to be analyzed of the target visual task in real time from an image acquisition device (such as a camera, etc.), or obtain the image to be analyzed from an image stored in a memory.
Step 120: the image to be analyzed is input into the visual analysis model, and an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model is obtained, wherein the visual analysis model can be used for extracting the target object feature in the image features of the image to be analyzed based on the target visual task and generating the attribute sequence describing the target object feature. The attribute sequence provides a representation of a generic visual task object, and the elements in the attribute sequence can be assigned different meanings according to different target visual tasks. Therefore, different visual tasks can be unified into a sequence description task of an object in an image to be analyzed, and the difference of the different visual tasks in the output form is unified.
On this basis, the visual analysis model can be obtained by training on the sample images corresponding to the target visual task and the label data corresponding to those sample images. That is, for any target visual task, a visual analysis model for that task can be obtained simply by training with the corresponding sample images and label data, without changing the structure of the visual analysis model. The visual analysis model can therefore be conveniently extended to various different target visual tasks and has strong universality.
For example, suppose the target visual task is target detection, such as detecting pedestrians in an image. A first sample image may be obtained, first label information labeled on the pedestrians in the first sample image, and the labeled first sample image then used as a training sample to train the visual analysis model; the trained visual analysis model can be used for pedestrian detection.
For another example, if the target visual task is changed to human pose estimation, the visual analysis model may be retrained for the human pose estimation task. Specifically, a second sample image can be obtained, the human skeleton points in the second sample image labeled with second label information, and the labeled second sample image then used as a training sample to train the visual analysis model; the trained visual analysis model can be used for human pose estimation.
The length of the attribute sequence may be determined based on the target visual task. For example, when the target visual task is target detection, each target object needs to be represented by a category score, the position coordinates (x, y), and the width w and height h of the detection box, so the attribute sequence has a length of 5. As another example, when the target visual task is human pose estimation, each object needs to be represented by a confidence and 17 key points, so the attribute sequence has a length of 35 (one confidence plus 17 coordinate pairs).
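To make the two layouts concrete, the following minimal sketch (ours, not from the patent; the field names and the helper function are hypothetical) decodes a length-5 detection sequence and a length-35 pose sequence:

```python
# Hypothetical decoding of the two attribute-sequence layouts described above.
from typing import Any, Dict, List

def decode_attribute_sequence(task: str, seq: List[float]) -> Dict[str, Any]:
    if task == "detection":            # length 5: score, x, y, w, h
        assert len(seq) == 5
        score, x, y, w, h = seq
        return {"score": score, "box": (x, y, w, h)}
    if task == "pose":                 # length 35: confidence + 17 (x, y) pairs
        assert len(seq) == 35
        conf, coords = seq[0], seq[1:]
        keypoints = [(coords[2 * k], coords[2 * k + 1]) for k in range(17)]
        return {"confidence": conf, "keypoints": keypoints}
    raise ValueError(f"unknown task: {task}")

print(decode_attribute_sequence("detection", [0.92, 10.0, 20.0, 50.0, 80.0]))
```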
According to the image analysis method provided by the embodiment of the invention, the image to be analyzed for the target visual task is obtained and input into the visual analysis model, and the attribute sequence of each target object feature for the target visual task in the image to be analyzed, output by the visual analysis model, can be obtained. The visual analysis model is trained on sample images corresponding to the target visual task and label data corresponding to the sample images; it can extract, based on the target visual task, the target object features from the image features of the image to be analyzed and generate an attribute sequence describing each target object feature. Therefore, for any target visual task, the target object features corresponding to that task in the image to be analyzed can be used as the basic unit to obtain attribute sequences describing them. The attribute sequence provides a universal expression form for visual task targets, so different visual tasks can be unified as a sequence-description problem over objects in the image to be analyzed, and visual analysis models with the same structure can be used to process various different visual tasks. This improves the universality of the visual analysis model across visual tasks, improves visual task processing efficiency, and reduces development cost.
Based on the image analysis method of the embodiment corresponding to fig. 1, fig. 2 schematically shows the structure of the visual analysis model. As shown in fig. 2, the model may include an image feature coding layer, a self-attention decoding layer, and a sequence generation layer: the image feature coding layer is used for image feature extraction, the self-attention decoding layer extracts target object features from the image features, and the sequence generation layer converts the target object features into attribute sequences.
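As a hedged illustration of this three-stage structure (module interfaces, shapes, and names are our assumptions; the patent prescribes no code), the pipeline could be skeletonized as follows, with the sub-modules sketched later in this description:

```python
# Sketch of the visual analysis model of fig. 2 (PyTorch); shapes assumed.
import torch
from torch import nn

class VisualAnalysisModel(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, seq_gen, seq_len: int):
        super().__init__()
        self.encoder = encoder    # image feature coding layer
        self.decoder = decoder    # self-attention decoding layer
        self.seq_gen = seq_gen    # callable: (object feature, image features, T) -> (B, T)
        self.seq_len = seq_len    # T, fixed by the target visual task

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image)      # image features F, shape (B, HW, d)
        objects = self.decoder(feats)    # target object features, shape (B, N, d)
        seqs = [self.seq_gen(objects[:, i:i + 1], feats, self.seq_len)
                for i in range(objects.size(1))]
        return torch.stack(seqs, dim=1)  # (B, N, T): one attribute sequence per object
```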
Based on the structure of the visual analysis model in the embodiment corresponding to fig. 2, in an exemplary embodiment, fig. 3 exemplarily shows a second flowchart of the image analysis method provided by the present invention, and this exemplary embodiment provides an exemplary implementation of the above step 120, and as shown in fig. 3, the following steps 121 to 123 may be included.
Step 121: and inputting the image to be analyzed into an image characteristic coding layer of the visual analysis model to obtain the image characteristics of the image to be analyzed output by the image characteristic coding layer.
For example, when a target visual task is performed on a to-be-analyzed image I, the to-be-analyzed image I may be input to an image feature coding layer of a visual analysis model, and the image feature coding layer may extract image features of the to-be-analyzed image I by using a self-attention mechanism and output the image features of the to-be-analyzed image I.
For example, the image feature coding layer may include a residual network layer and a self-attention coding layer. The residual network layer may be used to map the image to be analyzed to an image feature space, and the self-attention coding layer may be used to perform self-attention coding, that is, to encode its input with a self-attention mechanism. Accordingly, step 121 may include: inputting the image to be analyzed into the residual network layer of the image feature coding layer of the visual analysis model, and obtaining the initial image features of the image to be analyzed output by the residual network layer; inputting the initial image features into the self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer.
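A minimal sketch of such an image feature coding layer, assuming a ResNet-50 backbone as the residual network layer and a standard transformer encoder as the self-attention coding layer (both choices and all hyperparameters are ours, not fixed by the patent):

```python
# Hedged sketch: residual network layer + self-attention coding layer.
import torch
from torch import nn
from torchvision.models import resnet50

class ImageFeatureEncoder(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        # residual network layer: drop avgpool/fc, keep the conv stages
        self.residual = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)  # map to feature space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_attn_encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.proj(self.residual(image))   # initial image features, (B, d, H/32, W/32)
        x = x.flatten(2).transpose(1, 2)      # (B, HW, d)
        return self.self_attn_encoder(x)      # image features F
```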
Step 122: and inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining target object features aiming at the target visual task in the image features output from the attention decoding layer.
The self-attention decoding layer can be used to extract target object features. For example, when the target visual task is target detection, such as detecting target pedestrians, after the image features are input into the self-attention decoding layer of the visual analysis model, the self-attention decoding layer can, based on the self-attention mechanism, enrich the higher-level detail information of the image features and extract the features of the target pedestrians from the image features.
Step 123: and inputting the target object characteristics into the sequence generation layer of the visual analysis model to obtain the attribute sequence of the target object characteristics output by the sequence generation layer.
The sequence generation layer may be configured to perform a serialization process on each target object feature to generate a series of attributes describing the target object, i.e., an attribute sequence, which may be presented in the form of a scalar number.
Specifically, after the target object features are obtained, each target object feature is fed into the sequence generation layer, which serializes the input target object feature, converting it from the image representation space to a scalar numerical space. This yields an attribute sequence for each target object feature and describes the target object of the target visual task with scalar numbers.
In one example embodiment, the sequence generation layer may output the attribute sequence of a target object feature in time order. Fig. 4 shows the structure of the sequence generation layer: as shown in fig. 4, it may include a sequence self-attention layer, an image mutual attention layer, and a linear layer. The sequence self-attention layer can be used to perform self-attention calculation, taking the input feature at the current moment as the query and the input features at the current moment and all previous moments as the key-value pairs; the image mutual attention layer can be used to perform mutual attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values; the linear layer can be used to convert the output of the image mutual attention layer into numerical values, yielding the attribute sequence.
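A hedged sketch of these three sub-layers (the use of nn.MultiheadAttention, the scalar-per-step linear head, and all sizes are our assumptions, not the patent's):

```python
# Sketch of the sequence generation layer of fig. 4 (PyTorch).
import torch
from torch import nn

class SequenceGenerationLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.seq_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.linear = nn.Linear(d_model, 1)   # numerical (scalar) output per step

    def step(self, x_t: torch.Tensor, x_hist: torch.Tensor, img_feats: torch.Tensor):
        # x_t: (B, 1, d) input feature at the current moment
        # x_hist: (B, t, d) input features at the current and all previous moments
        # img_feats: (B, HW, d) image features F of the image to be analyzed
        s_t, _ = self.seq_self_attn(x_t, x_hist, x_hist)          # sequence self-attention
        y_t, _ = self.img_cross_attn(s_t, img_feats, img_feats)   # image mutual attention
        a_t = self.linear(y_t)                                    # attribute element
        return y_t, a_t
```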
Based on the sequence generation layer in the embodiment corresponding to fig. 4, in the image analysis method provided by the embodiment of the present invention, in the process of analyzing the image to be analyzed I with the visual analysis model to obtain the attribute sequence of each target object feature for the target visual task, the image feature coding layer first extracts the image features F from the image to be analyzed I. A series of target object features O = {o_1, o_2, …, o_N} is then obtained with the self-attention decoding layer, where O represents the set of target objects output by the visual analysis model and each element o_i of the set is a target object. Next, each target object feature is fed into the sequence generation layer, which generates a series of attributes A = {a_1, a_2, …, a_T}, where T represents the number of attribute elements in the attribute sequence. Depending on the target visual task, the attribute elements are the visual task outputs that the target visual task expects for describing the target object; that is, the target object can be described by an attribute sequence related to the visual task, and the definition of the attribute elements in the attribute sequence can be changed flexibly according to the target visual task to meet diversified requirements. Based on this, the structure of the visual analysis model provided by the embodiment of the invention fully considers the hierarchical structure information in the image and can organize image features with 'objects' as the basic unit; meanwhile, the structure uses a new prediction layer, the sequence generation layer, which can convert target object features from the image representation space to a scalar numerical space, so the structure can be conveniently extended to various different visual tasks.
Based on the sequence generation layer of the embodiment of fig. 4, fig. 5 exemplarily shows a third flow chart of the image analysis method provided by the present invention, and this exemplary embodiment provides an exemplary implementation of the above step 123, and as shown in fig. 5, steps 1231 to 1233 may be included.
Step 1231: and inputting the output characteristic of the image mutual attention layer at the previous moment as the input characteristic of the current moment into the sequence self-attention layer to obtain the self-attention value of the current moment output by the sequence self-attention layer. Wherein, the input feature of the sequence from the initial moment of the attention layer is the target object feature.
Step 1232: and inputting the self-attention value at the current moment into the image mutual attention layer to obtain the output characteristic of the image mutual attention layer at the current moment.
Step 1233: and inputting the output characteristics of the current moment into the linear layer to obtain the attribute elements of the current moment output by the linear layer. The attribute elements are used to construct attribute sequences.
Based on the structure of the sequence generation layer of the embodiment corresponding to fig. 4 and the method of the embodiment corresponding to fig. 5, fig. 6 schematically illustrates the working principle of the sequence generation layer. Referring to fig. 6, the sequence generation layer takes a single target object feature o as input and outputs an attribute sequence of arbitrary length describing that target object. The target object feature o is a vector of dimension 1 × d, where d is a positive integer representing the feature dimension of the visual analysis model. The length and definition of the attribute sequence may vary from one target visual task to another. The sequence generation layer predicts and outputs the attribute sequence in time order. Specifically, after o is input into the sequence generation layer, the layer uses o as the input feature x_1 of the initial moment to compute the output feature y_1 of the initial moment. For each later moment t, the sequence generation layer takes the output feature of the previous moment as the new input feature x_t, i.e. x_t = y_{t-1}. The input feature at each moment is processed by the sequence self-attention layer and the image mutual attention layer in turn. Specifically, the sequence self-attention layer performs self-attention calculation on the input feature x_t at moment t, taking x_t as the query and the input features of moment t and all previous moments as the key-value pairs, to obtain the self-attention value s_t at moment t. Then s_t is input into the image mutual attention layer, which performs mutual attention calculation with the image features F of the image to be analyzed I as the keys and values, to obtain the output feature y_t at moment t. The output feature of each moment t is converted into a number by the linear layer to obtain the attribute element a_t output at that moment, finally yielding the attribute sequence A = {a_1, a_2, …, a_T} of length T. For example, the numerical processing of the linear layer may be to multiply the feature input to the linear layer by a weight matrix to obtain the corresponding attribute element.

Assuming the target visual task requires an attribute sequence of length T to be output, the input feature x_t, the self-attention value s_t, and the output feature y_t at each moment may be determined using formula one, formula two, and formula three, respectively:

Formula one: x_t = y_{t-1}

Formula two: s_t = SelfAttn(W_q · x_t, W_kv · X_{1:t})

Formula three: y_t = CrossAttn(s_t, F)

where t ∈ {1, 2, …, T} and T represents the number of attribute elements in the attribute sequence, one attribute element being output at each moment t; SelfAttn(·) denotes the self-attention computation function and CrossAttn(·) the mutual attention computation function; W_q denotes the query weight and W_kv the key-value weight; X_{1:t} denotes the concatenation matrix of the input feature x_t at moment t and the input features at all moments before t; and x_1 = o at the initial moment, with formula one applying for t > 1.
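A minimal sketch of this time-ordered generation loop, implementing formulas one to three with the SequenceGenerationLayer sketched earlier (shapes and the step interface are our assumptions):

```python
# Autoregressive generation of one attribute sequence (formulas one to three).
import torch

def generate_attribute_sequence(layer, obj_feat, img_feats, T: int):
    # obj_feat: (B, 1, d) single target object feature o, used as input x_1
    x_t, hist, attrs = obj_feat, obj_feat, []
    for _ in range(T):
        y_t, a_t = layer.step(x_t, hist, img_feats)  # formulas two and three
        attrs.append(a_t[:, 0, 0])                   # scalar attribute element a_t, (B,)
        x_t = y_t                                    # formula one: x_t = y_{t-1}
        hist = torch.cat([hist, x_t], dim=1)         # inputs x_1 ... x_t
    return torch.stack(attrs, dim=1)                 # (B, T) attribute sequence
```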
Based on the above working principle, the attribute sequence output by the sequence generation layer can be interpreted as any required meaning according to the specific visual task, and no network parameters related to the specific visual task are needed. That is, the sequence generation layer outputs a universal target representation, using the attribute sequence as the output of the visual task; the same visual analysis model structure can be generalized and applied to different visual tasks, and the visual analysis model has strong extensibility to multiple visual tasks.
The following describes an application of the image analysis method provided by the embodiment of the present invention to a specific visual task, taking a target detection task and a human body posture estimation task as examples.
FIG. 7 is a schematic diagram illustrating the working principle of the visual analysis model. Referring to fig. 7, the image to be analyzed I is input into the image feature coding layer, and the residual network layer and self-attention coding layer of the image feature coding layer first extract the image features F from the image to be analyzed I. F is then input into the self-attention decoding layer, and the set O of target object features output by the self-attention decoding layer is obtained. Each target object feature in O may then be input into the sequence generation layer, which generates the attribute sequence of each target object feature. Trained according to the requirements of the target visual task, each attribute element in the attribute sequence can be regarded as any required attribute, so as to meet the requirements of the target visual task. For the target detection task, the attribute sequence can be defined as the category score and the position and size of the detection rectangle; for the human pose estimation task, the attribute sequence can be defined as the confidence and the coordinates of each key point.
Fig. 8 is a schematic diagram of the operating logic of the sequence generation layer when executing the target detection task. Referring to fig. 8, for the target detection task each target object needs to be represented by an attribute sequence of length 5, whose attribute elements are the category score, the coordinates x and y of the target detection box, and its width w and height h.
Before the visual analysis model provided by the embodiment of the invention is used to execute the target detection task, it needs to be trained on the target detection task. In an example embodiment of the invention, the training objective may be constructed based on a bipartite matching loss. Specifically, the set Ô of all ground-truth target objects may be bipartitely matched with the set O of all output objects of the visual analysis model, so that each ground-truth target object corresponds to a unique output object, the matching objective being to minimize the sum of the losses over all training samples. It can be assumed that the target object matched to an output object o_i is ô_i; when ô_i is the empty set ∅, the output object is matched to no target object. During training of the visual analysis model, the loss between o_i and ô_i is optimized to be minimal.
Illustratively, the output category score may be supervised with the focal loss function L_focal, and the coordinate regression may be supervised with both the L1 loss function L_1 and the generalized intersection over union (GIoU) loss function L_GIoU. The L1 loss function takes the absolute value of the difference between the predicted value and the true value to minimize the error, and is also known as least absolute deviation (LAD).

Based on the above, for the target detection task, the target detection loss function L_det used in training the visual analysis model can be expressed as:

L_det = Σ_{i=1}^{N} L_i^det

where N represents the number of output objects when executing the target detection task, and L_i^det represents the target detection loss value of the i-th output object, which can be calculated by formula four.
Formula four can be expressed as:

L_i^det = L_focal(o_i, ô_i) + 1{ô_i ≠ ∅} · L_1(b_i, b̂_i) + 1{ô_i ≠ ∅} · L_GIoU(b_i, b̂_i)

where ô_i represents the i-th object labeled in the training sample; 1{·} represents the indicator function, whose value is 1 when the condition ô_i ≠ ∅ is true and 0 otherwise; L_focal(o_i, ô_i) represents the focal loss between o_i and ô_i; L_1(b_i, b̂_i) represents the L1 error between the predicted detection box b_i and the labeled detection box b̂_i; and L_GIoU(b_i, b̂_i) represents the generalized intersection-over-union loss between b_i and b̂_i.
When training the visual analysis model for a target detection task, a first sample image may be obtained and first label information labeled on the detection targets in the first sample image; the labeled first sample image is then used as a training sample, and the visual analysis model is trained with the loss function L_det. The trained visual analysis model can be used to detect the detection targets.
With reference to figs. 7 and 8, when performing target detection, the image to be analyzed may be input into the image feature coding layer for image feature extraction, the extracted image features input into the self-attention decoding layer to obtain the target object features, and each target object feature then input into the sequence generation layer for serialization, obtaining the attribute sequence of each target object feature. For example, for a target object feature o, o can be input into the sequence generation layer as the input feature x_1 of the initial moment and processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_1; after the linear layer transformation, y_1 yields the first attribute element, which may be defined as the classification score. Meanwhile, y_1 serves as the input feature x_2 of the next moment, is input into the sequence generation layer, and is processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_2; after the linear layer transformation, y_2 yields the second attribute element, which may be defined as the horizontal coordinate x of the target detection box. By analogy, the sequence generation layer may output in turn the vertical coordinate y of the target detection box and its width w and height h. For example, the detected target object is object a in fig. 7, and the corresponding attribute sequence is sequence 72.
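Putting the earlier sketches together, decoding one detection target would then look roughly like this (all tensors, shapes, and the untrained random weights are hypothetical; it relies on the SequenceGenerationLayer and generate_attribute_sequence sketches above):

```python
# Hypothetical end-to-end use for one detection target:
# T = 5 attribute elements, read back as score, x, y, w, h.
import torch

feats = torch.randn(1, 400, 256)   # image features F (assumed shape)
obj = torch.randn(1, 1, 256)       # one target object feature o from the decoding layer
layer = SequenceGenerationLayer(d_model=256)
seq = generate_attribute_sequence(layer, obj, feats, T=5)  # shape (1, 5)
score, x, y, w, h = seq[0].tolist()
print(f"score={score:.3f}, box=({x:.1f}, {y:.1f}, {w:.1f}, {h:.1f})")
```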
With reference to fig. 7, fig. 9 is a schematic diagram of the operating logic of the sequence generation layer when executing the human pose estimation task. Referring to fig. 9, for the human pose estimation task each target object needs to be represented by an attribute sequence of length 35, whose attribute elements are the confidence con and the coordinates (x_k, y_k) of 17 key points, where k is the index of the key point and (x_k, y_k) represents the coordinates of the k-th key point.
Before the visual analysis model provided by the embodiment of the invention is used to execute the human pose estimation task, the model needs to be trained on the human pose estimation task. In an exemplary embodiment of the invention, the training objective may be constructed with the bipartite matching loss; for the construction principle, refer to the description of the target detection task above, which is not repeated here.
Illustratively, the output confidence con may be supervised with the binary cross entropy (BCE) loss function L_BCE; the key point coordinates may be supervised with both the L1 loss function L_1 and the object keypoint similarity (OKS) loss function L_OKS.
Based on this, for the human pose estimation task, the pose estimation loss function L_pose used in training the visual analysis model can be expressed as:

L_pose = Σ_{i=1}^{M} L_i^pose

where M represents the number of output objects when executing the human pose estimation task, and L_i^pose represents the pose estimation loss value of the i-th output object, which can be calculated by formula five.

Formula five can be expressed as:

L_i^pose = L_BCE(c_i, 1{ô_i ≠ ∅}) + 1{ô_i ≠ ∅} · L_1(P_i, P̂_i) + 1{ô_i ≠ ∅} · L_OKS(P_i, P̂_i)

where 1{·} represents the indicator function, whose value is 1 when the condition ô_i ≠ ∅ is true and 0 otherwise; L_BCE(c_i, 1{ô_i ≠ ∅}) represents the binary cross entropy loss between the predicted confidence c_i and the indicator function; L_1(P_i, P̂_i) represents the L1 error between the predicted key points P_i and the labeled key points P̂_i; and L_OKS(P_i, P̂_i) represents the keypoint similarity loss between P_i and P̂_i.
When training the visual analysis model for the human pose estimation task, a second sample image may be obtained and second label information labeled on the human key points in the second sample image; the labeled second sample image is then used as a training sample, and the visual analysis model is trained with the loss function L_pose. The trained visual analysis model can be used to detect human key points.
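A hedged sketch of such a pose loss, with a simplified OKS-style similarity term (the exact OKS formulation, shapes, and constants here are our assumptions, not the patent's):

```python
# Sketch of the pose estimation loss of formula five (simplified OKS term).
import torch
import torch.nn.functional as F

def oks_loss(kpts, gt_kpts, area, sigmas):
    # kpts, gt_kpts: (K, 2); area: object scale; sigmas: (K,) per-keypoint constants
    d2 = ((kpts - gt_kpts) ** 2).sum(-1)
    oks = torch.exp(-d2 / (2 * area * sigmas ** 2)).mean()
    return 1 - oks   # higher similarity -> lower loss

def pose_loss(conf, kpts, gt_kpts, matched, area, sigmas):
    # conf: (M,) logits; kpts, gt_kpts: (M, K, 2); matched: (M,) bool
    bce = F.binary_cross_entropy_with_logits(conf, matched.float(), reduction="sum")
    total = bce
    for i in torch.nonzero(matched).flatten():
        total = total + F.l1_loss(kpts[i], gt_kpts[i], reduction="sum")
        total = total + oks_loss(kpts[i], gt_kpts[i], area[i], sigmas)
    return total
```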
With reference to figs. 7 and 9, when estimating human pose, the image to be analyzed may be input into the image feature coding layer for image feature extraction, the extracted image features input into the self-attention decoding layer to obtain the target object features, and each target object feature then input into the sequence generation layer for serialization, obtaining the attribute sequence of each target object feature. For example, for a target object feature o, o can be input into the sequence generation layer as the input feature x_1 of the initial moment and processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_1; after the linear layer transformation, y_1 yields the first attribute element, which may be defined as the confidence con. Meanwhile, y_1 serves as the input feature x_2 of the next moment, is input into the sequence generation layer, and is processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_2; after the linear layer transformation, y_2 yields the second attribute element, which may be defined as the horizontal coordinate of the position of the first human key point. By analogy, the sequence generation layer can continue to output in turn the vertical coordinate of the position of the first human key point and the coordinates of the other human key point positions. For example, the detected human key points are the key points of human body 73 in fig. 7, and the corresponding attribute sequence is sequence 74.
According to the image analysis method provided by the embodiment of the invention, the image to be analyzed for the target visual task is obtained and input into the visual analysis model, which is trained on sample images corresponding to the target visual task and label data corresponding to the sample images, and the attribute sequence of each target object feature for the target visual task in the image to be analyzed, output by the visual analysis model, can be obtained. On the one hand, various different visual tasks can be redefined as the problem of generating an attribute sequence describing each target object, unifying the task output forms of different visual tasks; the output attribute sequence can be flexibly defined, according to the target visual task, with whatever meaning is required to describe the target object, meeting diversified requirements. Visual analysis models with the same structure can thus process various different visual tasks, which improves the universality of the visual analysis model across visual tasks, improves visual task processing efficiency, and reduces development cost. On the other hand, the hierarchical structure information in the image to be analyzed is fully considered, the image features can be organized with 'objects' as the basic unit, and the accuracy of image feature extraction is improved.
The following describes the image analysis apparatus provided by the present invention, and the image analysis apparatus described below and the image analysis method described above may be referred to in correspondence with each other.
Fig. 10 schematically illustrates a structural diagram of an image analysis apparatus according to an embodiment of the present invention, and referring to fig. 10, the image analysis apparatus 1000 may include an obtaining module 1010 and an analyzing module 1020. The obtaining module 1010 may be configured to obtain an image to be analyzed of a target visual task; the analysis module 1020 may be configured to input the image to be analyzed into the visual analysis model, and obtain an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; the visual analysis model is used for extracting target object features in image features of an image to be analyzed based on a target visual task and generating an attribute sequence describing the target object features; the visual analysis model is obtained by training based on the sample image corresponding to the target visual task and the label data corresponding to the sample image.
In an example embodiment, the analysis module 1020 may include: the first analysis unit is used for inputting the image to be analyzed into an image characteristic coding layer of the visual analysis model and obtaining the image characteristics of the image to be analyzed output by the image characteristic coding layer; the second analysis unit is used for inputting the image characteristics into a self-attention decoding layer of the visual analysis model and obtaining target object characteristics aiming at the target visual task in the image characteristics output by the self-attention decoding layer; and the third analysis unit is used for inputting the target object characteristics into the sequence generation layer of the visual analysis model and obtaining the attribute sequence of the target object characteristics output by the sequence generation layer.
In an example embodiment, the sequence generation layer may output the attribute sequence of the target object feature in time series; the sequence generation layer may include a sequence self-attention layer, an image mutual attention layer, and a linear layer. The sequence self-attention layer can be used for performing self-attention calculation by taking the input features at the current moment as query and the input features at the current moment and all the previous input features as key values; the image mutual attention layer can be used for carrying out mutual attention calculation on the output of the sequence self-attention layer by taking the image characteristics as key values; the linear layer may be used to perform a digitization process on the output of the image mutual attention layer to obtain an attribute sequence.
In an example embodiment, the third analysis unit may include: the first analysis subunit may be configured to input, as the input feature at the current time, the output feature at the previous time of the image mutual attention layer into the sequence self-attention layer, and obtain a self-attention value at the current time, where the input feature at the initial time of the sequence self-attention layer is a target object feature; the second analysis subunit is configured to input the self-attention value at the current time into the image mutual-attention layer, and obtain an output feature of the image mutual-attention layer at the current time; and the third analysis subunit is configured to input the output characteristic at the current time into the linear layer, to obtain attribute elements output by the linear layer at the current time, where the attribute elements form an attribute sequence.
In an example embodiment, the first analysis unit may include: the fourth analysis subunit is configured to input the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model, and obtain the initial image features of the image to be analyzed output by the residual network layer, where the residual network layer may be used to map the image to be analyzed to an image feature space; and the fifth analysis subunit is configured to input the initial image features into the self-attention coding layer of the image feature coding layer, and obtain the image features of the image to be analyzed output by the self-attention coding layer, where the self-attention coding layer may be used for self-attention coding.
In an example embodiment, the length of the sequence of attributes output by the visual analytics model is determined based on the target visual task.
Fig. 11 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 11, the electronic device 1100 may include a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 communicate with one another via the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
In addition, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a standalone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program that may be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method described in the various embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An image analysis method, comprising:
acquiring an image to be analyzed of a target visual task;
inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, of each target object feature for the target visual task in the image to be analyzed;
wherein the visual analysis model is used for extracting target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the attribute sequence is presented in the form of scalar numbers;
the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
2. The image analysis method according to claim 1, wherein the inputting the image to be analyzed into a visual analysis model, and obtaining the attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model comprises:
inputting the image to be analyzed into an image feature coding layer of the visual analysis model, and obtaining the image features of the image to be analyzed output by the image feature coding layer;
inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining the target object features for the target visual task in the image features output by the self-attention decoding layer; and
inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer.
3. The image analysis method according to claim 2, wherein the sequence generation layer outputs the attribute sequence of the target object features step by step over time; the sequence generation layer comprises a sequence self-attention layer, an image mutual attention layer, and a linear layer;
the sequence self-attention layer is used for performing self-attention calculation, taking the input feature at the current time step as the query and the input features at the current and all previous time steps as the keys and values;
the image mutual attention layer is used for performing mutual-attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values; and
the linear layer is used for numerically processing the output of the image mutual attention layer to obtain the attribute sequence.
4. The image analysis method according to claim 3, wherein the inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer comprises:
inputting the output feature of the image mutual attention layer at the previous time step into the sequence self-attention layer as the input feature at the current time step, and obtaining the self-attention value of the sequence self-attention layer at the current time step, wherein the input feature of the sequence self-attention layer at the initial time step is the target object feature;
inputting the self-attention value at the current time step into the image mutual attention layer, and obtaining the output feature of the image mutual attention layer at the current time step; and
inputting the output feature at the current time step into the linear layer, and obtaining the attribute element output by the linear layer at the current time step, the attribute elements forming the attribute sequence.
5. The image analysis method according to claim 2, wherein the inputting the image to be analyzed into an image feature coding layer of the visual analysis model, and obtaining the image features of the image to be analyzed output by the image feature coding layer comprises:
inputting the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model, and obtaining the initial image features of the image to be analyzed output by the residual network layer, wherein the residual network layer is used for mapping the image to be analyzed into an image feature space; and
inputting the initial image features into a self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer, wherein the self-attention coding layer is used for self-attention coding.
6. The image analysis method according to any one of claims 1 to 5, wherein the length of the attribute sequence is determined based on the target visual task.
7. An image analysis apparatus, comprising:
the acquisition module is used for acquiring an image to be analyzed of the target visual task;
the analysis module is used for inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, of each target object feature for the target visual task in the image to be analyzed;
wherein the visual analysis model is used for extracting target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the attribute sequence is presented in the form of scalar numbers; and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
8. The image analysis apparatus according to claim 7, wherein the analysis module comprises:
the first analysis unit is used for inputting the image to be analyzed into an image feature coding layer of the visual analysis model and obtaining the image features of the image to be analyzed output by the image feature coding layer;
the second analysis unit is used for inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining the target object features for the target visual task in the image features output by the self-attention decoding layer; and
the third analysis unit is used for inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image analysis method according to any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the image analysis method according to any one of claims 1 to 6.
CN202210851146.1A 2022-07-20 2022-07-20 Image analysis method and device and electronic equipment Active CN115082430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210851146.1A CN115082430B (en) 2022-07-20 2022-07-20 Image analysis method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115082430A CN115082430A (en) 2022-09-20
CN115082430B true CN115082430B (en) 2022-12-06

Family

ID=83259426

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113449538A (en) * 2020-03-24 2021-09-28 顺丰科技有限公司 Visual model training method, device, equipment and storage medium
CN113705325A (en) * 2021-06-30 2021-11-26 天津大学 Deformable single-target tracking method and device based on dynamic compact memory embedding
CN113761888A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Text translation method and device, computer equipment and storage medium
CN113792112A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language task processing system, training method, device, equipment and medium
CN114529758A (en) * 2022-01-25 2022-05-24 哈尔滨工业大学 Multi-modal emotion analysis method based on contrast learning and multi-head self-attention mechanism
CN114550313A (en) * 2022-02-18 2022-05-27 北京百度网讯科技有限公司 Image processing method, neural network, and training method, device, and medium thereof
CN114638960A (en) * 2022-03-22 2022-06-17 平安科技(深圳)有限公司 Model training method, image description generation method and device, equipment and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489627B1 (en) * 2008-08-28 2013-07-16 Adobe Systems Incorporated Combined semantic description and visual attribute search
JP6670698B2 (en) * 2016-07-04 2020-03-25 日本電信電話株式会社 Image recognition model learning device, image recognition device, method, and program
US11195048B2 (en) * 2020-01-23 2021-12-07 Adobe Inc. Generating descriptions of image relationships
CN112257727B (en) * 2020-11-03 2023-10-27 西南石油大学 Feature image extraction method based on deep learning self-adaptive deformable convolution
CN112418330A (en) * 2020-11-26 2021-02-26 河北工程大学 Improved SSD (solid State drive) -based high-precision detection method for small target object
CN114332479A (en) * 2021-12-23 2022-04-12 浪潮(北京)电子信息产业有限公司 Training method of target detection model and related device
CN114443763A (en) * 2022-01-06 2022-05-06 山东大学 Big data synchronization method based on distributed network
CN114581682A (en) * 2022-02-22 2022-06-03 阿里巴巴(中国)有限公司 Image feature extraction method, device and equipment based on self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant