CN115082430B - Image analysis method and device and electronic equipment - Google Patents

Image analysis method and device and electronic equipment

Info

Publication number
CN115082430B
CN115082430B
Authority
CN
China
Prior art keywords
image
layer
visual
sequence
analyzed
Legal status
Active
Application number
CN202210851146.1A
Other languages
Chinese (zh)
Other versions
CN115082430A
Inventor
朱优松
陈志扬
赵朝阳
李朝闻
王金桥
唐明
Current Assignee
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202210851146.1A
Publication of CN115082430A
Application granted
Publication of CN115082430B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 9/00 Image coding
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Abstract

The invention provides an image analysis method, an image analysis device, and electronic equipment, relating to the technical field of computer vision. The image analysis method comprises: acquiring an image to be analyzed for a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, for each target object feature in the image to be analyzed that is relevant to the target visual task. The visual analysis model is used to extract the target object features from the image features of the image to be analyzed based on the target visual task, and to generate an attribute sequence describing each target object feature; the model is trained on sample images corresponding to the target visual task and on label data corresponding to those sample images. The technical scheme provided by the invention unifies different visual tasks as a sequence-description problem over objects in the image to be analyzed, so that one visual analysis model is universal across visual tasks, improving the processing efficiency of visual tasks and reducing development cost.

Description

Image analysis method and device and electronic equipment
Technical Field
The invention relates to the technical field of computer vision, in particular to an image analysis method and device and electronic equipment.
Background
Computer vision is the science of using computers to simulate the human visual system, giving computers capabilities similar to the human extraction, processing, understanding, and analysis of images and image sequences. It plays an increasingly important role in fields such as security, manufacturing, government affairs, and medical treatment.
In computer vision, unlike natural language, which can be modeled as sequence-to-sequence problems, the definitions of vision tasks usually differ greatly, and different vision tasks require different model structures to be designed and processed separately. Although unified visual analysis models have been proposed to handle multiple visual tasks, such as multi-task learning, these methods mainly train the dedicated model structures of multiple tasks together with a shared backbone network; they can only handle a limited and fixed combination of tasks, and once a new task appears, a new model structure must be additionally designed and retrained, so task processing efficiency is low and development cost is high.
Disclosure of Invention
The invention provides an image analysis method, an image analysis device, and electronic equipment, which address the defects of low processing efficiency and high development cost caused by the need, in the prior art, to process different visual tasks with models of different structures, and which realize a unified serialized representation of the output results of various visual tasks.
The invention provides an image analysis method, which comprises the following steps:
acquiring an image to be analyzed of a target visual task;
inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model;
the visual analysis model is used for extracting target object features in image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features;
the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
According to an image analysis method provided by the present invention, the inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for a target visual task in the image to be analyzed output by the visual analysis model, includes:
inputting the image to be analyzed into an image feature coding layer of the visual analysis model to obtain the image features of the image to be analyzed output by the image feature coding layer;
inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining target object features aiming at the target visual task in the image features output by the self-attention decoding layer;
and inputting the target object characteristics into a sequence generation layer of the visual analysis model, and obtaining an attribute sequence of the target object characteristics output by the sequence generation layer.
According to the image analysis method provided by the invention, the sequence generation layer outputs the attribute sequence of the target object feature in time order; the sequence generation layer comprises a sequence self-attention layer, an image mutual attention layer, and a linear layer;
the sequence self-attention layer is used for performing self-attention calculation, taking the input feature at the current moment as the query, and the input features at the current moment and all previous moments as the keys and values;
the image mutual attention layer is used for performing mutual attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values;
and the linear layer is used for converting the output of the image mutual attention layer into numerical values to obtain the attribute sequence.
According to an image analysis method provided by the present invention, the inputting the target object feature into a sequence generation layer of the visual analysis model, and obtaining a sequence of attributes of the target object feature output by the sequence generation layer, includes:
inputting the output characteristic of the image mutual attention layer at the previous moment as the input characteristic of the current moment into the sequence self-attention layer to obtain the self-attention value of the sequence self-attention layer at the current moment, wherein the input characteristic of the sequence self-attention layer at the initial moment is the characteristic of the target object;
inputting the self-attention value of the current moment into the image mutual attention layer to obtain the output characteristic of the current moment output by the image mutual attention layer;
and inputting the output characteristics of the current moment into the linear layer to obtain attribute elements of the current moment output by the linear layer, wherein the attribute elements form the attribute sequence.
According to an image analysis method provided by the present invention, the inputting the image to be analyzed into an image feature coding layer of the visual analysis model to obtain the image feature of the image to be analyzed output by the image feature coding layer includes:
inputting the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model to obtain an initial image feature of the image to be analyzed output by the residual network layer, wherein the residual network layer is used for mapping the image to be analyzed to an image feature space;
inputting the initial image features into a self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer, wherein the self-attention coding layer is used for self-attention coding.
According to an image analysis method provided by the present invention, the length of the attribute sequence is determined based on the target visual task.
The present invention also provides an image analysis apparatus comprising:
the acquisition module is used for acquiring an image to be analyzed of the target vision task;
the analysis module is used for inputting the image to be analyzed into a visual analysis model and obtaining an attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model;
the visual analysis model is used for extracting target object features in the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
According to an image analysis apparatus provided by the present invention, the analysis module includes:
the first analysis unit is used for inputting the image to be analyzed into an image feature coding layer of the visual analysis model and obtaining the image features of the image to be analyzed output by the image feature coding layer;
a second analysis unit, configured to input the image feature into a self-attention decoding layer of the visual analysis model, and obtain a target object feature for the target visual task in the image feature output by the self-attention decoding layer;
and the third analysis unit is used for inputting the target object characteristics into the sequence generation layer of the visual analysis model and obtaining the attribute sequence of the target object characteristics output by the sequence generation layer.
The present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the image analysis method as described in any one of the above methods when executing the computer program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image analysis method as described in any of the above.
According to the image analysis method, the image analysis device and the electronic equipment, the image to be analyzed is obtained and is input into the visual analysis model, so that the attribute sequence of each target object characteristic aiming at the target visual task in the image to be analyzed output by the visual analysis model can be obtained. The visual analysis model is obtained based on a sample image corresponding to a target visual task and label data corresponding to the sample image, the target object characteristics in the image characteristics of the image to be analyzed can be extracted based on the target visual task, and an attribute sequence describing the characteristics of the target object is generated.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an image analysis method provided by the present invention;
FIG. 2 is a schematic diagram of a visual analysis model provided by the present invention;
FIG. 3 is a second schematic flow chart of an image analysis method provided by the present invention;
FIG. 4 is a schematic diagram of the structure of a sequence generation layer provided by the present invention;
FIG. 5 is a third schematic flow chart of an image analysis method provided by the present invention;
FIG. 6 is a schematic diagram of the operation of the sequence generation layer provided by the present invention;
FIG. 7 is a schematic diagram of the working principle of the visual analysis model provided by the present invention;
FIG. 8 is a schematic diagram of the logic for the sequence generation layer provided by the present invention in performing object detection tasks;
FIG. 9 is a schematic diagram of the operating logic of the sequence generating layer provided by the present invention when performing a human body posture estimation task;
FIG. 10 is a schematic diagram of an image analysis apparatus provided in the present invention;
fig. 11 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In natural language processing, many large general-purpose language models have emerged. Language tasks can often be modeled as sequence-to-sequence (Seq2Seq) generation problems, and a large number of different problems can be handled with the same self-attention architecture, such as the Transformer model. Large-scale pre-trained language models can be adapted with a small amount of training data and applied to a large number of language tasks.
In view of these significant achievements in the field of natural language processing, a similar general model structure designed for visual tasks, capable of processing multiple visual tasks, would help to improve the processing efficiency of visual tasks and reduce the development cost of visual task processing.
However, in the field of computer vision, the definitions of visual tasks often differ greatly. Unlike natural language, which is composed of sequences, visual tasks are difficult to represent uniformly in the same form, so different model structures need to be designed and processed separately when facing different visual tasks.
In the related art, most unified large visual models used for visual tasks are pre-trained backbone network models. These can be used to extract image features, but an additional task-specific structure still needs to be designed for each visual task. Multi-task learning is a way to train the dedicated model structures of multiple tasks together with a shared backbone network model; it can only process a limited and fixed combination of tasks, and once a new task appears, a new model structure must be additionally designed and retrained.
Based on this, an embodiment of the present invention provides an image analysis method that obtains an image to be analyzed for a target visual task and inputs it into a visual analysis model. The visual analysis model extracts, based on the target visual task, the target object features from the image features of the image to be analyzed, generates an attribute sequence describing each target object feature, and outputs, for each target object feature relevant to the target visual task, the corresponding attribute sequence. The visual analysis model is trained on sample images corresponding to the target visual task and label data corresponding to the sample images. The attribute sequence provides a general expression form for visual task targets, so different visual tasks can be unified as a sequence-description task over objects in the image to be analyzed, improving the universality of the visual analysis model across visual tasks.
The image analysis method of the present invention is described below with reference to fig. 1 to 9. The image analysis method can be applied to electronic equipment such as a server, a mobile phone, a computer and the like, and can also be applied to an image analysis device arranged in the electronic equipment such as the server, the mobile phone, the computer and the like, and the image analysis device can be realized by software or combination of the software and hardware.
Fig. 1 schematically illustrates one of the flow diagrams of the image analysis method provided by the embodiment of the present invention, and referring to fig. 1, the image analysis method may include the following steps 110 to 120.
Step 110: and acquiring an image to be analyzed of the target visual task.
The target vision tasks may include, for example, target detection, human pose estimation, or image classification tasks. The image to be analyzed may be a picture or a frame image in a video. The electronic device may obtain the image to be analyzed of the target visual task in real time from an image acquisition device (such as a camera, etc.), or obtain the image to be analyzed from an image stored in a memory.
Step 120: the image to be analyzed is input into the visual analysis model, and an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model is obtained, wherein the visual analysis model can be used for extracting the target object feature in the image features of the image to be analyzed based on the target visual task and generating the attribute sequence describing the target object feature. The attribute sequence provides a representation of a generic visual task object, and the elements in the attribute sequence can be assigned different meanings according to different target visual tasks. Therefore, different visual tasks can be unified into a sequence description task of an object in an image to be analyzed, and the difference of the different visual tasks in the output form is unified.
On this basis, the visual analysis model can be obtained by training on the sample images corresponding to the target visual task and the label data corresponding to those sample images. That is, for any target visual task, a visual analysis model for that task can be obtained simply by training with the corresponding sample images and label data, without changing the structure of the visual analysis model. The visual analysis model can therefore be conveniently extended to various different target visual tasks and has strong universality.
For example, suppose the target visual task is target detection, such as detecting pedestrians in an image. A first sample image may be obtained, first label information labeled on the pedestrians in the first sample image, and the labeled first sample image then used as a training sample to train the visual analysis model; the trained visual analysis model can be used for pedestrian detection.
For another example, if the target visual task is changed to human pose estimation, the visual analysis model may be retrained for the human pose estimation task. Specifically, a second sample image can be obtained, the human skeleton points in the second sample image labeled with second label information, and the labeled second sample image then used as a training sample to train the visual analysis model; the trained visual analysis model can be used for human pose estimation.
The length of the attribute sequence may be determined based on the target visual task. For example, when the target visual task is target detection, each target object needs to be represented by a category score, the position coordinates (x, y), and the width w and height h of the detection box, so the attribute sequence has a length of 5. As another example, when the target visual task is human pose estimation, each object needs to be represented by a confidence and 17 key points, so the attribute sequence has a length of 35 (one confidence plus 17 coordinate pairs).
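To make the two layouts concrete, the following minimal sketch (ours, not from the patent; the field names and the helper function are hypothetical) decodes a length-5 detection sequence and a length-35 pose sequence:

```python
# Hypothetical decoding of the two attribute-sequence layouts described above.
from typing import Any, Dict, List

def decode_attribute_sequence(task: str, seq: List[float]) -> Dict[str, Any]:
    if task == "detection":            # length 5: score, x, y, w, h
        assert len(seq) == 5
        score, x, y, w, h = seq
        return {"score": score, "box": (x, y, w, h)}
    if task == "pose":                 # length 35: confidence + 17 (x, y) pairs
        assert len(seq) == 35
        conf, coords = seq[0], seq[1:]
        keypoints = [(coords[2 * k], coords[2 * k + 1]) for k in range(17)]
        return {"confidence": conf, "keypoints": keypoints}
    raise ValueError(f"unknown task: {task}")

print(decode_attribute_sequence("detection", [0.92, 10.0, 20.0, 50.0, 80.0]))
```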
According to the image analysis method provided by the embodiment of the invention, the image to be analyzed for the target visual task is obtained and input into the visual analysis model, and the attribute sequence of each target object feature for the target visual task in the image to be analyzed, output by the visual analysis model, can be obtained. The visual analysis model is trained on sample images corresponding to the target visual task and label data corresponding to the sample images; it can extract, based on the target visual task, the target object features from the image features of the image to be analyzed and generate an attribute sequence describing each target object feature. Therefore, for any target visual task, the target object features corresponding to that task in the image to be analyzed can be used as the basic unit to obtain attribute sequences describing them. The attribute sequence provides a universal expression form for visual task targets, so different visual tasks can be unified as a sequence-description problem over objects in the image to be analyzed, and visual analysis models with the same structure can be used to process various different visual tasks. This improves the universality of the visual analysis model across visual tasks, improves visual task processing efficiency, and reduces development cost.
Based on the image analysis method of the embodiment corresponding to fig. 1, fig. 2 schematically shows the structure of the visual analysis model. As shown in fig. 2, the model may include an image feature coding layer, a self-attention decoding layer, and a sequence generation layer: the image feature coding layer is used for image feature extraction, the self-attention decoding layer extracts target object features from the image features, and the sequence generation layer converts the target object features into attribute sequences.
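As a hedged illustration of this three-stage structure (module interfaces, shapes, and names are our assumptions; the patent prescribes no code), the pipeline could be skeletonized as follows, with the sub-modules sketched later in this description:

```python
# Sketch of the visual analysis model of fig. 2 (PyTorch); shapes assumed.
import torch
from torch import nn

class VisualAnalysisModel(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, seq_gen, seq_len: int):
        super().__init__()
        self.encoder = encoder    # image feature coding layer
        self.decoder = decoder    # self-attention decoding layer
        self.seq_gen = seq_gen    # callable: (object feature, image features, T) -> (B, T)
        self.seq_len = seq_len    # T, fixed by the target visual task

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image)      # image features F, shape (B, HW, d)
        objects = self.decoder(feats)    # target object features, shape (B, N, d)
        seqs = [self.seq_gen(objects[:, i:i + 1], feats, self.seq_len)
                for i in range(objects.size(1))]
        return torch.stack(seqs, dim=1)  # (B, N, T): one attribute sequence per object
```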
Based on the structure of the visual analysis model in the embodiment corresponding to fig. 2, in an exemplary embodiment, fig. 3 exemplarily shows a second flowchart of the image analysis method provided by the present invention, and this exemplary embodiment provides an exemplary implementation of the above step 120, and as shown in fig. 3, the following steps 121 to 123 may be included.
Step 121: and inputting the image to be analyzed into an image characteristic coding layer of the visual analysis model to obtain the image characteristics of the image to be analyzed output by the image characteristic coding layer.
For example, when a target visual task is performed on a to-be-analyzed image I, the to-be-analyzed image I may be input to an image feature coding layer of a visual analysis model, and the image feature coding layer may extract image features of the to-be-analyzed image I by using a self-attention mechanism and output the image features of the to-be-analyzed image I.
For example, the image feature coding layer may include a residual network layer and a self-attention coding layer. The residual network layer may be used to map the image to be analyzed to an image feature space, and the self-attention coding layer may be used to perform self-attention coding, that is, to encode its input with a self-attention mechanism. Accordingly, step 121 may include: inputting the image to be analyzed into the residual network layer of the image feature coding layer of the visual analysis model, and obtaining the initial image features of the image to be analyzed output by the residual network layer; inputting the initial image features into the self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer.
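A minimal sketch of such an image feature coding layer, assuming a ResNet-50 backbone as the residual network layer and a standard transformer encoder as the self-attention coding layer (both choices and all hyperparameters are ours, not fixed by the patent):

```python
# Hedged sketch: residual network layer + self-attention coding layer.
import torch
from torch import nn
from torchvision.models import resnet50

class ImageFeatureEncoder(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        # residual network layer: drop avgpool/fc, keep the conv stages
        self.residual = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)  # map to feature space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_attn_encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.proj(self.residual(image))   # initial image features, (B, d, H/32, W/32)
        x = x.flatten(2).transpose(1, 2)      # (B, HW, d)
        return self.self_attn_encoder(x)      # image features F
```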
Step 122: and inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining target object features aiming at the target visual task in the image features output from the attention decoding layer.
The self-attention decoding layer can be used to extract target object features. For example, when the target visual task is target detection, such as detecting target pedestrians, after the image features are input into the self-attention decoding layer of the visual analysis model, the self-attention decoding layer can, based on the self-attention mechanism, enrich the higher-level detail information of the image features and extract the features of the target pedestrians from the image features.
Step 123: and inputting the target object characteristics into the sequence generation layer of the visual analysis model to obtain the attribute sequence of the target object characteristics output by the sequence generation layer.
The sequence generation layer may be configured to perform a serialization process on each target object feature to generate a series of attributes describing the target object, i.e., an attribute sequence, which may be presented in the form of a scalar number.
Specifically, after the target object features are obtained, each target object feature is fed into the sequence generation layer, which serializes the input target object feature, converting it from the image representation space to a scalar numerical space. This yields an attribute sequence for each target object feature and describes the target object of the target visual task with scalar numbers.
In one example embodiment, the sequence generation layer may output the attribute sequence of a target object feature in time order. Fig. 4 shows the structure of the sequence generation layer: as shown in fig. 4, it may include a sequence self-attention layer, an image mutual attention layer, and a linear layer. The sequence self-attention layer can be used to perform self-attention calculation, taking the input feature at the current moment as the query and the input features at the current moment and all previous moments as the key-value pairs; the image mutual attention layer can be used to perform mutual attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values; the linear layer can be used to convert the output of the image mutual attention layer into numerical values, yielding the attribute sequence.
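A hedged sketch of these three sub-layers (the use of nn.MultiheadAttention, the scalar-per-step linear head, and all sizes are our assumptions, not the patent's):

```python
# Sketch of the sequence generation layer of fig. 4 (PyTorch).
import torch
from torch import nn

class SequenceGenerationLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.seq_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.linear = nn.Linear(d_model, 1)   # numerical (scalar) output per step

    def step(self, x_t: torch.Tensor, x_hist: torch.Tensor, img_feats: torch.Tensor):
        # x_t: (B, 1, d) input feature at the current moment
        # x_hist: (B, t, d) input features at the current and all previous moments
        # img_feats: (B, HW, d) image features F of the image to be analyzed
        s_t, _ = self.seq_self_attn(x_t, x_hist, x_hist)          # sequence self-attention
        y_t, _ = self.img_cross_attn(s_t, img_feats, img_feats)   # image mutual attention
        a_t = self.linear(y_t)                                    # attribute element
        return y_t, a_t
```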
Based on the sequence generation layer in the embodiment corresponding to fig. 4, in the image analysis method provided by the embodiment of the present invention, in the process of analyzing the image to be analyzed I with the visual analysis model to obtain the attribute sequence of each target object feature for the target visual task, the image feature coding layer first extracts the image features F from the image to be analyzed I. A series of target object features O = {o_1, o_2, …, o_N} is then obtained with the self-attention decoding layer, where O represents the set of target objects output by the visual analysis model and each element o_i of the set is a target object. Next, each target object feature is fed into the sequence generation layer, which generates a series of attributes A = {a_1, a_2, …, a_T}, where T represents the number of attribute elements in the attribute sequence. Depending on the target visual task, the attribute elements are the visual task outputs that the target visual task expects for describing the target object; that is, the target object can be described by an attribute sequence related to the visual task, and the definition of the attribute elements in the attribute sequence can be changed flexibly according to the target visual task to meet diversified requirements. Based on this, the structure of the visual analysis model provided by the embodiment of the invention fully considers the hierarchical structure information in the image and can organize image features with 'objects' as the basic unit; meanwhile, the structure uses a new prediction layer, the sequence generation layer, which can convert target object features from the image representation space to a scalar numerical space, so the structure can be conveniently extended to various different visual tasks.
Based on the sequence generation layer of the embodiment of fig. 4, fig. 5 exemplarily shows a third flow chart of the image analysis method provided by the present invention, and this exemplary embodiment provides an exemplary implementation of the above step 123, and as shown in fig. 5, steps 1231 to 1233 may be included.
Step 1231: and inputting the output characteristic of the image mutual attention layer at the previous moment as the input characteristic of the current moment into the sequence self-attention layer to obtain the self-attention value of the current moment output by the sequence self-attention layer. Wherein, the input feature of the sequence from the initial moment of the attention layer is the target object feature.
Step 1232: and inputting the self-attention value at the current moment into the image mutual attention layer to obtain the output characteristic of the image mutual attention layer at the current moment.
Step 1233: and inputting the output characteristics of the current moment into the linear layer to obtain the attribute elements of the current moment output by the linear layer. The attribute elements are used to construct attribute sequences.
Based on the structure of the sequence generation layer of the embodiment corresponding to fig. 4 and the method of the embodiment corresponding to fig. 5, fig. 6 schematically illustrates the working principle of the sequence generation layer. Referring to fig. 6, the sequence generation layer takes a single target object feature o as input and outputs an attribute sequence of arbitrary length describing that target object. The target object feature o is a vector of dimension 1 × d, where d is a positive integer representing the feature dimension of the visual analysis model. The length and definition of the attribute sequence may vary from one target visual task to another. The sequence generation layer predicts and outputs the attribute sequence in time order. Specifically, after o is input into the sequence generation layer, the layer uses o as the input feature x_1 of the initial moment to compute the output feature y_1 of the initial moment. For each later moment t, the sequence generation layer takes the output feature of the previous moment as the new input feature x_t, i.e. x_t = y_{t-1}. The input feature at each moment is processed by the sequence self-attention layer and the image mutual attention layer in turn. Specifically, the sequence self-attention layer performs self-attention calculation on the input feature x_t at moment t, taking x_t as the query and the input features of moment t and all previous moments as the key-value pairs, to obtain the self-attention value s_t at moment t. Then s_t is input into the image mutual attention layer, which performs mutual attention calculation with the image features F of the image to be analyzed I as the keys and values, to obtain the output feature y_t at moment t. The output feature of each moment t is converted into a number by the linear layer to obtain the attribute element a_t output at that moment, finally yielding the attribute sequence A = {a_1, a_2, …, a_T} of length T. For example, the numerical processing of the linear layer may be to multiply the feature input to the linear layer by a weight matrix to obtain the corresponding attribute element.

Assuming the target visual task requires an attribute sequence of length T to be output, the input feature x_t, the self-attention value s_t, and the output feature y_t at each moment may be determined using formula one, formula two, and formula three, respectively:

Formula one: x_t = y_{t-1}

Formula two: s_t = SelfAttn(W_q · x_t, W_kv · X_{1:t})

Formula three: y_t = CrossAttn(s_t, F)

where t ∈ {1, 2, …, T} and T represents the number of attribute elements in the attribute sequence, one attribute element being output at each moment t; SelfAttn(·) denotes the self-attention computation function and CrossAttn(·) the mutual attention computation function; W_q denotes the query weight and W_kv the key-value weight; X_{1:t} denotes the concatenation matrix of the input feature x_t at moment t and the input features at all moments before t; and x_1 = o at the initial moment, with formula one applying for t > 1.
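A minimal sketch of this time-ordered generation loop, implementing formulas one to three with the SequenceGenerationLayer sketched earlier (shapes and the step interface are our assumptions):

```python
# Autoregressive generation of one attribute sequence (formulas one to three).
import torch

def generate_attribute_sequence(layer, obj_feat, img_feats, T: int):
    # obj_feat: (B, 1, d) single target object feature o, used as input x_1
    x_t, hist, attrs = obj_feat, obj_feat, []
    for _ in range(T):
        y_t, a_t = layer.step(x_t, hist, img_feats)  # formulas two and three
        attrs.append(a_t[:, 0, 0])                   # scalar attribute element a_t, (B,)
        x_t = y_t                                    # formula one: x_t = y_{t-1}
        hist = torch.cat([hist, x_t], dim=1)         # inputs x_1 ... x_t
    return torch.stack(attrs, dim=1)                 # (B, T) attribute sequence
```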
Based on the above working principle, the attribute sequence output by the sequence generation layer can be interpreted as any required meaning according to the specific visual task, and no network parameters related to the specific visual task are needed. That is, the sequence generation layer outputs a universal target representation, using the attribute sequence as the output of the visual task; the same visual analysis model structure can be generalized and applied to different visual tasks, and the visual analysis model has strong extensibility to multiple visual tasks.
The following describes an application of the image analysis method provided by the embodiment of the present invention to a specific visual task, taking a target detection task and a human body posture estimation task as examples.
FIG. 7 is a schematic diagram illustrating the working principle of the visual analysis model. Referring to fig. 7, the image to be analyzed I is input into the image feature coding layer, and the residual network layer and self-attention coding layer of the image feature coding layer first extract the image features F from the image to be analyzed I. F is then input into the self-attention decoding layer, and the set O of target object features output by the self-attention decoding layer is obtained. Each target object feature in O may then be input into the sequence generation layer, which generates the attribute sequence of each target object feature. Trained according to the requirements of the target visual task, each attribute element in the attribute sequence can be regarded as any required attribute, so as to meet the requirements of the target visual task. For the target detection task, the attribute sequence can be defined as the category score and the position and size of the detection rectangle; for the human pose estimation task, the attribute sequence can be defined as the confidence and the coordinates of each key point.
Fig. 8 is a schematic diagram of the operating logic of the sequence generation layer when executing the target detection task. Referring to fig. 8, for the target detection task each target object needs to be represented by an attribute sequence of length 5, whose attribute elements are the category score, the coordinates x and y of the target detection box, and its width w and height h.
Before the visual analysis model provided by the embodiment of the invention is used to execute the target detection task, it needs to be trained on the target detection task. In an example embodiment of the invention, the training objective may be constructed based on a bipartite matching loss. Specifically, the set Ô of all ground-truth target objects may be bipartitely matched with the set O of all output objects of the visual analysis model, so that each ground-truth target object corresponds to a unique output object, the matching objective being to minimize the sum of the losses over all training samples. It can be assumed that the target object matched to an output object o_i is ô_i; when ô_i is the empty set ∅, the output object is matched to no target object. During training of the visual analysis model, the loss between o_i and ô_i is optimized to be minimal.
Illustratively, the output category score may be supervised with the focal loss function L_focal, and the coordinate regression may be supervised with both the L1 loss function L_1 and the generalized intersection over union (GIoU) loss function L_GIoU. The L1 loss function takes the absolute value of the difference between the predicted value and the true value to minimize the error, and is also known as least absolute deviation (LAD).

Based on the above, for the target detection task, the target detection loss function L_det used in training the visual analysis model can be expressed as:

L_det = Σ_{i=1}^{N} L_i^det

where N represents the number of output objects when executing the target detection task, and L_i^det represents the target detection loss value of the i-th output object, which can be calculated by formula four.
Formula four can be expressed as:

L_i^det = L_focal(o_i, ô_i) + 1{ô_i ≠ ∅} · L_1(b_i, b̂_i) + 1{ô_i ≠ ∅} · L_GIoU(b_i, b̂_i)

where ô_i represents the i-th object labeled in the training sample; 1{·} represents the indicator function, whose value is 1 when the condition ô_i ≠ ∅ is true and 0 otherwise; L_focal(o_i, ô_i) represents the focal loss between o_i and ô_i; L_1(b_i, b̂_i) represents the L1 error between the predicted detection box b_i and the labeled detection box b̂_i; and L_GIoU(b_i, b̂_i) represents the generalized intersection-over-union loss between b_i and b̂_i.
When training the visual analysis model for a target detection task, a first sample image may be obtained and first label information labeled on the detection targets in the first sample image; the labeled first sample image is then used as a training sample, and the visual analysis model is trained with the loss function L_det. The trained visual analysis model can be used to detect the detection targets.
With reference to figs. 7 and 8, when performing target detection, the image to be analyzed may be input into the image feature coding layer for image feature extraction, the extracted image features input into the self-attention decoding layer to obtain the target object features, and each target object feature then input into the sequence generation layer for serialization, obtaining the attribute sequence of each target object feature. For example, for a target object feature o, o can be input into the sequence generation layer as the input feature x_1 of the initial moment and processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_1; after the linear layer transformation, y_1 yields the first attribute element, which may be defined as the classification score. Meanwhile, y_1 serves as the input feature x_2 of the next moment, is input into the sequence generation layer, and is processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_2; after the linear layer transformation, y_2 yields the second attribute element, which may be defined as the horizontal coordinate x of the target detection box. By analogy, the sequence generation layer may output in turn the vertical coordinate y of the target detection box and its width w and height h. For example, the detected target object is object a in fig. 7, and the corresponding attribute sequence is sequence 72.
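Putting the earlier sketches together, decoding one detection target would then look roughly like this (all tensors, shapes, and the untrained random weights are hypothetical; it relies on the SequenceGenerationLayer and generate_attribute_sequence sketches above):

```python
# Hypothetical end-to-end use for one detection target:
# T = 5 attribute elements, read back as score, x, y, w, h.
import torch

feats = torch.randn(1, 400, 256)   # image features F (assumed shape)
obj = torch.randn(1, 1, 256)       # one target object feature o from the decoding layer
layer = SequenceGenerationLayer(d_model=256)
seq = generate_attribute_sequence(layer, obj, feats, T=5)  # shape (1, 5)
score, x, y, w, h = seq[0].tolist()
print(f"score={score:.3f}, box=({x:.1f}, {y:.1f}, {w:.1f}, {h:.1f})")
```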
With reference to fig. 7, fig. 9 is a schematic diagram of the operating logic of the sequence generation layer when executing the human pose estimation task. Referring to fig. 9, for the human pose estimation task each target object needs to be represented by an attribute sequence of length 35, whose attribute elements are the confidence con and the coordinates (x_k, y_k) of 17 key points, where k is the index of the key point and (x_k, y_k) represents the coordinates of the k-th key point.
Before the visual analysis model provided by the embodiment of the invention is used to execute the human pose estimation task, the model needs to be trained on the human pose estimation task. In an exemplary embodiment of the invention, the training objective may be constructed with the bipartite matching loss; for the construction principle, refer to the description of the target detection task above, which is not repeated here.
Illustratively, the output confidence con may be supervised with the binary cross entropy (BCE) loss function L_BCE; the key point coordinates may be supervised with both the L1 loss function L_1 and the object keypoint similarity (OKS) loss function L_OKS.
Based on this, for the human pose estimation task, the pose estimation loss function L_pose used in training the visual analysis model can be expressed as:

L_pose = Σ_{i=1}^{M} L_i^pose

where M represents the number of output objects when executing the human pose estimation task, and L_i^pose represents the pose estimation loss value of the i-th output object, which can be calculated by formula five.

Formula five can be expressed as:

L_i^pose = L_BCE(c_i, 1{ô_i ≠ ∅}) + 1{ô_i ≠ ∅} · L_1(P_i, P̂_i) + 1{ô_i ≠ ∅} · L_OKS(P_i, P̂_i)

where 1{·} represents the indicator function, whose value is 1 when the condition ô_i ≠ ∅ is true and 0 otherwise; L_BCE(c_i, 1{ô_i ≠ ∅}) represents the binary cross entropy loss between the predicted confidence c_i and the indicator function; L_1(P_i, P̂_i) represents the L1 error between the predicted key points P_i and the labeled key points P̂_i; and L_OKS(P_i, P̂_i) represents the keypoint similarity loss between P_i and P̂_i.
When training the visual analysis model for the human pose estimation task, a second sample image may be obtained and second label information labeled on the human key points in the second sample image; the labeled second sample image is then used as a training sample, and the visual analysis model is trained with the loss function L_pose. The trained visual analysis model can be used to detect human key points.
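A hedged sketch of such a pose loss, with a simplified OKS-style similarity term (the exact OKS formulation, shapes, and constants here are our assumptions, not the patent's):

```python
# Sketch of the pose estimation loss of formula five (simplified OKS term).
import torch
import torch.nn.functional as F

def oks_loss(kpts, gt_kpts, area, sigmas):
    # kpts, gt_kpts: (K, 2); area: object scale; sigmas: (K,) per-keypoint constants
    d2 = ((kpts - gt_kpts) ** 2).sum(-1)
    oks = torch.exp(-d2 / (2 * area * sigmas ** 2)).mean()
    return 1 - oks   # higher similarity -> lower loss

def pose_loss(conf, kpts, gt_kpts, matched, area, sigmas):
    # conf: (M,) logits; kpts, gt_kpts: (M, K, 2); matched: (M,) bool
    bce = F.binary_cross_entropy_with_logits(conf, matched.float(), reduction="sum")
    total = bce
    for i in torch.nonzero(matched).flatten():
        total = total + F.l1_loss(kpts[i], gt_kpts[i], reduction="sum")
        total = total + oks_loss(kpts[i], gt_kpts[i], area[i], sigmas)
    return total
```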
With reference to figs. 7 and 9, when estimating human pose, the image to be analyzed may be input into the image feature coding layer for image feature extraction, the extracted image features input into the self-attention decoding layer to obtain the target object features, and each target object feature then input into the sequence generation layer for serialization, obtaining the attribute sequence of each target object feature. For example, for a target object feature o, o can be input into the sequence generation layer as the input feature x_1 of the initial moment and processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_1; after the linear layer transformation, y_1 yields the first attribute element, which may be defined as the confidence con. Meanwhile, y_1 serves as the input feature x_2 of the next moment, is input into the sequence generation layer, and is processed by the sequence self-attention layer and the image mutual attention layer in turn to obtain the output feature y_2; after the linear layer transformation, y_2 yields the second attribute element, which may be defined as the horizontal coordinate of the position of the first human key point. By analogy, the sequence generation layer can continue to output in turn the vertical coordinate of the position of the first human key point and the coordinates of the other human key point positions. For example, the detected human key points are the key points of human body 73 in fig. 7, and the corresponding attribute sequence is sequence 74.
According to the image analysis method provided by the embodiment of the invention, the image to be analyzed for the target visual task is obtained and input into the visual analysis model, which is trained on sample images corresponding to the target visual task and label data corresponding to the sample images, and the attribute sequence of each target object feature for the target visual task in the image to be analyzed, output by the visual analysis model, can be obtained. On the one hand, various different visual tasks can be redefined as the problem of generating an attribute sequence describing each target object, unifying the task output forms of different visual tasks; the output attribute sequence can be flexibly defined, according to the target visual task, with whatever meaning is required to describe the target object, meeting diversified requirements. Visual analysis models with the same structure can thus process various different visual tasks, which improves the universality of the visual analysis model across visual tasks, improves visual task processing efficiency, and reduces development cost. On the other hand, the hierarchical structure information in the image to be analyzed is fully considered, the image features can be organized with 'objects' as the basic unit, and the accuracy of image feature extraction is improved.
The following describes the image analysis apparatus provided by the present invention, and the image analysis apparatus described below and the image analysis method described above may be referred to in correspondence with each other.
Fig. 10 schematically illustrates a structural diagram of an image analysis apparatus according to an embodiment of the present invention, and referring to fig. 10, the image analysis apparatus 1000 may include an obtaining module 1010 and an analyzing module 1020. The obtaining module 1010 may be configured to obtain an image to be analyzed of a target visual task; the analysis module 1020 may be configured to input the image to be analyzed into the visual analysis model, and obtain an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; the visual analysis model is used for extracting target object features in image features of an image to be analyzed based on a target visual task and generating an attribute sequence describing the target object features; the visual analysis model is obtained by training based on the sample image corresponding to the target visual task and the label data corresponding to the sample image.
In an example embodiment, the analysis module 1020 may include: the first analysis unit is used for inputting the image to be analyzed into an image characteristic coding layer of the visual analysis model and obtaining the image characteristics of the image to be analyzed output by the image characteristic coding layer; the second analysis unit is used for inputting the image characteristics into a self-attention decoding layer of the visual analysis model and obtaining target object characteristics aiming at the target visual task in the image characteristics output by the self-attention decoding layer; and the third analysis unit is used for inputting the target object characteristics into the sequence generation layer of the visual analysis model and obtaining the attribute sequence of the target object characteristics output by the sequence generation layer.
In an example embodiment, the sequence generation layer may output the attribute sequence of the target object feature in time series; the sequence generation layer may include a sequence self-attention layer, an image mutual attention layer, and a linear layer. The sequence self-attention layer can be used for performing self-attention calculation by taking the input features at the current moment as query and the input features at the current moment and all the previous input features as key values; the image mutual attention layer can be used for carrying out mutual attention calculation on the output of the sequence self-attention layer by taking the image characteristics as key values; the linear layer may be used to perform a digitization process on the output of the image mutual attention layer to obtain an attribute sequence.
In an example embodiment, the third analysis unit may include: the first analysis subunit may be configured to input, as the input feature at the current time, the output feature at the previous time of the image mutual attention layer into the sequence self-attention layer, and obtain a self-attention value at the current time, where the input feature at the initial time of the sequence self-attention layer is a target object feature; the second analysis subunit is configured to input the self-attention value at the current time into the image mutual-attention layer, and obtain an output feature of the image mutual-attention layer at the current time; and the third analysis subunit is configured to input the output characteristic at the current time into the linear layer, to obtain attribute elements output by the linear layer at the current time, where the attribute elements form an attribute sequence.
In an example embodiment, the first analysis unit may include: the fourth analysis subunit is configured to input the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model, and obtain the initial image features of the image to be analyzed output by the residual network layer, where the residual network layer may be used to map the image to be analyzed to an image feature space; and the fifth analysis subunit is configured to input the initial image features into the self-attention coding layer of the image feature coding layer, and obtain the image features of the image to be analyzed output by the self-attention coding layer, where the self-attention coding layer may be used for self-attention coding.
In an example embodiment, the length of the sequence of attributes output by the visual analytics model is determined based on the target visual task.
Fig. 11 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 11, the electronic device 1100 may include a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 communicate with one another via the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
In addition, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a standalone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program that may be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the image analysis method provided by the above method embodiments, which may include: acquiring an image to be analyzed of a target visual task; inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model; where the visual analysis model is used for extracting the target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features, and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method described in the various embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An image analysis method, comprising:
acquiring an image to be analyzed of a target visual task;
inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, of each target object feature for the target visual task in the image to be analyzed;
wherein the visual analysis model is used for extracting target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the attribute sequence is presented in the form of scalar numbers;
the visual analysis model is obtained by training based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
2. The image analysis method according to claim 1, wherein the inputting the image to be analyzed into a visual analysis model, and obtaining the attribute sequence of each target object feature for the target visual task in the image to be analyzed output by the visual analysis model comprises:
inputting the image to be analyzed into an image feature coding layer of the visual analysis model, and obtaining the image features of the image to be analyzed output by the image feature coding layer;
inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining the target object features for the target visual task in the image features output by the self-attention decoding layer; and
inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer.
3. The image analysis method according to claim 2, wherein the sequence generation layer outputs the attribute sequence of the target object features step by step over time; the sequence generation layer comprises a sequence self-attention layer, an image mutual attention layer, and a linear layer;
the sequence self-attention layer is used for performing self-attention calculation, taking the input feature at the current time step as the query and the input features at the current and all previous time steps as the keys and values;
the image mutual attention layer is used for performing mutual-attention calculation on the output of the sequence self-attention layer, taking the image features as the keys and values; and
the linear layer is used for numerically processing the output of the image mutual attention layer to obtain the attribute sequence.
4. The image analysis method according to claim 3, wherein the inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer comprises:
inputting the output feature of the image mutual attention layer at the previous time step into the sequence self-attention layer as the input feature at the current time step, and obtaining the self-attention value of the sequence self-attention layer at the current time step, wherein the input feature of the sequence self-attention layer at the initial time step is the target object feature;
inputting the self-attention value at the current time step into the image mutual attention layer, and obtaining the output feature of the image mutual attention layer at the current time step; and
inputting the output feature at the current time step into the linear layer, and obtaining the attribute element output by the linear layer at the current time step, the attribute elements forming the attribute sequence.
5. The image analysis method according to claim 2, wherein the inputting the image to be analyzed into an image feature coding layer of the visual analysis model, and obtaining the image features of the image to be analyzed output by the image feature coding layer comprises:
inputting the image to be analyzed into a residual network layer of the image feature coding layer of the visual analysis model, and obtaining the initial image features of the image to be analyzed output by the residual network layer, wherein the residual network layer is used for mapping the image to be analyzed into an image feature space; and
inputting the initial image features into a self-attention coding layer of the image feature coding layer, and obtaining the image features of the image to be analyzed output by the self-attention coding layer, wherein the self-attention coding layer is used for self-attention coding.
6. The image analysis method according to any one of claims 1 to 5, wherein the length of the attribute sequence is determined based on the target visual task.
7. An image analysis apparatus, comprising:
the acquisition module is used for acquiring an image to be analyzed of the target visual task;
the analysis module is used for inputting the image to be analyzed into a visual analysis model, and obtaining an attribute sequence, output by the visual analysis model, of each target object feature for the target visual task in the image to be analyzed;
wherein the visual analysis model is used for extracting target object features from the image features of the image to be analyzed based on the target visual task and generating an attribute sequence describing the target object features; the attribute sequence is presented in the form of scalar numbers; and the visual analysis model is trained based on a sample image corresponding to the target visual task and label data corresponding to the sample image.
8. The image analysis apparatus according to claim 7, wherein the analysis module comprises:
the first analysis unit is used for inputting the image to be analyzed into an image feature coding layer of the visual analysis model and obtaining the image features of the image to be analyzed output by the image feature coding layer;
the second analysis unit is used for inputting the image features into a self-attention decoding layer of the visual analysis model, and obtaining the target object features for the target visual task in the image features output by the self-attention decoding layer; and
the third analysis unit is used for inputting the target object features into a sequence generation layer of the visual analysis model, and obtaining the attribute sequence of the target object features output by the sequence generation layer.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image analysis method according to any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the image analysis method according to any one of claims 1 to 6.
CN202210851146.1A 2022-07-20 2022-07-20 Image analysis method and device and electronic equipment Active CN115082430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210851146.1A CN115082430B (en) 2022-07-20 2022-07-20 Image analysis method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115082430A CN115082430A (en) 2022-09-20
CN115082430B true CN115082430B (en) 2022-12-06

Family

ID=83259426

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113449538A (en) * 2020-03-24 2021-09-28 顺丰科技有限公司 Visual model training method, device, equipment and storage medium
CN113705325A (en) * 2021-06-30 2021-11-26 天津大学 Deformable single-target tracking method and device based on dynamic compact memory embedding
CN113761888A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Text translation method and device, computer equipment and storage medium
CN113792112A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language task processing system, training method, device, equipment and medium
CN114529758A (en) * 2022-01-25 2022-05-24 哈尔滨工业大学 Multi-modal emotion analysis method based on contrast learning and multi-head self-attention mechanism
CN114550313A (en) * 2022-02-18 2022-05-27 北京百度网讯科技有限公司 Image processing method, neural network, and training method, device, and medium thereof
CN114638960A (en) * 2022-03-22 2022-06-17 平安科技(深圳)有限公司 Model training method, image description generation method and device, equipment and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489627B1 (en) * 2008-08-28 2013-07-16 Adobe Systems Incorporated Combined semantic description and visual attribute search
JP6670698B2 (en) * 2016-07-04 2020-03-25 日本電信電話株式会社 Image recognition model learning device, image recognition device, method, and program
US11195048B2 (en) * 2020-01-23 2021-12-07 Adobe Inc. Generating descriptions of image relationships
CN112257727B (en) * 2020-11-03 2023-10-27 西南石油大学 Feature image extraction method based on deep learning self-adaptive deformable convolution
CN112418330A (en) * 2020-11-26 2021-02-26 河北工程大学 Improved SSD (solid State drive) -based high-precision detection method for small target object
CN114332479A (en) * 2021-12-23 2022-04-12 浪潮(北京)电子信息产业有限公司 Training method of target detection model and related device
CN114443763A (en) * 2022-01-06 2022-05-06 山东大学 Big data synchronization method based on distributed network
CN114581682A (en) * 2022-02-22 2022-06-03 阿里巴巴(中国)有限公司 Image feature extraction method, device and equipment based on self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant