CN113033580B

CN113033580B - Image processing method, device, storage medium and electronic equipment

Info

Publication number: CN113033580B
Application number: CN202110351439.9A
Authority: CN
Inventors: 吴昊; 陈嘉诚; 王长虎
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2024-02-02
Anticipated expiration: 2041-03-31
Also published as: CN113033580A

Abstract

The disclosure relates to an image processing method, an image processing device, a storage medium and an electronic device. It provides an adaptive pooling mode that supports variable-length features, so that the manpower and time required by the pooling operation in an image processing process are reduced, image pooling efficiency is improved, and image processing efficiency is further improved. The image processing method comprises the following steps: acquiring target image features corresponding to a target image to be processed; determining a position vector whose length is consistent with the length of the target image features, wherein the position vector comprises a plurality of position numbers which are sequentially arranged; converting the position vector into a two-dimensional position coding vector; determining a pooling coefficient of the target image features according to the position coding vector and a sequence model; and carrying out dot multiplication on the pooling coefficient and the target image features to obtain an image pooling result of the target image.

Description

Image processing method, device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, a storage medium, and an electronic apparatus.

Background

In the field of image processing, pooling can integrate feature points within a small neighborhood to obtain new features, also known as feature aggregation. The pooling methods in the related art include max pooling, k-max pooling, average pooling, and the like. In practical applications, a pooling mode needs to be selected manually according to the type of the image feature extractor, and when the type of the image feature extractor changes, the pooling mode needs to be selected again, which consumes labor and time. Taking k-max pooling as an example, multiple experiments with different values of k are needed to find the optimal feature aggregation function, so the parameter tuning cost is high.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides an image processing method, the method comprising:

acquiring corresponding target image characteristics in a target image to be processed;

determining a position vector with the length consistent with the length of the target image characteristic, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers are in one-to-one correspondence with the target image characteristic;

converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors;

determining a pooling coefficient of the target image feature according to the position coding vector and the sequence model;

and carrying out dot multiplication on the pooling coefficient and the target image characteristic to obtain an image pooling result of the target image.

In a second aspect, the present disclosure provides an image processing apparatus, the apparatus comprising:

the acquisition module is used for acquiring corresponding target image characteristics in the target image to be processed;

the first determining module is used for determining a position vector with the length consistent with the length of the target image characteristic, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers are in one-to-one correspondence with the target image characteristic;

the conversion module is used for converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors;

the second determining module is used for determining the pooling coefficient of the target image characteristic according to the position coding vector and the sequence model;

and the dot multiplication module is used for dot multiplying the pooling coefficient and the target image characteristic to obtain an image pooling result of the target image.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device implements the steps of the method described in the first aspect.

In a fourth aspect, the present disclosure provides an electronic device, including:

a storage device having a computer program stored thereon;

processing means for executing said computer program in said storage means to carry out the steps of the method described in the first aspect.

Through the above technical solution, a position vector whose length is consistent with the length of the target image feature can be determined, and the position vector is then converted into a two-dimensional position coding vector, so that the pooling coefficient is determined according to the two-dimensional position coding vector and the image pooling result of the target image is obtained. The pooling coefficient can therefore change adaptively according to the target image features: when the extracted image features change because the type of the feature extractor changes, the pooling mode does not need to be determined again and repeated experimental parameter searches can be avoided, which reduces the labor and time consumed in the image pooling process. Moreover, the pooling coefficient can change adaptively according to the length of the target image feature, so that adaptive pooling of variable-length features is supported, image features of different lengths can be pooled flexibly, and the pooling requirements of different scenarios are met.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flowchart illustrating an image processing method according to an exemplary embodiment of the present disclosure;

FIG. 2 is a process diagram of an image processing method according to an exemplary embodiment of the present disclosure;

FIG. 3 is a block diagram of an image processing apparatus according to an exemplary embodiment of the present disclosure;

FIG. 4 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units. It is further noted that references to "one" or "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

In the field of image processing, pooling can integrate feature points within a small neighborhood to obtain new features, also known as feature aggregation. Generalized image pooling can be understood as follows: the feature set F extracted by the feature extractor has dimension N×D, where N is the number of feature elements in the feature set F and D is the number of feature channels. For a picture feature, N may be the number of grid cells of the feature, i.e., N = H×W, where H is the height of the image feature map and W is its width; for video, N may be the number of video frames; for text, N may be the number of words or word segments. The image pooling operator Φ may be defined as:

$$\Phi : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{D}$$

That is, by performing feature aggregation (i.e., pooling) on the N D-dimensional feature vectors of the feature set F along the dimension of the number of feature elements (N), a global feature vector can be obtained.
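As an illustration of what such an operator looks like in practice, the following minimal sketch implements the three conventional pooling modes named in the Background as functions that map N feature elements with D channels to a single D-dimensional vector. The function names and sample values are hypothetical, and k-max pooling is assumed here to mean the average of the k largest values per channel.

```python
from typing import List

Features = List[List[float]]  # N rows (feature elements), each with D channels


def max_pool(features: Features) -> List[float]:
    """Keep the largest value of each channel."""
    return [max(column) for column in zip(*features)]


def mean_pool(features: Features) -> List[float]:
    """Average each channel over the N feature elements."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def k_max_pool(features: Features, k: int) -> List[float]:
    """Average the k largest values of each channel (assumed definition)."""
    return [sum(sorted(column, reverse=True)[:k]) / k for column in zip(*features)]


# N = 3 feature elements, D = 2 channels (illustrative values only).
F = [[8.0, 1.0],
     [9.0, 5.0],
     [4.0, 3.0]]

pooled_max = max_pool(F)        # [9.0, 5.0]
pooled_mean = mean_pool(F)      # [7.0, 3.0]
pooled_2max = k_max_pool(F, 2)  # [8.5, 4.0]
```

Each function fixes one aggregation rule in advance, which is why switching feature extractors may require choosing a different function or a different k by hand.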

Pooling is widely used in various image processing procedures. For example, consider an image-text embedding model that learns visual and text representations by exploiting the correlation between images and text. Features are first extracted from the input image and the input text respectively, the features of the image modality and the text modality are then aggregated (i.e., pooled) respectively, and the aggregated features are mapped into a shared low-dimensional space, thereby training the image-text embedding model. The trained image-text embedding model can then be applied directly to scenarios such as image-text matching and retrieval, for example, given a query text, finding the picture in a database that is most relevant to the text. Alternatively, the picture representations and text representations obtained by training can provide high-quality content-side features for a recommendation system, so as to improve its performance and alleviate the cold-start problem. Alternatively, as a multi-modal pre-training task, the trained picture feature extractor or text feature extractor (such as a convolutional neural network or a sequence model) can be transferred to other visual or text tasks (such as object recognition in pictures or text classification) as an initial feature extractor, so as to reduce training difficulty.

The inventors found in experiments that the feature aggregation mode has a large influence on the performance of the image-text embedding model. Moreover, a pooling mode without learnable parameters, selected after careful parameter tuning, is often more effective and faster than the commonly used complex feature pooling modes with learnable parameters. At present, however, when training an image-text embedding model, a grid search over pairwise combinations of all possible pooling modes for the features of the two modalities is needed, and a large number of repeated experiments must be performed to find the best setting that maximizes the performance of the whole model. It is worth noting that video, picture and text data each have their own characteristics, and the features of some modalities are naturally of variable length; the existing pooling modes lack generality, and there is no universal pooling mode that achieves the best effect on different modalities. Within the same modality, the distributions of the feature sets extracted by different feature extractors may also differ markedly, and the pooling module best suited to one feature extractor does not necessarily achieve the best effect on another. For example, for the image feature modality, there are recurrent neural network feature extractors (such as GRU and LSTM) and Transformer feature extractors (such as BERT). These two kinds of feature extractors differ in principle, and the pooling modes best suited to them also differ, so an existing pooling mode cannot achieve the optimal effect on both at the same time. Therefore, there is a need for a flexible adaptive pooling module that can automatically adjust the pooling form according to the feature modality type and the feature extractor type.

In addition, the inventors also found that the pooling methods in the related art are only suitable for picture data whose feature length is guaranteed to remain unchanged during training, and cannot achieve a good pooling effect when the input feature length varies. Thus, there is also a need for a pooling approach that better supports variable-length features.

In view of this, the present disclosure provides an image processing method, apparatus, storage medium, and electronic device, so as to provide an adaptive pooling manner supporting variable length features, reduce manpower and time required for pooling operation in an image processing process, and improve image pooling efficiency, thereby improving image processing efficiency.

Fig. 1 is a flowchart illustrating an image processing method according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the image processing method includes:

Step 101, obtaining corresponding target image features in a target image to be processed.

Step 102, determining a position vector with the length consistent with the length of the target image feature, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers are in one-to-one correspondence with the target image features.

Step 103, converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors.

Step 104, determining a pooling coefficient of the target image feature according to the position coding vector and the sequence model.

Step 105, performing dot multiplication on the pooling coefficient and the target image feature to obtain an image pooling result of the target image.

In this way, a position vector whose length is consistent with the length of the target image feature can be determined, and the position vector is then converted into a two-dimensional position coding vector, so that the pooling coefficient is determined according to the two-dimensional position coding vector and the image pooling result of the target image is obtained. The pooling coefficient can therefore change adaptively according to the target image features: when the extracted image features change because the type of the feature extractor changes, the pooling mode does not need to be determined again and repeated experimental parameter searches can be avoided, which reduces the labor and time consumed in the image pooling process. Moreover, the pooling coefficient can change adaptively according to the length of the target image feature, so that adaptive pooling of variable-length features is supported, image features of different lengths can be pooled flexibly, and the pooling requirements of different scenarios are met.

In order that those skilled in the art may better understand the image processing method provided in the present disclosure, the above steps are described in detail below by way of example.

For example, in an image-text matching scenario, such as a scenario in which a corresponding text is retrieved by means of a picture, the target image to be processed may be acquired by obtaining an image input by the user in response to an image input operation triggered by the user, or by obtaining an image captured by the electronic device in real time with the user's authorization, and so on. Of course, in other image processing scenarios, the target image to be processed may be acquired in any other possible manner, for example, by downloading a publicly available image from the network, or by taking each frame image of a video as a target image, which is not limited in the embodiments of the present disclosure.

After the target image to be processed is acquired, image features in the target image may be extracted by an image feature extractor (e.g., a convolutional neural network, a sequence model, etc.). Alternatively, in the case where the target image is a video frame image, the image features of each video frame may be extracted by a video feature extractor to obtain the target image features. The image processing method provided by the embodiments of the present disclosure can therefore be applied to pooling the frame image features extracted from a video in a video-text matching process.

In a possible manner, all the extracted image features may be subjected to subsequent processing as target image features. In addition, the inventor researches and discovers that, in the model training stage, the generalization capability of the model obtained by training can be improved by randomly discarding the characteristic elements (such as randomly discarding some grids in the characteristic grids for picture characteristics) before pooling. Therefore, in another possible manner, all image features corresponding to the target image to be processed may be acquired first, and then at least one image feature in all image features is randomly discarded to obtain the target image feature corresponding to the target image.
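The random discarding described above can be sketched as follows. This is only an illustrative sketch: the function name `randomly_discard`, the `drop_ratio` parameter and the choice to keep the surviving elements in their original order are assumptions, not details given in this disclosure.

```python
import random
from typing import List, Optional


def randomly_discard(features: List[List[float]], drop_ratio: float,
                     rng: Optional[random.Random] = None) -> List[List[float]]:
    """Randomly drop feature elements, keeping the survivors in their original order."""
    rng = rng or random.Random()
    n = len(features)
    # Discard at least one element, but always keep at least one.
    num_drop = min(n - 1, max(1, round(n * drop_ratio)))
    dropped = set(rng.sample(range(n), num_drop))
    return [row for i, row in enumerate(features) if i not in dropped]


# All image features of one image: N = 10 feature elements, D = 2 channels.
all_features = [[float(i), float(i) * 2.0] for i in range(10)]
target_features = randomly_discard(all_features, drop_ratio=0.2, rng=random.Random(0))
```

Because the number of surviving elements differs from the original N, the pooling step that follows has to cope with features of varying length.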

Then, a position vector having a length consistent with the target image feature length may be determined, the position vector including a plurality of position numbers sequentially arranged, the position numbers corresponding to the target image features one by one. For example, if the number of feature elements in the feature set F extracted by the feature extractor is N, and all the image features in the feature set are taken as target image features, that is, the target image feature length is N, the position vector can be determined as (1, 2, …, N).

The position vector may then be converted into a two-dimensional position coding vector. In the example above, the position numbers 1 to N in the position vector may be converted into a two-dimensional position coding vector containing position information. The position coding vectors differ from one another, and their arrangement order is consistent with the arrangement order of the corresponding position numbers in the position vector. It should be appreciated that the position coding vector preserves the magnitude relationship between the position numbers in the position vector, so that in the subsequent pooling process the pooling coefficient determined from the position coding vector can be dot-multiplied with the image features so as to fit any pooling operation, including max pooling, mean pooling and k-max pooling, and achieve adaptive pooling of variable-length features.

In a possible manner, the obtaining of the target image features corresponding to the target image to be processed may be: and acquiring target image characteristics corresponding to at least one channel in the target image to be processed. Accordingly, converting the position vector into a two-dimensional position-coded vector may be: and determining a one-dimensional vector corresponding to each position number in the position vectors, wherein the dimension of the one-dimensional vector is consistent with the channel number of the target image feature, so as to obtain a two-dimensional position coding vector corresponding to the position vector.

For example, the feature set F extracted by the feature extractor has dimension N×D, where N is the number of feature elements in the feature set F, i.e., the feature length, and D is the number of feature channels. When the target image feature length of each channel is N, the position vector corresponding to the target image feature of each channel is (1, 2, …, N). For each position number in the position vector, a one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature is determined, so that the two-dimensional position coding vector p corresponding to the position vector is obtained as $p = (p_1, \ldots, p_k, \ldots, p_N)$, where k is an integer greater than 1 and less than N. Each element of the position coding vector p (i.e., any of $p_1$ to $p_N$) is a D×1 one-dimensional vector.

It should be understood that the specific values of each one-dimensional vector in the position coding vector may be determined according to a preset position coding function, or each position number may be assigned a learnable vector as its position coding vector, so that the model learns the position coding vector of each position number during training. Of course, the position vector may also be converted into a two-dimensional position coding vector in other possible manners, which are not limited by the embodiments of the present disclosure. The two possible ways mentioned above are explained below.

In a possible manner, determining a one-dimensional vector corresponding to the position number, the dimension of which corresponds to the channel number of the target image feature, may be: and determining a one-dimensional vector with the dimension consistent with the channel number of the target image feature corresponding to the position number according to a position coding function, wherein the position coding function is used for determining element values of odd positions in the one-dimensional vector through first conversion calculation and determining element values of even positions in the one-dimensional vector through second conversion calculation.

For example, the feature set F extracted by the feature extractor has dimension N×D, where N is the number of feature elements in the feature set F, i.e., the feature length, and D is the number of feature channels. When the target image feature length of each channel is N, the position vector corresponding to the target image feature of each channel is (1, 2, …, N). In this case, the formula of the position coding function may be:

$$p_k^{(i)} = \begin{cases} \sin(w_j \cdot k), & i = 2j \\ \cos(w_j \cdot k), & i = 2j + 1 \end{cases}$$

wherein $p_k^{(i)}$ is the element at position $i$ of the one-dimensional vector $p_k$ corresponding to position number $k$, and $w_j$ is the coefficient that multiplies the position number for index $j$.

That is, the element values at even positions in the one-dimensional vector are calculated by $\sin(w_j \cdot k)$, and the element values at odd positions in the one-dimensional vector are calculated by $\cos(w_j \cdot k)$. Alternatively, in other possible ways, the first conversion calculation formula may be $\cos(w_j \cdot k)$ and the second conversion calculation formula may be $\sin(w_j \cdot k)$, which is not limited by the embodiments of the present disclosure.

In this way, the position coding function converts the position numbers of the features, after the features have been sorted by magnitude, into continuous vectors while preserving the order and relative distance information of the position numbers, which effectively reduces the learning difficulty and processing difficulty of the subsequent sequence model.
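A position coding function of the sin/cos kind described above might be sketched as follows. The frequency term is an assumption: the sketch uses w_j = 1 / 10000^(2j/D), the choice commonly seen in Transformer-style sinusoidal encodings, and it treats element indices as zero-based. Neither choice is stated here as the method of this disclosure.

```python
import math
from typing import List


def position_encoding(n: int, d: int, base: float = 10000.0) -> List[List[float]]:
    """Return N position coding vectors, each holding D values.

    ASSUMPTION: w_j = 1 / base ** (2 * j / d), borrowed from the common
    Transformer sinusoidal encoding; element indices are zero-based.
    """
    encoding = []
    for k in range(1, n + 1):      # position numbers 1..N
        row = []
        for i in range(d):         # element index inside the D-dimensional vector
            j = i // 2
            w_j = 1.0 / base ** (2 * j / d)
            # Even index uses sin(w_j * k); odd index uses cos(w_j * k).
            row.append(math.sin(w_j * k) if i % 2 == 0 else math.cos(w_j * k))
        encoding.append(row)
    return encoding


p = position_encoding(n=5, d=4)  # 5 position numbers, 4 channels
```

Every position number receives a distinct vector, and the same function can be called with any N, which is what allows the later steps to handle features of different lengths.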

In another possible manner, the image processing method provided by the embodiments of the present disclosure may be encapsulated in an image processing model, and determining the one-dimensional vector corresponding to the position number, whose dimension is consistent with the number of channels of the target image feature, may be: determining the one-dimensional vector according to pre-trained parameters of the image processing model. The pre-trained parameters of the image processing model can be obtained by training in the following way: a sample vector is determined for each position number in a sample position vector whose length is consistent with the length of the sample image feature, to obtain a sample position coding vector, where each sample vector is a one-dimensional vector whose dimension equals the number of channels of the sample image feature. The sample image features are then pooled according to the sample position coding vector, and an image processing result is determined from the pooled sample image features. Finally, the parameters of the image processing model are adjusted according to this image processing result and the sample image processing result pre-labeled for the sample image corresponding to the sample image features.

For example, the image processing model may be an image-text matching model, in which case the sample image processing result is the image-text matching result corresponding to the sample image; or the image processing model may be an image classification model, in which case the sample image processing result is the image classification result corresponding to the sample image; and so on.

Taking the image-text matching model as an example, the training process can comprise the following steps:

1. Pair the mutually matched image modality samples $x_A$ and text modality samples $x_B$ in the sample dataset to form sample pairs $(x_A, x_B)$.

2. For each sample pair $(x_A, x_B)$, perform the following processing:

2.1. Extract the features of the image modality sample $x_A$ with an image feature extractor to obtain the feature set $F_A = f_A(x_A)$.

2.2. From the feature set $F_A$, which contains $N_A$ feature elements, randomly discard a certain proportion of features to obtain a feature set $F_A'$ containing $N_A'$ feature elements, where $N_A' < N_A$.

2.3. For the feature set $F_A'$, obtain the pooled feature with the pooling operator: $v_A = \phi_A(F_A')$.

2.4. Extract the features of the text modality sample $x_B$ with a text feature extractor to obtain the feature set $F_B = f_B(x_B)$.

2.5. From the feature set $F_B$, which contains $N_B$ feature elements, randomly discard a certain proportion of features to obtain a feature set $F_B'$ containing $N_B'$ feature elements, where $N_B' < N_B$.

2.6. For the feature set $F_B'$, obtain the pooled feature with the pooling operator: $v_B = \phi_B(F_B')$.

2.7. Sample mismatched image modality samples and text modality samples to obtain sample pairs $(x_A, x_{\bar{B}})$ and $(x_{\bar{A}}, x_B)$, obtain the pooled features $(v_A, v_{\bar{B}})$ and $(v_{\bar{A}}, v_B)$ in the manner described above, and calculate the model loss using a metric learning loss function (e.g., triplet loss), such that the pooled features of associated images and texts become more similar and the pooled features of unassociated images and texts become more distant.

3. For each sample pair in step 2 (e.g., $(x_A, x_B)$, $(x_A, x_{\bar{B}})$), train with a stochastic gradient descent algorithm until the model converges.
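The metric learning loss in the steps above can be illustrated with a hinge-based triplet loss. This is a sketch under stated assumptions: cosine similarity as the similarity measure, a margin of 0.2 and the function names are all hypothetical choices, since the text names triplet loss only as an example.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_loss(v_a: List[float], v_b: List[float],
                 v_a_neg: List[float], v_b_neg: List[float],
                 margin: float = 0.2) -> float:
    """Hinge triplet loss over one matched pair and its two mismatched pairs."""
    s_pos = cosine_similarity(v_a, v_b)
    # Matched image vs. mismatched text, and mismatched image vs. matched text.
    loss_text = max(0.0, margin - s_pos + cosine_similarity(v_a, v_b_neg))
    loss_image = max(0.0, margin - s_pos + cosine_similarity(v_a_neg, v_b))
    return loss_text + loss_image


# Pooled features (illustrative values): matched pair and mismatched samples.
v_a, v_b = [1.0, 0.0], [1.0, 0.0]
v_a_neg, v_b_neg = [0.0, 1.0], [0.0, 1.0]

well_separated = triplet_loss(v_a, v_b, v_a_neg, v_b_neg)  # 0.0
confused = triplet_loss(v_a, v_b_neg, v_a_neg, v_b_neg)    # positive loss
```

The loss is zero once the matched pair is more similar than each mismatched pair by at least the margin, so gradient descent only keeps adjusting parameters for pairs that are still confused.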

In step 2.3, a sample vector may be determined for each position number in the sample position vector whose length is consistent with the length of the sample image feature, to obtain a sample position coding vector, and the sample image features are then pooled according to the sample position coding vector. In the subsequent process, as the model is trained by the stochastic gradient descent algorithm until it converges, the model automatically adjusts its parameters to obtain more accurate sample position coding vectors. In the model application stage, a one-dimensional vector corresponding to the position number, whose dimension is consistent with the number of channels of the target image feature, can be determined according to the pre-trained parameters of the model, so as to convert the position vector corresponding to the target image feature into a position coding vector, thereby facilitating the subsequent image processing.

After the position-coding vector is obtained, the pooling coefficient of the target image feature can be determined according to the position-coding vector and the sequence model. In a possible manner, the position-coding vector may be input into a sequence model to obtain a sequence processing result, and then the sequence processing result is normalized to obtain a pooling coefficient of the target image feature.

For example, the sequence model may include a bi-directional gated recurrent neural network model, i.e., a bi-directional GRU model. It should be appreciated that the bi-directional GRU model is a simple and efficient sequence model that has sufficient model expression capability to ensure that it finds the appropriate pooling coefficients during training without incurring excessive computational overhead. Of course, in other possible manners, the sequence model may include Transformer, LSTM (Long Short-Term Memory network), and the like, which is not limited by the embodiments of the present disclosure.

Taking a bidirectional GRU model as an example, the obtained position coding vector p is $p = (p_1, \ldots, p_k, \ldots, p_N)$. The position coding vector may then be converted to a pooling coefficient θ by the bidirectional GRU, where $\theta = (\theta_1, \ldots, \theta_N)$. The pooling coefficient θ can then be normalized by a softmax function such that $\sum_{k=1}^{N} \theta'_k = 1$. That is, the pooling coefficient of the target image feature may be determined by the following formula: $\theta' = \mathrm{softmax}(\mathrm{BiGRU}(p))$.
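To make the softmax-over-bidirectional-GRU step concrete, the following sketch runs a very small GRU in both directions over a sequence of position coding vectors and normalizes the resulting scores. Everything here is hypothetical scaffolding: the weights are random and untrained, biases are omitted, and the linear projection `proj` from the concatenated hidden states to one score per position is an assumption, since the text does not specify how the GRU outputs become scalar coefficients.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Normalise raw scores so that they are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix: List[Vector], vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TinyGRU:
    """Single-layer GRU with random, UNTRAINED weights (hypothetical helper)."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        def rand(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # Input and recurrent weights for the update (z), reset (r) and candidate (n) parts.
        self.w = {g: rand(hidden_size, input_size) for g in "zrn"}
        self.u = {g: rand(hidden_size, hidden_size) for g in "zrn"}

    def _pre(self, gate: str, x: Vector, h: Vector) -> Vector:
        return [a + b for a, b in zip(matvec(self.w[gate], x), matvec(self.u[gate], h))]

    def run(self, inputs: List[Vector]) -> List[Vector]:
        h = [0.0] * self.hidden_size
        outputs = []
        for x in inputs:
            z = [sigmoid(v) for v in self._pre("z", x, h)]
            r = [sigmoid(v) for v in self._pre("r", x, h)]
            gated_h = [ri * hi for ri, hi in zip(r, h)]
            n = [math.tanh(v) for v in self._pre("n", x, gated_h)]
            h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
            outputs.append(h)
        return outputs


def pooling_coefficients(p: List[Vector], fwd: TinyGRU, bwd: TinyGRU, proj: Vector) -> Vector:
    """One normalised coefficient per position, from forward and backward passes."""
    h_fwd = fwd.run(p)
    h_bwd = bwd.run(p[::-1])[::-1]  # run right-to-left, then restore the order
    scores = [sum(w * h for w, h in zip(proj, hf + hb)) for hf, hb in zip(h_fwd, h_bwd)]
    return softmax(scores)


rng = random.Random(0)
D, H = 4, 3  # number of channels and hidden size (illustrative)
fwd, bwd = TinyGRU(D, H, rng), TinyGRU(D, H, rng)
proj = [rng.uniform(-0.5, 0.5) for _ in range(2 * H)]

# Stand-in position coding vectors of two different lengths.
p_short = [[math.sin(k + i) for i in range(D)] for k in range(1, 4)]
p_long = [[math.sin(k + i) for i in range(D)] for k in range(1, 8)]
theta_short = pooling_coefficients(p_short, fwd, bwd, proj)
theta_long = pooling_coefficients(p_long, fwd, bwd, proj)
```

The same weights produce coefficient vectors of length 3 and length 7, which illustrates why a recurrent sequence model suits variable-length features.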

Finally, the pooling coefficient may be dot-multiplied with the target image feature to obtain an image pooling result of the target image. For example, the target image features corresponding to at least one channel may be sorted. For each sorted feature channel, the pooling coefficient θ' is dot-multiplied with the sorted feature values of that channel to obtain the pooled feature value. It should be appreciated that the same pooling coefficient θ' is used for the different channels.

For example, referring to fig. 2, the channel number D is 3 and the length of the target image feature is N. The target image features of each channel are first sorted. For example, the first channel is 8, 9, 4 before sorting and 9, 8, 4 after sorting. For the determination of the pooling coefficient, reference may be made to the lower part of fig. 2. A position vector (1, …, k, …, N) whose length is consistent with the length of the target image feature is first determined. The position vector is then converted into a position-coding vector by a position-coding function. Then, the pooling coefficient $(\theta_1, \ldots, \theta_k, \ldots, \theta_N)$ can be obtained from the position-coding vector by the bidirectional GRU. Finally, the pooling coefficient is dot-multiplied with the sorted target image features. Specifically, for the sorted target image feature of a given channel, each feature value is multiplied by the corresponding pooling coefficient, and all the products are summed (or weighted and summed) to obtain the pooled feature corresponding to that channel. In this way, N D-dimensional feature vectors can be aggregated along the dimension of the number of feature elements (N) to obtain a global D-dimensional pooled feature.
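The sort-then-dot-multiply step can be sketched as follows. This is an illustrative sketch only: the function name `pool_features` is hypothetical, the coefficients are made-up values rather than the output of a trained model, and only the first channel (8, 9, 4) is taken from the example above, with the other two channels invented for illustration.

```python
def pool_features(features, coefficients):
    """Pool N feature vectors of dimension D into one D-dimensional vector.

    features: list of N rows, each with D channel values.
    coefficients: list of N pooling coefficients shared by all channels.
    """
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature position")
    num_channels = len(features[0])
    pooled = []
    for c in range(num_channels):
        # Sort the values of this channel in descending order.
        channel = sorted((row[c] for row in features), reverse=True)
        # Dot product of the shared coefficients with the sorted channel.
        pooled.append(sum(w * v for w, v in zip(coefficients, channel)))
    return pooled


# N = 3 features with D = 3 channels; the first channel reads 8, 9, 4.
features = [[8, 1, 5], [9, 7, 2], [4, 3, 6]]
coefficients = [0.5, 0.25, 0.25]  # hypothetical normalized coefficients
print(pool_features(features, coefficients))  # [7.5, 4.5, 4.75]
```

For the first channel the sorted values are 9, 8, 4, so the pooled value is 0.5 × 9 + 0.25 × 8 + 0.25 × 4 = 7.5.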

In this way, the order information of the individual position numbers can be preserved by the position-coding function, so that the subsequent model can use this information to generate pooling coefficients for the feature values at different positions. Meanwhile, the bidirectional GRU model can process input sequences of any length and can learn during training to fit pooling operations such as max pooling, mean pooling, and k-max pooling. Therefore, the pooling coefficient can change adaptively with the length of the target image feature, repeated experimental parameter searches are avoided, and the labor and time consumed in the image pooling process are reduced.
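To see why a coefficient vector applied to sorted values can express these classical operations, the sketch below writes each of them as a fixed coefficient vector. It assumes that k-max pooling means the average of the k largest values, and all function names are hypothetical. It does not show that a trained model actually converges to these coefficients.

```python
def max_coefficients(n):
    """All weight on the largest value."""
    return [1.0] + [0.0] * (n - 1)


def mean_coefficients(n):
    """Equal weight on every value."""
    return [1.0 / n] * n


def kmax_coefficients(n, k):
    """Equal weight on the k largest values, zero elsewhere (assumed definition)."""
    return [1.0 / k] * k + [0.0] * (n - k)


def pool_sorted(values, coefficients):
    """Sort values in descending order and take the dot product with the coefficients."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(coefficients, ordered))


values = [8.0, 9.0, 4.0]
print(pool_sorted(values, max_coefficients(3)))      # 9.0
print(pool_sorted(values, mean_coefficients(3)))     # approximately 7.0
print(pool_sorted(values, kmax_coefficients(3, 2)))  # 8.5
```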

In addition, the image processing mode provided by the embodiment of the disclosure is applied to the image processing model, so that the parameter adjustment cost can be reduced while the accuracy of an image processing result is improved, and the self-adaptive pooling of the variable-length image features is realized.

For example, the image processing model is an image-text matching model, and an image-text matching experiment is performed on the image-text dataset COCO-Caption. The training set of the dataset contains about 120,000 pictures and the test set contains 5000 pictures, and each picture has 5 corresponding texts describing its content. VSE++ (which uses mean pooling) is adopted as the basic image-text matching method, and the pooling manner provided by the present disclosure is added on top of it. Meanwhile, more complex image-text matching methods, including SCAN and VSRN, are used for comparison. VSE++ is a classical multi-modal embedding method that is commonly used for image-text matching and video-text matching, and it improves the performance of the multi-modal feature matching model by mining harder negative samples in the loss function. SCAN is an image-text matching method that uses an attention mechanism to perform cross-modal modeling of the word information in the text and the object information in the picture. VSRN is an image-text matching method that uses a neural network to aggregate the object and semantic information in the picture and then matches it with the text information. It should be appreciated that the other factors in the execution of each method in the experiment, such as the picture feature extractor (a fast-RCNN pre-trained on COCO object detection), the text feature extractor (a bidirectional GRU), and the size of the input picture (600×1000 for each picture at test time), are exactly the same. Finally, the results shown in table 1 can be obtained:

TABLE 1

Referring to table 1, it can be seen that the pooling manner provided by the present disclosure significantly improves the performance of the basic image-text matching model on the standard dataset while adding only a very small computational overhead, so that the simple basic method VSE++ can surpass the more complex methods SCAN and VSRN proposed in recent years. No additional parameter adjustment process is required, which improves efficiency.

For another example, taking a video-text matching scenario, the pooling manner provided by the present disclosure is applied to the image features of each frame extracted from the video, and the resulting video-text matching result is compared with the result obtained without the pooling manner of the present disclosure. Specifically, a video-text matching experiment was performed on the video-text dataset MSR-VTT. The training set of the dataset comprises 6573 videos, the test set comprises 2990 videos, and the remaining videos belong to the validation set. Each video has 20 corresponding sentences describing its content. VSE++ and HGR are adopted as the basic video-text matching methods, and in each model design the feature aggregation function for video and text is replaced with the feature pooling manner provided by the present disclosure. HGR is a method that matches video and text through cross-modal information matching at different granularities. It should be appreciated that the other factors in the execution of each method in the experiment, such as the video feature extractor (ResNet-152 is used in all cases to extract the picture features of each video frame), the text feature extractor (a bidirectional GRU in all cases), and the video sampling rate, are exactly the same. Finally, the results shown in table 2 can be obtained:

TABLE 2

Video-text matching method	Video recall (Top 1)	Video recall (Top 10)	Text recall (Top 1)
VSE++	8.3%	24.0%	14.4%
VSE++ incorporating the present disclosure	8.7%	25.3%	16.0%
HGR	8.9%	25.8%	14.3%
HGR incorporating the present disclosure	9.1%	25.9%	15.0%

Referring to table 2, it can be seen that the pooling manner provided by the present disclosure not only avoids repeated experimental parameter searches and supports adaptive pooling of variable-length features, but also consistently improves the performance of both basic video-text matching methods on the standard dataset.
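The disclosure does not define how the recall values in the tables are computed. As a rough illustration, the sketch below assumes the common Recall@K convention for retrieval, in which each query has a single correct item and a query counts as a hit if that item appears among the top K retrieved results. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose correct item is among the top-k retrieved items.

    ranked_lists: one ranked list of candidate ids per query (best match first).
    ground_truth: the correct candidate id for each query (one per query, assumed).
    """
    hits = sum(1 for ranked, truth in zip(ranked_lists, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)


# Three toy text queries, each ranking three candidate videos.
ranked = [["v3", "v1", "v2"], ["v2", "v3", "v1"], ["v1", "v2", "v3"]]
truth = ["v1", "v2", "v3"]
print(recall_at_k(ranked, truth, 1))  # one of three queries hits at rank 1
print(recall_at_k(ranked, truth, 2))  # two of three queries hit within rank 2
```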

Based on the same inventive concept, the embodiments of the present disclosure also provide an image processing apparatus. Referring to fig. 3, the image processing apparatus 300 includes:

an acquiring module 301, configured to acquire a corresponding target image feature in a target image to be processed;

a first determining module 302, configured to determine a position vector with a length consistent with a length of the target image feature, where the position vector includes a plurality of position numbers that are sequentially arranged, and the position numbers are in one-to-one correspondence with the target image feature;

a conversion module 303, configured to convert the position vectors into two-dimensional position-coded vectors, where each position-coded vector is different from the other position-coded vector, and an arrangement sequence of the position-coded vectors is consistent with an arrangement sequence of the corresponding position vectors;

a second determining module 304, configured to determine a pooling coefficient of the target image feature according to the position-coding vector and the sequence model;

and the dot multiplication module 305 is configured to dot multiply the pooling coefficient with the target image feature to obtain an image pooling result of the target image.

Optionally, the acquiring module 301 is configured to:

acquiring target image characteristics corresponding to at least one channel in a target image to be processed;

the conversion module 303 is configured to:

and determining a one-dimensional vector corresponding to the position number and having the dimension consistent with the channel number of the target image feature according to each position number in the position vectors so as to obtain a two-dimensional position coding vector corresponding to the position vector.

Optionally, the conversion module 303 is configured to:

and determining a one-dimensional vector corresponding to the position number and having the dimension consistent with the channel number of the target image feature according to a position coding function, wherein the position coding function is used for determining element values of odd positions in the one-dimensional vector through first conversion calculation and determining element values of even positions in the one-dimensional vector through second conversion calculation.

Optionally, the conversion module 303 is configured to:

determining a one-dimensional vector corresponding to the position number and having the dimension consistent with the channel number of the target image feature according to the pre-training parameters of the image processing model, wherein the pre-training parameters of the image processing model are obtained by training in the following mode:

determining a sample vector for each position number in a sample position vector with the length consistent with the length of a sample image feature to obtain a sample position coding vector, wherein the sample vector is a sample one-dimensional vector with the dimension identical with the channel number of the sample image feature;

carrying out pooling treatment on the sample image characteristics according to the sample position coding vector, and determining an image processing result according to the pooled sample image characteristics;

and adjusting parameters of the image processing model according to the sample image processing result pre-marked in the sample image corresponding to the sample image characteristics.

Optionally, the acquiring module 301 is configured to:

acquiring all image features corresponding to a target image to be processed;

and randomly discarding at least one image feature in all the image features to obtain the target image feature corresponding to the target image.
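The random discarding of features can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `randomly_discard` is hypothetical, and the number of discarded features is drawn uniformly between 1 and N-1, which is a choice made here for illustration and is not specified in this passage.

```python
import random


def randomly_discard(features, rng):
    """Drop at least one feature at random while keeping at least one.

    The original order of the remaining features is preserved.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two features to discard one and keep one")
    num_drop = rng.randint(1, n - 1)  # assumed: uniform choice of how many to drop
    dropped = set(rng.sample(range(n), num_drop))
    return [f for i, f in enumerate(features) if i not in dropped]


all_features = ["f1", "f2", "f3", "f4", "f5"]
kept = randomly_discard(all_features, random.Random(0))
print(kept)
```

Because the number of remaining features varies from call to call, the downstream pooling step has to handle variable-length input.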

Optionally, the second determining module 304 is configured to:

inputting the position coding vector into a sequence model to obtain a sequence processing result;

normalizing the sequence processing result to obtain a pooling coefficient of the target image feature.

Optionally, the sequence model comprises a bidirectional gated recurrent neural network model.

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.

Based on the same inventive concept, the embodiments of the present disclosure also provide a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of any of the above-described image processing methods.

Based on the same inventive concept, the embodiments of the present disclosure further provide an electronic device, including:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of any of the image processing methods described above.

Referring now to fig. 4, a schematic diagram of an electronic device 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 4 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 4, the electronic device 400 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage means 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, communications may be made using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring corresponding target image characteristics in a target image to be processed; determining a position vector with the length consistent with the length of the target image characteristic, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers are in one-to-one correspondence with the target image characteristic; converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors; determining a pooling coefficient of the target image feature according to the position coding vector and the sequence model; and carrying out dot multiplication on the pooling coefficient and the target image characteristic to obtain an image pooling result of the target image.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not in some cases define the module itself.

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, example 1 provides an image processing method, the method comprising:

According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, the acquiring a target image feature corresponding to a target image to be processed, including:

the converting the position vector into a two-dimensional position-coded vector includes:

According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, the determining a one-dimensional vector corresponding to the position number, having a dimension consistent with the number of channels of the target image feature, including:

According to one or more embodiments of the present disclosure, example 4 provides the method of example 2, the method encapsulated in an image processing model, the determining a one-dimensional vector corresponding to the location number having a dimension consistent with the number of channels of the target image feature, comprising:

According to one or more embodiments of the present disclosure, example 5 provides the method of any one of examples 1 to 4, the acquiring target image features corresponding to the target image to be processed, including:

acquiring all image features corresponding to a target image to be processed;

According to one or more embodiments of the present disclosure, example 6 provides the method of any one of examples 1-4, the determining the pooling coefficient of the target image feature from the position-coding vector and sequence model, comprising:

In accordance with one or more embodiments of the present disclosure, example 7 provides the method of any one of examples 1-4, the sequence model comprising a bi-directional gated recurrent neural network model.

According to one or more embodiments of the present disclosure, example 8 provides an image processing apparatus, the apparatus comprising:

According to one or more embodiments of the present disclosure, example 9 provides the apparatus of example 8, the acquisition module to:

the conversion module 303 is configured to:

Example 10 provides the apparatus of example 9, according to one or more embodiments of the disclosure, the conversion module to:

Example 11 provides the apparatus of example 9, according to one or more embodiments of the disclosure, the conversion module to:

According to one or more embodiments of the present disclosure, example 12 provides the apparatus of any one of examples 8-11, the acquisition module to:

acquiring all image features corresponding to a target image to be processed;

According to one or more embodiments of the present disclosure, example 13 provides the apparatus of any one of examples 8-11, the second determining module to:

Example 14 provides the apparatus of any of examples 8-11, the sequence model comprising a bi-directional gated recurrent neural network model, in accordance with one or more embodiments of the present disclosure.

According to one or more embodiments of the present disclosure, example 15 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-7.

Example 16 provides an electronic device according to one or more embodiments of the present disclosure, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1-7.

The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by mutually substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. An image processing method, the method comprising:

performing dot multiplication on the pooling coefficient and the target image characteristic to obtain an image pooling result of the target image;

wherein the converting the position vector into a two-dimensional position-coded vector includes:

for each position number in the position vectors, determining a one-dimensional vector corresponding to the position number and having a dimension consistent with the channel number of the target image feature, so as to obtain a two-dimensional position coding vector corresponding to the position vector;

wherein the method is encapsulated in an image processing model, and the determining the one-dimensional vector corresponding to the position number and having the dimension consistent with the channel number of the target image feature comprises the following steps:

according to the sample image processing result pre-marked in the sample image corresponding to the sample image characteristics, adjusting parameters of the image processing model;

the image processing model comprises an image-text matching model.

2. The method according to claim 1, wherein the acquiring the target image feature corresponding to the target image to be processed includes:

and acquiring target image characteristics corresponding to at least one channel in the target image to be processed.

3. The method of claim 1, wherein determining a one-dimensional vector corresponding to the location number having a dimension consistent with the number of channels of the target image feature comprises:

4. A method according to any one of claims 1-3, wherein the acquiring the target image feature corresponding to the target image to be processed comprises:

acquiring all image features corresponding to a target image to be processed;

5. A method according to any of claims 1-3, wherein said determining the pooling coefficients of the target image feature from the position-coding vector and sequence model comprises:

6. A method according to any one of claims 1-3, wherein the sequence model comprises a bidirectional gated recurrent neural network model.

7. An image processing apparatus, characterized in that the apparatus comprises:

the point multiplication module is used for carrying out point multiplication on the pooling coefficient and the target image characteristic to obtain an image pooling result of the target image;

wherein, the conversion module is further used for:

wherein the device is encapsulated in an image processing model, the conversion module is specifically further configured to:

the image processing model comprises an image-text matching model.

8. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-6.

9. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.