CN113033580A

CN113033580A - Image processing method, image processing device, storage medium and electronic equipment

Info

Publication number: CN113033580A
Application number: CN202110351439.9A
Authority: CN
Inventors: 吴昊; 陈嘉诚; 王长虎
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-06-25
Anticipated expiration: 2041-03-31
Also published as: CN113033580B

Abstract

The present disclosure relates to an image processing method, an image processing apparatus, a storage medium, and an electronic device. They provide an adaptive pooling mode that supports variable-length features, reduce the manpower and time required for the pooling operation in image processing, improve image pooling efficiency, and thereby improve image processing efficiency. The image processing method comprises the following steps: acquiring a corresponding target image feature in a target image to be processed; determining a position vector whose length is consistent with the length of the target image feature, wherein the position vector comprises a plurality of position numbers which are sequentially arranged; converting the position vector into a two-dimensional position-coding vector; determining pooling coefficients of the target image feature according to the position-coding vector and a sequence model; and performing point multiplication on the pooling coefficients and the target image feature to obtain an image pooling result of the target image.

Description

Image processing method, image processing device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, a storage medium, and an electronic device.

Background

In the field of image processing, pooling integrates feature points in a small neighborhood into new features and is also referred to as feature aggregation. The pooling modes in the related art include maximum pooling (max-pooling), k-maximum pooling (k-max-pooling), mean pooling (average pooling), and the like. In practical applications, the pooling mode must be selected manually according to the type of the image feature extractor, and when the type of the image feature extractor changes, the pooling mode must be selected again, which consumes labor and time. Taking k-maximum pooling as an example, multiple experiments with different k values are needed to find the optimal feature aggregation function, so the parameter tuning cost is high.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a method of image processing, the method comprising:

acquiring corresponding target image characteristics in a target image to be processed;

determining a position vector with the length consistent with the target image characteristic length, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers correspond to the target image characteristics one by one;

converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors;

determining pooling coefficients of the target image features according to the position coding vectors and a sequence model;

and performing point multiplication on the pooling coefficient and the target image characteristics to obtain an image pooling result of the target image.

In a second aspect, the present disclosure provides an image processing apparatus, the apparatus comprising:

the acquisition module is used for acquiring corresponding target image characteristics in a target image to be processed;

the first determining module is used for determining a position vector with the length consistent with the characteristic length of the target image, the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers correspond to the characteristics of the target image one by one;

the conversion module is used for converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with that of the corresponding position vectors;

a second determining module, configured to determine pooling coefficients of the target image features according to the position coding vector and a sequence model;

and the dot multiplication module is used for performing dot multiplication on the pooling coefficient and the target image characteristics to obtain an image pooling result of the target image.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.

In a fourth aspect, the present disclosure provides an electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.

Through the above technical solution, a position vector whose length is consistent with the length of the target image feature can be determined and then converted into a two-dimensional position-coding vector, so that the pooling coefficients are determined from the two-dimensional position-coding vector and the image pooling result of the target image is obtained. The pooling coefficients can therefore change adaptively with the target image feature: when the type of feature extractor changes and the extracted image features change with it, the pooling mode does not need to be determined again, and repeated experiments for parameter tuning and search can be avoided, which reduces the manpower and time consumed in the image pooling process. Moreover, because the pooling coefficients also change adaptively with the length of the target image feature, adaptive pooling of variable-length features is supported, image features of different lengths can be pooled flexibly, and the pooling requirements of different scenarios can be met.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flow chart illustrating a method of image processing according to an exemplary embodiment of the present disclosure;

FIG. 2 is a process diagram illustrating a method of image processing according to an exemplary embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units. It is further noted that the modifiers "a", "an", and "the" in the present disclosure are intended to be illustrative rather than limiting, and those skilled in the art will recognize that they should be understood as "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

In the field of image processing, pooling integrates feature points in a small neighborhood into new features and is also referred to as feature aggregation. Generalized image pooling can be understood as follows. A feature set F extracted by a feature extractor has dimension N × D, where N is the number of feature elements in the feature set F and D is the number of feature channels. For picture features, N may be the number of grid cells of the feature, that is, N = H × W, where H is the height of the image feature map and W is its width; for videos, N may be the number of video frames; for texts, N may be the number of characters or word segments. An image pooling operator Φ may then be defined as:

$$\Phi: \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{D}$$

That is, by performing feature aggregation (i.e., pooling) on the N D-dimensional feature vectors of the feature set F along the dimension of the number of feature elements (N), a global feature vector can be obtained.
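As a non-limiting illustration, the fixed pooling modes of the related art can be sketched as operators that map an N × D feature set to a D-dimensional vector. The function names, the list-of-lists layout, and the choice of averaging the k largest values for k-maximum pooling are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import List

Features = List[List[float]]  # N rows (feature elements) x D columns (channels)


def max_pool(features: Features) -> List[float]:
    """Channel-wise maximum over the N feature elements."""
    return [max(channel) for channel in zip(*features)]


def mean_pool(features: Features) -> List[float]:
    """Channel-wise mean over the N feature elements."""
    return [sum(channel) / len(channel) for channel in zip(*features)]


def k_max_pool(features: Features, k: int) -> List[float]:
    """Channel-wise mean of the k largest values (assumed definition)."""
    return [sum(sorted(channel, reverse=True)[:k]) / k for channel in zip(*features)]


# Hypothetical feature set with N = 3 feature elements and D = 2 channels.
F = [[8.0, 9.0], [9.0, 8.0], [4.0, 4.0]]
pooled_max = max_pool(F)
pooled_mean = mean_pool(F)
pooled_kmax = k_max_pool(F, 2)
```

Each operator collapses the N dimension and keeps the D dimension, which is the behavior that the operator Φ describes.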

Pooling is widely used in a variety of image processing procedures. For example, in an image-text embedding model for learning visual representations and text representations by using the correlation between images and texts, first, feature extraction is performed on input images and input texts, then, features of an image modality and a text modality are aggregated (i.e., pooling processing) respectively, and the aggregated features are mapped to a shared low-dimensional space, thereby implementing training of the image-text embedding model. Therefore, the trained image-text embedding model can be directly applied to scenes such as image-text matching, retrieval and the like, for example, a query text is given, and the most relevant image in the database is found out. Or the trained picture representation and text representation can provide high-quality content side features for the recommendation system, so that the performance of the recommendation system is improved, and the cold start problem is relieved. Or, as a multi-modal pre-training task, the trained image feature extractor or text feature extractor (such as a convolutional neural network, a sequence model, etc.) may be migrated to other visual or text tasks (such as image object recognition, text classification, etc.), and used as an initial feature extractor, so as to reduce the training difficulty.

The inventors found in experiments that the feature aggregation mode has a large influence on the performance of the image-text embedding model. Moreover, a pooling mode without learnable parameters, selected after careful parameter tuning, performs better and runs faster than the complex pooling modes with learnable parameters that are commonly used at present. However, when the image-text embedding model is trained, a grid search must then be performed over all possible pooling modes for the features of both modalities, and a large number of repeated experiments are needed to ensure that the optimal setting is found and the performance of the whole model is maximized. It should be noted that data of modalities such as video, pictures and text each have their own characteristics, and the features of some modalities are naturally variable in length; the existing pooling modes are not general, and no single pooling mode achieves the best effect on all modalities. Within the same modality, the distributions of the feature sets extracted by different feature extractors may also differ markedly, so the pooling module most suitable for one feature extractor does not necessarily achieve the best effect on another. For example, for an image feature modality, there are recurrent-neural-network-type feature extractors (such as GRU, LSTM, etc.) and Transformer-type feature extractors (such as BERT, etc.), which work on different principles and have different most suitable pooling modes, and an existing pooling mode cannot achieve the optimal effect on both at the same time. Therefore, a flexible adaptive pooling module capable of automatically adjusting the pooling form according to the feature modality type and the feature extractor type is needed.

In addition, the inventor also researches and discovers that the pooling method in the related art is only suitable for picture data which can ensure that the characteristic length is not changed during training, but a good pooling effect cannot be obtained when the input characteristic length is changed. Therefore, we also need a pooling scheme that better supports the variable length feature.

In view of the above, the present disclosure provides an image processing method, an image processing apparatus, a storage medium, and an electronic device, so as to provide an adaptive pooling manner supporting a variable length feature, reduce manpower and time required for pooling operation in an image processing process, improve image pooling efficiency, and further improve image processing efficiency.

Fig. 1 is a flowchart illustrating an image processing method according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the image processing method includes:

Step 101, acquiring a corresponding target image feature in a target image to be processed.

Step 102, determining a position vector whose length is consistent with the length of the target image feature, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers correspond to the target image features one by one.

Step 103, converting the position vector into two-dimensional position-coding vectors, wherein each position-coding vector is different, and the arrangement order of the position-coding vectors is consistent with the arrangement order of the corresponding position numbers.

Step 104, determining pooling coefficients of the target image feature according to the position-coding vectors and a sequence model.

Step 105, performing point multiplication on the pooling coefficients and the target image feature to obtain an image pooling result of the target image.

By the above method, a position vector whose length is consistent with the length of the target image feature can be determined and then converted into a two-dimensional position-coding vector, so that the pooling coefficients are determined from the two-dimensional position-coding vector and the image pooling result of the target image is obtained. The pooling coefficients can therefore change adaptively with the target image feature: when the type of feature extractor changes and the extracted image features change with it, the pooling mode does not need to be determined again, and repeated experiments for parameter tuning and search can be avoided, which reduces the manpower and time consumed in the image pooling process. Moreover, because the pooling coefficients also change adaptively with the length of the target image feature, adaptive pooling of variable-length features is supported, image features of different lengths can be pooled flexibly, and the pooling requirements of different scenarios can be met.

In order to make the image processing method provided by the present disclosure more understandable to those skilled in the art, the above steps are exemplified in detail below.

For example, in a context of image-text matching, such as a context of retrieving a corresponding text through a picture, acquiring a target image to be processed may be acquiring an image input by a user in response to an image input operation triggered by the user, or acquiring an image taken by the electronic device in real time if authorized by the user, and so on. Of course, in other image processing scenarios, acquiring the target image to be processed may also be any other possible image acquisition manner, such as downloading a public image from a network, or acquiring a frame image in each frame of video as the target image, and the like, which is not limited in this disclosure.

After the target image to be processed is acquired, image features in the target image may be extracted by an image feature extractor (such as a convolutional neural network, a sequence model, etc.). Alternatively, in the case where the target image is a video frame image, extracting the image features in the target image may be extracting the image features of each frame of video by a video feature extractor to obtain the target image features. Therefore, the image processing method provided by the embodiment of the disclosure can be applied to the video text matching process to perform pooling processing on the frame image features extracted from the video.

In a possible manner, all the extracted image features may be subjected to subsequent processing as target image features. In addition, the inventor researches and discovers that in the model training stage, random discarding of feature elements (for example, for picture features, randomly discarding some grids in the feature grid) before pooling can improve the generalization capability of the trained model. Therefore, in another possible manner, all image features corresponding to the target image to be processed may be obtained first, and then at least one image feature of all the image features may be discarded randomly to obtain the target image feature corresponding to the target image.
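The random discarding of feature elements described above can be sketched as follows. This is a minimal illustration only: the function name `drop_features`, the `drop_ratio` parameter, and the decision to preserve the original order of the kept elements are assumptions, since the disclosure only states that at least one image feature is discarded at random.

```python
import random
from typing import List, Optional


def drop_features(
    features: List[List[float]],
    drop_ratio: float,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Randomly discard a proportion of the N feature elements (training only)."""
    rng = rng or random.Random()
    n = len(features)
    n_drop = max(1, int(n * drop_ratio))  # discard at least one element
    n_drop = min(n_drop, n - 1)           # but always keep at least one
    keep = sorted(rng.sample(range(n), n - n_drop))  # kept indices, original order
    return [features[i] for i in keep]


# Hypothetical feature set with N = 8 feature elements and D = 1 channel.
all_features = [[float(i)] for i in range(8)]
target_features = drop_features(all_features, 0.25, random.Random(0))
```

Because the length of the resulting target image feature differs from call to call, the later steps must cope with a variable feature length.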

Then, a position vector whose length is consistent with the length of the target image feature may be determined, the position vector including a plurality of position numbers arranged in sequence, the position numbers corresponding to the target image features one to one. For example, the number of feature elements in the feature set F extracted by the feature extractor is N, and if all image features in the feature set are used as target image features, that is, the length of the target image feature is N, the position vector can be determined as $(1, 2, \dots, N)$.

The position vector may then be converted into a two-dimensional position-coding vector. In the example above, the position numbers 1 to N in the position vector are each converted into a position-coding vector containing position information, and these together form the two-dimensional position-coding vector. Each position-coding vector is different, and the arrangement order of the position-coding vectors is consistent with the arrangement order of the corresponding position numbers. It should be understood that the position-coding vectors preserve the ordering relationship among the position numbers in the position vector, so that in the subsequent pooling process the pooling coefficients determined from the position-coding vectors can be point-multiplied with the ordered image features to fit any pooling operation, including maximum pooling, mean pooling and k-maximum pooling, thereby realizing adaptive pooling of variable-length features.
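To make the fitting claim concrete, the following sketch shows coefficient vectors that reproduce maximum pooling, mean pooling and k-maximum pooling when point-multiplied with the values of one channel. It assumes that the values are sorted in descending order and that k-maximum pooling averages the k largest values; the helper names are illustrative only.

```python
from typing import List


def weighted_pool(sorted_values: List[float], theta: List[float]) -> float:
    """Dot product of pooling coefficients with the ordered values of one channel."""
    return sum(t * v for t, v in zip(theta, sorted_values))


def max_coeffs(n: int) -> List[float]:
    return [1.0] + [0.0] * (n - 1)          # all weight on position 1


def mean_coeffs(n: int) -> List[float]:
    return [1.0 / n] * n                    # equal weight on every position


def k_max_coeffs(n: int, k: int) -> List[float]:
    return [1.0 / k] * k + [0.0] * (n - k)  # equal weight on the first k positions


# Hypothetical channel values, sorted in descending order (assumption).
values = sorted([8.0, 9.0, 4.0, 6.0], reverse=True)
as_max = weighted_pool(values, max_coeffs(len(values)))
as_mean = weighted_pool(values, mean_coeffs(len(values)))
as_kmax = weighted_pool(values, k_max_coeffs(len(values), 2))
```

A learned coefficient vector can take any of these shapes or an intermediate one, which is why a single pooling module can stand in for the manually selected modes.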

In a possible manner, the obtaining of the target image feature corresponding to the target image to be processed may be: and acquiring the target image characteristics corresponding to at least one channel in the target image to be processed. Accordingly, converting the position vector into a two-dimensional position-coding vector may be: and aiming at each position number in the position vector, determining a one-dimensional vector corresponding to the position number and having dimensionality consistent with the channel number of the target image characteristic so as to obtain a two-dimensional position coding vector corresponding to the position vector.

For example, a feature set F extracted by the feature extractor has dimension N × D, where N is the number of feature elements in the feature set F (that is, the feature length) and D is the number of feature channels. When the target image feature of each channel has length N, the position vector corresponding to the target image feature of each channel is $(1, 2, \dots, N)$. For each position number in the position vector, a one-dimensional vector corresponding to that position number and having a dimension consistent with the number of channels of the target image feature is determined, so that the two-dimensional position-coding vector p corresponding to the position vector is obtained as

$$p = (p_1, \dots, p_k, \dots, p_N)$$

where k is an integer greater than 1 and less than N. Each element of the position-coding vector p (i.e., any of $p_1$ to $p_N$) is a one-dimensional vector of size D × 1.

It should be understood that the specific value of each one-dimensional vector in the position coding vector may be determined according to a preset position coding function, or a learnable vector may be numbered for each position as the position coding vector thereof, so that the model learns the position coding vector of each position number in the training process. Of course, the position vector may also be converted into a two-dimensional position encoding vector according to other possible ways, which is not limited by the embodiment of the present disclosure. The two possibilities mentioned above are explained below.

In a possible manner, the one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature may be determined as follows: and determining a one-dimensional vector corresponding to the position number and having dimensionality consistent with the channel number of the target image characteristic according to a position coding function, wherein the position coding function is used for determining the element value at the odd position in the one-dimensional vector through first conversion calculation and determining the element value at the even position in the one-dimensional vector through second conversion calculation.

For example, a feature set F extracted by a feature extractor has dimension N × D, where N is the number of feature elements in the feature set F (that is, the feature length) and D is the number of feature channels. When the target image feature of each channel has length N, the position vector corresponding to the target image feature of each channel is $(1, 2, \dots, N)$. In this case, the position-coding function may take the form:

$$p_k^{(i)} = \begin{cases} \sin(w_j k), & i \text{ even} \\ \cos(w_j k), & i \text{ odd} \end{cases}$$

where $p_k^{(i)}$ denotes the element at position i of the one-dimensional vector for position number k. That is, the element values at even positions in the one-dimensional vector are calculated by the first conversion calculation $\sin(w_j k)$, and the element values at odd positions in the one-dimensional vector are calculated by the second conversion calculation $\cos(w_j k)$. Alternatively, in other possible ways, the first conversion calculation may be $\cos(w_j k)$ and the second conversion calculation may be $\sin(w_j k)$, which is not limited by the embodiments of the present disclosure.

Through the mode, the position coding function can convert the position numbers of the features which are sequenced according to the sizes into continuous vectors, and retains the sequence and relative distance information in the position numbers, so that the learning difficulty and the processing difficulty of a subsequent sequence model can be effectively reduced.
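A position-coding function of this kind can be sketched as below. The frequency schedule `w_j = 1 / base ** (2 * j / d)` with `base = 10000` is an assumption borrowed from the widely used Transformer convention and is not stated in the present text; the function name and the zero-based element index are likewise illustrative.

```python
import math
from typing import List


def sinusoidal_position_encoding(n: int, d: int, base: float = 10000.0) -> List[List[float]]:
    """Return N position-coding vectors of dimension D for position numbers 1..N."""
    encoding = []
    for k in range(1, n + 1):                  # position numbers 1, 2, ..., N
        row = []
        for i in range(d):                     # zero-based element index
            j = i // 2                         # elements 2j and 2j + 1 share w_j
            w_j = 1.0 / (base ** (2 * j / d))  # assumed frequency schedule
            # even index: sin(w_j * k); odd index: cos(w_j * k)
            row.append(math.sin(w_j * k) if i % 2 == 0 else math.cos(w_j * k))
        encoding.append(row)
    return encoding


# Hypothetical sizes: N = 5 position numbers, D = 4 channels.
p = sinusoidal_position_encoding(5, 4)
```

Since the function is evaluated directly from the position number, it can produce a position-coding vector for any feature length without retraining.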

In another possible way, the image processing method provided by the embodiment of the present disclosure may be encapsulated in an image processing model, and the one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature may be determined according to pre-trained parameters of the image processing model. The pre-trained parameters of the image processing model may be obtained by training in the following way. First, for each position number in a sample position vector whose length is consistent with the length of the sample image feature, a sample vector is determined to obtain a sample position-coding vector, where the sample vector is a one-dimensional vector whose dimension equals the number of channels of the sample image feature. Then, the sample image feature is pooled according to the sample position-coding vector, and an image processing result is determined from the pooled sample image feature. Finally, the parameters of the image processing model are adjusted according to the image processing result and the pre-labeled sample image processing result of the sample image corresponding to the sample image feature.

For example, the image processing model may be an image-text matching model, in which case the sample image processing result may be a text matching result corresponding to the sample image; or the image processing model may be an image classification model, in which case the sample image processing result is an image classification result corresponding to the sample image, and so on.

Taking the image-text matching model as an example, the training process may include the following steps:

1. Pair each image modality sample $x_A$ in the sample data set with its matching text modality sample $x_B$ to form a sample pair $(x_A, x_B)$.

2. For each sample pair $(x_A, x_B)$, the following processing is performed:

2.1 Extract the features of the image modality sample $x_A$ with an image feature extractor $f_A$ to obtain a feature set $F_A = f_A(x_A)$.

2.2 Randomly discard a certain proportion of the features in the feature set $F_A$, which has $N_A$ feature elements, to obtain a feature set $F_A'$ with $N_A'$ feature elements, where $N_A' < N_A$.

2.3 Obtain the pooled feature of the feature set $F_A'$ using a pooling operator: $v_A = \phi_A(F_A')$.

2.4 Extract the features of the text modality sample $x_B$ with a text feature extractor $f_B$ to obtain a feature set $F_B = f_B(x_B)$.

2.5 Randomly discard a certain proportion of the features in the feature set $F_B$, which has $N_B$ feature elements, to obtain a feature set $F_B'$ with $N_B'$ feature elements, where $N_B' < N_B$.

2.6 Obtain the pooled feature of the feature set $F_B'$ using a pooling operator: $v_B = \phi_B(F_B')$.

2.7 Sample an unmatched image modality sample and an unmatched text modality sample to obtain the sample pairs $(x_A, x_{/B})$ and $(x_{/A}, x_B)$, obtain the pooled features $(v_A, v_{/B})$ and $(v_{/A}, v_B)$ in the manner described above, and compute the model loss with a metric learning loss function (e.g., triplet loss), so that the pooled features of associated images and texts become closer and those of non-associated images and texts become farther apart.

3. Train on the sample pairs from step 2 (e.g., $(x_A, x_B)$ and $(x_A, x_{/B})$) with a stochastic gradient descent algorithm until the model converges.
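The loss computation for one matched pair and its two unmatched samples can be sketched as follows. The disclosure names triplet loss only as an example of a metric learning loss; the cosine similarity, the hinge form, the margin value of 0.2 and the function names used here are assumptions for illustration.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_loss(
    v_a: List[float],      # pooled image feature v_A
    v_b: List[float],      # pooled feature of the matching text, v_B
    v_a_neg: List[float],  # pooled feature of an unmatched image, v_/A
    v_b_neg: List[float],  # pooled feature of an unmatched text, v_/B
    margin: float = 0.2,   # assumed margin
) -> float:
    """Hinge-based triplet loss over both retrieval directions."""
    positive = cosine_similarity(v_a, v_b)
    loss_text = max(0.0, margin - positive + cosine_similarity(v_a, v_b_neg))
    loss_image = max(0.0, margin - positive + cosine_similarity(v_a_neg, v_b))
    return loss_text + loss_image


# Hypothetical pooled features: a well-separated case and a badly matched case.
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0])
```

The loss is zero when the matched pair is more similar than each unmatched pair by at least the margin, and it grows as unmatched pairs become more similar than the matched one.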

In step 2.3, a sample vector may be determined for each position number in a sample position vector having a length consistent with the length of the sample image feature to obtain a sample position encoding vector, and then the sample image feature may be pooled according to the sample position encoding vector. Therefore, in the subsequent process, the model can automatically adjust parameters to obtain more accurate sample coding vectors in the process of training to model convergence through the stochastic gradient descent algorithm. In the stage of model application, a one-dimensional vector corresponding to the position number and having dimensions consistent with the number of channels of the target image feature can be determined according to the pre-training parameters of the model, so that the position vector corresponding to the target image feature is converted into a position coding vector, and subsequent image processing is facilitated.
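The learnable alternative can be pictured as a lookup table holding one D-dimensional vector per position number, with the table entries updated by gradient descent. The class below is a hypothetical sketch: the name `LearnedPositionTable`, the fixed maximum length, the Gaussian initialization and the plain gradient step are all assumptions, and the gradients would in practice come from the training loss of the whole model.

```python
import random
from typing import List


class LearnedPositionTable:
    """One learnable D-dimensional vector per position number (illustrative)."""

    def __init__(self, max_len: int, channels: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(channels)] for _ in range(max_len)
        ]

    def encode(self, length: int) -> List[List[float]]:
        """Position-coding vectors for position numbers 1..length."""
        if length > len(self.table):
            raise ValueError("feature length exceeds the number of learned positions")
        return [list(self.table[k]) for k in range(length)]

    def sgd_step(self, grads: List[List[float]], lr: float) -> None:
        """Apply one gradient-descent update to the first len(grads) rows."""
        for row, grad_row in zip(self.table, grads):
            for c, g in enumerate(grad_row):
                row[c] -= lr * g


# Hypothetical sizes: up to 5 positions, D = 3 channels.
positions = LearnedPositionTable(max_len=5, channels=3)
before = positions.encode(4)
positions.sgd_step([[1.0, 1.0, 1.0]], lr=0.1)
after = positions.encode(4)
```

Unlike a position-coding function, a table of this kind only covers the position numbers it was allocated for, which is a trade-off of this particular sketch.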

After the position coding vector is obtained, the pooling coefficient of the target image feature can be determined according to the position coding vector and the sequence model. In a possible approach, the position-coding vector may be input into a sequence model to obtain a sequence processing result, and then the sequence processing result is normalized to obtain pooling coefficients of the target image features.

Illustratively, the sequence model may comprise a bidirectional gated recurrent unit model, i.e., a bidirectional GRU model. It should be appreciated that the bidirectional GRU model is a simple and efficient sequence model with sufficient representation capability to find suitable pooling coefficients during training without incurring excessive computational overhead. Of course, in other possible ways, the sequence model may include a Transformer, an LSTM (Long Short-Term Memory network), and the like, which is not limited in this disclosure.

Taking the bidirectional GRU model as an example, after the position-coding vector p is obtained as $p = (p_1, \dots, p_N)$, it may be converted into pooling coefficients θ by the bidirectional GRU, where $\theta = (\theta_1, \dots, \theta_N)$. The pooling coefficients θ may then be normalized by a softmax function such that

$$\sum_{k=1}^{N} \theta'_k = 1$$

That is, the pooling coefficients of the target image feature may be determined by the following formula: $\theta' = \mathrm{softmax}(\mathrm{BiGRU}(p))$.
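The computation θ' = softmax(BiGRU(p)) can be sketched in plain Python as follows. This is an untrained toy: the weights are random, biases are omitted, the hidden size is arbitrary, and the linear projection from the concatenated hidden states to one score per position is an assumption, since the text does not specify how the GRU output is reduced to a scalar.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def _rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def _matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell with random, untrained weights (biases omitted)."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random) -> None:
        self.hidden_size = hidden_size
        # Input and recurrent matrices for the update, reset and candidate gates.
        self.w = [_rand_matrix(hidden_size, input_size, rng) for _ in range(3)]
        self.u = [_rand_matrix(hidden_size, hidden_size, rng) for _ in range(3)]

    def step(self, x: Vector, h: Vector) -> Vector:
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.w[0], x), _matvec(self.u[0], h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.w[1], x), _matvec(self.u[1], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(_matvec(self.w[2], x), _matvec(self.u[2], rh))]
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

    def run(self, sequence: List[Vector]) -> List[Vector]:
        h = [0.0] * self.hidden_size
        states = []
        for x in sequence:
            h = self.step(x, h)
            states.append(h)
        return states


def softmax(scores: Vector) -> Vector:
    top = max(scores)                       # subtract the maximum for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooling_coefficients(p: List[Vector], hidden_size: int = 4, seed: int = 0) -> Vector:
    """theta' = softmax(BiGRU(p)) with one coefficient per position number."""
    rng = random.Random(seed)
    d = len(p[0])
    forward, backward = GRUCell(d, hidden_size, rng), GRUCell(d, hidden_size, rng)
    out_w = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]  # assumed projection
    h_f = forward.run(p)
    h_b = backward.run(p[::-1])[::-1]       # re-align backward states with positions
    scores = [sum(w * h for w, h in zip(out_w, hf + hb)) for hf, hb in zip(h_f, h_b)]
    return softmax(scores)


# Hypothetical position-coding vectors for N = 5 and N = 7, with D = 2.
theta_5 = pooling_coefficients([[0.1 * k, 0.2 * k] for k in range(1, 6)])
theta_7 = pooling_coefficients([[0.1 * k, 0.2 * k] for k in range(1, 8)])
```

The same weights serve both input lengths, which illustrates how a recurrent sequence model yields one normalized coefficient per position for a feature of any length.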

Finally, the pooling coefficients and the target image feature may be point-multiplied to obtain an image pooling result of the target image. For example, the target image features corresponding to at least one channel may be sorted. For each sorted feature channel, the pooled feature value can be obtained by point-multiplying the pooling coefficients θ' with the sorted feature values of that channel. It should be understood that the same pooling coefficients θ' are used for different channels.

For example, referring to fig. 2, the number of channels D is 3, and the length of the target image feature is N. The target image features of each channel are sorted first. For example, the first channel is ranked at 8, 9, and 4, and the second channel is ranked at 9, 8, and 4. For the determination of the pooling coefficients, reference may be made to the lower half of fig. 2. A position vector (1, …, k, …, N) whose length is consistent with the target image feature length is first determined. The position vector is then converted into a position coding vector by means of a position coding function.

Then, the pooling coefficients $(\theta_1, \ldots, \theta_k, \ldots, \theta_N)$ can be derived from the position coding vector by passing it through the bidirectional GRU. Finally, the pooling coefficients and the sorted target image features are point-multiplied. Specifically, the sorted target image feature of a given channel is multiplied element by element with the pooling coefficients, and all the products are added, or weighted and added, to obtain the pooled feature corresponding to that channel. In this way, feature aggregation can be performed on N D-dimensional feature vectors along the dimension of the number of feature elements (N), and a global D-dimensional pooled feature can be obtained.
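The per-channel sort and point multiplication can be sketched as follows. The feature values and coefficients are illustrative only, the function name `adaptive_pool` is hypothetical, and descending order is an assumption, since the text states that each channel is sorted but this passage does not state the direction.

```python
def adaptive_pool(features, theta):
    """Pool N feature vectors of dimension D into one D-dimensional vector.

    features: list of N lists, each of length D.
    theta: list of N pooling coefficients, shared by all channels.
    """
    n = len(features)
    if len(theta) != n:
        raise ValueError("need one coefficient per feature position")
    d = len(features[0])
    pooled = []
    for ch in range(d):
        # Sort the values of this channel in descending order (assumed order).
        column = sorted((row[ch] for row in features), reverse=True)
        # Multiply position by position and add up the products.
        pooled.append(sum(t * v for t, v in zip(theta, column)))
    return pooled


# N = 3 feature vectors with D = 3 channels (illustrative values).
features = [[8.0, 9.0, 1.0],
            [9.0, 8.0, 5.0],
            [4.0, 4.0, 3.0]]
theta = [0.5, 0.3, 0.2]  # illustrative coefficients that sum to one
result = adaptive_pool(features, theta)
print(result)  # one pooled value per channel
```

The first channel, for instance, is sorted to 9, 8, 4 and pooled as 0.5·9 + 0.3·8 + 0.2·4, which is about 7.7.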

In this way, the order information of the respective position numbers can be retained by the position coding function, so that the subsequent model uses this information to generate pooling coefficients for the feature values of different positions. Meanwhile, the bidirectional GRU model can handle input sequences of arbitrary length and has the ability to fit arbitrary pooling operations including max pooling, mean pooling, k-max pooling in training. Therefore, the pooling coefficient can be changed in a self-adaptive manner according to the length of the target image characteristic, and multiple times of experiment parameter adjustment searching is avoided, so that the consumed manpower and time in the image pooling process are reduced.
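To see why coefficients applied to sorted features can express the conventional pooling operations, the following sketch writes each of them as a fixed coefficient vector. The helper names are hypothetical, descending sort order is assumed, and k-max pooling is taken here to mean the average of the k largest values.

```python
def pool_sorted(values, theta):
    """Weighted sum of one channel's values after sorting in descending order."""
    ordered = sorted(values, reverse=True)
    return sum(t * v for t, v in zip(theta, ordered))


def max_coeffs(n):
    # All weight on the largest value.
    return [1.0] + [0.0] * (n - 1)


def mean_coeffs(n):
    # Equal weight on every value.
    return [1.0 / n] * n


def kmax_coeffs(n, k):
    # Equal weight on the k largest values, zero elsewhere.
    return [1.0 / k] * k + [0.0] * (n - k)


values = [4.0, 9.0, 8.0, 1.0]
n = len(values)
print(pool_sorted(values, max_coeffs(n)))      # same as max(values)
print(pool_sorted(values, mean_coeffs(n)))     # same as the mean of values
print(pool_sorted(values, kmax_coeffs(n, 2)))  # same as the mean of the two largest
```

A learned coefficient vector can sit anywhere between these special cases, which is what removes the need to pick a pooling type or a value of k by hand.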

In addition, the image processing method provided by the embodiment of the disclosure is applied to the image processing model, so that the accuracy of the image processing result can be improved, the parameter adjusting cost can be reduced, and the adaptive pooling of the variable-length image features can be realized.

For example, the image processing model is an image-text matching model, and an image-text matching experiment is performed on the image-text data set COCO-Caption. The training set of the data set contains about 120,000 pictures and the test set contains 5,000 pictures, and each picture has 5 corresponding texts describing its content. VSE++ (which uses mean pooling) is adopted as the basic image-text matching method, and the pooling mode provided by the present disclosure is added on this basis. More complex image-text matching methods, namely SCAN and VSRN, are used for comparison. VSE++ is a classical multi-modal embedding method that is commonly used for image-text matching and video-text matching, and that improves the performance of the multi-modal feature matching model by mining harder negative samples in the loss function. SCAN is an image-text matching method that uses an attention mechanism to perform cross-modal modeling of the word information in the text and the object information in the picture. VSRN is an image-text matching method that uses a neural network to aggregate the object and semantic information in a picture and then matches it with the text information. It should be understood that the other factors in the implementation of each method in the experiment, such as the picture feature extractor (fast-RCNN pre-trained on COCO object detection), the text feature extractor (a bidirectional GRU), and the size of the input picture (600 × 1000 at test time), are all identical. Finally, the results shown in table 1 can be obtained:

TABLE 1

As can be seen from table 1, the pooling method provided by the present disclosure significantly improves the performance of the basic image-text matching model on the standard data set while adding only a very small computational overhead, so that the simple basic method VSE++ can surpass the more complex, recently proposed methods SCAN and VSRN without any additional parameter tuning, thereby improving efficiency.

For another example, in a video-text matching scenario, the pooling provided by the present disclosure is applied to the image features of each frame extracted from the video, and the resulting video-text matching result is compared with that obtained without the pooling provided by the present disclosure. Specifically, a video-text matching experiment is carried out on the video-text data set MSR-VTT. The training set contains 6573 video segments, the test set contains 2990 video segments, and the remaining video segments belong to the validation set. Each video has 20 corresponding sentences describing its content. VSE++ and HGR are adopted as the basic video-text matching methods, and in each model the feature aggregation functions for videos and texts are replaced with the feature pooling mode provided by the present disclosure. HGR is a method that matches videos and texts through cross-modal information matching at different granularities. It should be understood that the other factors in the implementation of each method in the experiment, such as the video feature extractor (ResNet-152 is used in all cases to extract the picture features of each video frame), the text feature extractor (a bidirectional GRU in all cases), and the video sampling rate, are all the same. Finally, the results shown in table 2 can be obtained:

TABLE 2

Video text matching method	Video recall rate (Top1)	Video recall rate (Top10)	Text recall rate (Top1)
VSE++	8.3%	24.0%	14.4%
VSE++ combined with the present disclosure	8.7%	25.3%	16.0%
HGR	8.9%	25.8%	14.3%
HGR combined with the present disclosure	9.1%	25.9%	15.0%

As can be seen from table 2, the pooling method provided by the present disclosure not only avoids repeated experiments to search for parameters and supports adaptive pooling of variable-length features, but also consistently improves the performance of the two basic video-text matching methods on a standard data set.
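The size of the improvement can be read off table 2 directly. The short sketch below only subtracts the reported values to give the absolute gain in percentage points for each method; the variable names are illustrative and no data beyond table 2 is used.

```python
# Recall values (%) copied from table 2, in the order:
# video recall Top1, video recall Top10, text recall Top1.
table2 = {
    "VSE++": ((8.3, 24.0, 14.4), (8.7, 25.3, 16.0)),
    "HGR": ((8.9, 25.8, 14.3), (9.1, 25.9, 15.0)),
}

gains = {}
for method, (baseline, with_pooling) in table2.items():
    # Absolute gain in percentage points, rounded to one decimal place.
    gains[method] = tuple(round(b - a, 1) for a, b in zip(baseline, with_pooling))
    print(method, gains[method])
```

Every entry is positive, which matches the statement that both basic methods improve on all three reported metrics.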

Based on the same inventive concept, the embodiment of the disclosure also provides an image processing device. Referring to fig. 3, the image processing apparatus 300 includes:

an obtaining module 301, configured to obtain a corresponding target image feature in a target image to be processed;

a first determining module 302, configured to determine a position vector having a length consistent with the target image feature length, where the position vector includes a plurality of position numbers arranged in sequence, and the position numbers correspond to the target image features one to one;

a converting module 303, configured to convert the position vectors into two-dimensional position-coding vectors, where each of the position-coding vectors is different, and an arrangement order of the position-coding vectors is consistent with an arrangement order of the corresponding position vectors;

a second determining module 304, configured to determine pooling coefficients of the target image features according to the position-coding vector and a sequence model;

and a dot multiplication module 305, configured to perform dot multiplication on the pooling coefficient and the target image feature to obtain an image pooling result of the target image.

Optionally, the obtaining module 301 is configured to:

acquiring target image characteristics corresponding to at least one channel in a target image to be processed;

the conversion module 303 is configured to:

determining, for each position number in the position vector, a one-dimensional vector that corresponds to the position number and whose dimension is consistent with the number of channels of the target image feature, so as to obtain a two-dimensional position coding vector corresponding to the position vector.

Optionally, the conversion module 303 is configured to:

and determining a one-dimensional vector corresponding to the position number and having dimensionality consistent with the channel number of the target image characteristic according to a position coding function, wherein the position coding function is used for determining the element value at the odd position in the one-dimensional vector through first conversion calculation and determining the element value at the even position in the one-dimensional vector through second conversion calculation.
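One possible reading of such a position coding function is sketched below. The text here only names a first and a second conversion calculation for the odd and even positions, so the use of sine and cosine, the base 10000, and the pairing of frequencies (all borrowed from the sinusoidal encoding commonly used with Transformers) are assumptions, as are the function names.

```python
import math


def position_code(k, d, base=10000.0):
    """Return a d-dimensional code for position number k (1-based)."""
    code = []
    for i in range(d):        # i is the 0-based element index
        pair = i // 2         # two neighbouring elements share one frequency
        angle = k / (base ** (2 * pair / d))
        if i % 2 == 0:
            # 1st, 3rd, ... element: first conversion calculation (assumed sine).
            code.append(math.sin(angle))
        else:
            # 2nd, 4th, ... element: second conversion calculation (assumed cosine).
            code.append(math.cos(angle))
    return code


def encode(n, d):
    """Two-dimensional (n x d) position coding for the position vector (1, ..., n)."""
    return [position_code(k, d) for k in range(1, n + 1)]


coding = encode(5, 4)
for row in coding:
    print([round(v, 3) for v in row])
```

Each position number receives a different row and the rows keep the order of the position numbers, so no trained parameters are needed for this variant.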

Optionally, the conversion module 303 is configured to:

determining a one-dimensional vector corresponding to the position number and having dimensions consistent with the number of channels of the target image feature according to the pre-training parameters of the image processing model, wherein the pre-training parameters of the image processing model are obtained by training in the following way:

determining a sample vector aiming at each position number in a sample position vector with the length consistent with the length of the sample image characteristic to obtain a sample position coding vector, wherein the sample vector is a sample one-dimensional vector with the dimension same as the channel number of the sample image characteristic;

pooling the sample image features according to the sample position coding vector, and determining an image processing result according to the pooled sample image features;

and adjusting parameters of the image processing model according to the image processing result and a pre-labeled sample image processing result in the sample image corresponding to the sample image characteristic.

Optionally, the obtaining module 301 is configured to:

acquiring all image characteristics corresponding to a target image to be processed;

and randomly discarding at least one image feature in all the image features to obtain a target image feature corresponding to the target image.
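Random discarding of image features can be sketched as follows. The drop probability, the function name `drop_features`, and the rule that at least one feature is always kept are assumptions added for illustration; the text only requires that at least one feature be discarded.

```python
import random


def drop_features(all_features, drop_prob=0.2, rng=None):
    """Randomly discard at least one feature and keep at least one."""
    if rng is None:
        rng = random.Random()
    n = len(all_features)
    if n < 2:
        raise ValueError("need at least two features to discard one and keep one")
    kept = [f for f in all_features if rng.random() >= drop_prob]
    if len(kept) == n:
        # Nothing was dropped by chance: remove one feature at random.
        kept.pop(rng.randrange(n))
    elif not kept:
        # Everything was dropped by chance: keep one feature at random.
        kept.append(rng.choice(all_features))
    return kept


all_features = [[float(i), float(i) + 0.5] for i in range(6)]  # 6 toy features
target_features = drop_features(all_features, rng=random.Random(0))
print(len(all_features), len(target_features))
```

The length of the result varies from call to call, which is the situation the length-adaptive pooling coefficients are meant to handle.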

Optionally, the second determining module 304 is configured to:

inputting the position coding vector into a sequence model to obtain a sequence processing result;

and normalizing the sequence processing result to obtain the pooling coefficient of the target image characteristic.

Optionally, the sequence model comprises a bidirectional gated recurrent neural network model.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the same inventive concept, the disclosed embodiments also provide a computer readable medium, on which a computer program is stored, which when executed by a processing apparatus, implements the steps of any of the image processing methods described above.

Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of any of the image processing methods described above.

Referring now to FIG. 4, a block diagram of an electronic device 400 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 4, the electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 illustrates an electronic device 400 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 401.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and the devices may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring corresponding target image characteristics in a target image to be processed; determining a position vector with the length consistent with the target image characteristic length, wherein the position vector comprises a plurality of position numbers which are sequentially arranged, and the position numbers correspond to the target image characteristics one by one; converting the position vectors into two-dimensional position coding vectors, wherein each position coding vector is different, and the arrangement sequence of the position coding vectors is consistent with the arrangement sequence of the corresponding position vectors; determining pooling coefficients of the target image features according to the position coding vectors and a sequence model; and performing point multiplication on the pooling coefficient and the target image characteristics to obtain an image pooling result of the target image.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Example 1 provides, in accordance with one or more embodiments of the present disclosure, an image processing method, the method including:

Example 2 provides the method of example 1, wherein the obtaining of the target image feature corresponding to the target image to be processed includes:

the converting the position vector into a two-dimensional position-coding vector comprises:

Example 3 provides the method of example 2, wherein determining a one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature includes:

Example 4 provides the method of example 2, which is packaged in an image processing model, and the determining a one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature includes:

Example 5 provides the method of any one of examples 1 to 4, wherein the obtaining of the target image feature corresponding to the target image to be processed includes:

Example 6 provides the method of any one of examples 1-4, the determining pooling coefficients for the target image feature according to the position-coding vector and a sequence model, comprising:

Example 7 provides the method of any one of examples 1-4, the sequence model comprising a bi-directional gated recurrent neural network model, according to one or more embodiments of the present disclosure.

Example 8 provides an image processing apparatus according to one or more embodiments of the present disclosure, the apparatus including:

Example 9 provides the apparatus of example 8, the acquisition module to:

the conversion module 303 is configured to:

Example 10 provides the apparatus of example 9, the conversion module to:

Example 11 provides the apparatus of example 9, the conversion module to:

Example 12 provides the apparatus of any one of examples 8-11, the acquisition module to:

Example 13 provides the apparatus of any one of examples 8-11, the second determination module to:

Example 14 provides the apparatus of any one of examples 8-11, the sequence model comprising a bi-directional gated recurrent neural network model, according to one or more embodiments of the present disclosure.

Example 15 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-7, in accordance with one or more embodiments of the present disclosure.

Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-7.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. An image processing method, characterized in that the method comprises:

2. The method according to claim 1, wherein the obtaining of the target image feature corresponding to the target image to be processed comprises:

3. The method according to claim 2, wherein the determining a one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature comprises:

4. The method according to claim 2, wherein the method is encapsulated in an image processing model, and the determining a one-dimensional vector corresponding to the position number and having a dimension consistent with the number of channels of the target image feature comprises:

5. The method according to any one of claims 1 to 4, wherein the obtaining of the target image feature corresponding to the target image to be processed comprises:

6. The method according to any of claims 1-4, wherein said determining pooling coefficients for said target image feature based on said position-coding vector and a sequence model comprises:

7. The method of any one of claims 1-4, wherein the sequence model comprises a bi-directional gated recurrent neural network model.

8. An image processing apparatus, characterized in that the apparatus comprises:

9. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 7.

10. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 7.