CN111915542A - Image content description method and system based on deep learning

Image content description method and system based on deep learning

Info

Publication number
CN111915542A
CN111915542A
Authority
CN
China
Prior art keywords
image
features
description
feature
spatial domain
Prior art date
Legal status
Withdrawn
Application number
CN202010767475.9A
Other languages
Chinese (zh)
Inventor
汪礼君
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202010767475.9A
Publication of CN111915542A
Legal status: Withdrawn

Classifications

    • G06T 5/40 - Image enhancement or restoration by the use of histogram techniques
    • G06N 3/045 - Neural networks; Combinations of networks
    • G06N 3/08 - Neural networks; Learning methods
    • G06T 3/4038 - Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 7/40 - Image analysis; Analysis of texture
    • G06T 7/90 - Image analysis; Determination of colour characteristics
    • G06T 2207/20081 - Special algorithmic details; Training; Learning
    • G06T 2207/20084 - Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221 - Image combination; Image fusion; Image merging

Abstract

The invention relates to the technical field of image description, and discloses an image content description method based on deep learning, which comprises the following steps: carrying out binarization processing on an image to be described by using a threshold-based binarization method, and refining the outline of the binarized image by using a non-maximum signal suppression method; extracting spatial domain features in the image contour region; calculating the weight of different features in the spatial domain features, and fusing the spatial domain features according to the given feature weight; performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics by using a multi-direction wavelet transformation method to obtain low-frequency sub-bands of the image; inputting the low-frequency sub-band into a pre-constructed self-adaptation Net model, and extracting image description characteristics; and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text. The invention also provides an image content description system based on deep learning. The invention realizes the description of the image content.

Description

Image content description method and system based on deep learning
Technical Field
The invention relates to the technical field of image description, in particular to an image content description method and system based on deep learning.
Background
With the popularization of intelligent terminal devices and the explosive growth of multimedia applications, the generation and accumulation of the corresponding data increase day by day, and how to better utilize and process these data has become a general concern. Images and text are the most common forms of data in daily life and a major component of internet data, so related research increasingly centers on image and text data.
Image description combines computer vision with natural language processing, with the goal of enabling a computer to recognize image content and automatically generate natural language text describing it; it can be viewed as a translation process from image to text. Unlike image recognition, the text generated by image description reflects the image information more fully. Ideally, the descriptive text not only contains all target entities in the image, but also covers their features, positions and the actions between different entities, and may even perform scene inference from the image content and connect it with the background knowledge implied by the image.
For example, the GLA model combines global and local features of an image by utilizing Attention to describe a local target in the image more accurately, but the model cannot be trained end to end, so that each step in the model is independent, the result of each step affects the training result of the whole model, and meanwhile, the existing image description model cannot adaptively input the image size, so that the image size needs to be modified before image content description.
In view of this, how to accurately extract effective features of an image and improve an existing image description model to describe image content becomes a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides an image content description method based on deep learning, which is characterized in that color features and histogram features of an image are respectively extracted in a spatial domain, multi-scale complex frequency domain transformation is carried out on the image by using multi-direction wavelet transformation to obtain shape features and texture features of the image, and the description of the image content is realized by using an improved image description model.
In order to achieve the above object, the present invention provides an image content description method based on deep learning, including:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
Optionally, the threshold-based binarization method is as follows:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold, which is set to 110 by the present invention.
Optionally, the refining the contour of the binarized image based on the non-maximum signal suppression method includes:
calculating an angle value of each pixel point in the image, wherein a calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the outline in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and reserving the pixel only if the central pixel is larger than the two adjacent pixels.
Optionally, the extracting spatial domain features in the image contour region includes:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
Optionally, the calculating importance of different features in the spatial domain features and giving weights to the different spatial domain features based on the importance of the features includes:
randomly selecting a sample xi from the feature training sample set X, then finding k neighbor samples xj among the data samples of the same class as xi, and at the same time finding k neighbor samples xl among the data samples of classes different from xi;
Calculating weights of different features by using a feature weight calculation formula, wherein the features comprise color features and gradient histogram features of an image to be described, and the feature weight calculation formula is as follows:
W(A) = W(A) - Σ_{j=1..k} diff(A, xi, xj) / (m·k) + Σ_{C≠class(xi)} [ p(C) / (1 - p(class(xi))) ] · Σ_{l=1..k} diff(A, xi, xl) / (m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of the class that xi belongs to;
p(C) represents the probability of a class C different from that of xi;
a threshold of 0.01 is set: a feature is removed when its calculated weight is less than 0.01 and retained when its weight is greater than or equal to 0.01; finally, the retained features are re-concatenated in series according to their feature weights and fused to form a new feature set.
Optionally, the performing multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image by using a multi-direction wavelet transform method includes:
1) respectively establishing a multi-scale function φ_{i,j}(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2) with real part Re and imaginary part Im (the explicit function definitions are given as formula images in the original publication);
Wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions;
re represents a real part obtained by decomposing the image;
im represents an imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion characteristics f (n) of the image by using a multi-scale function and a complex wavelet function:
f(n) = Σ_{k∈Z} c_{j,k} · φ_{j,k}(n) + Σ_{i=1..8} Σ_{k∈Z} d^i_{j,k} · ψ^i_{j,k}(n)
wherein:
z represents a natural number set;
c_{j,k} represents the scale factor;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions, including 6 high frequency subbands and 2 low frequency subbands;
3) the imaginary parts obtained by the decomposition are taken as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands; the average of the two low-frequency sub-bands is taken as the low-frequency information, ensuring that low-frequency information is not lost and preventing the information loss that would be caused by selecting a single direction.
Optionally, the extracting image description features from the pre-constructed self-adaptation Net model includes:
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency subband of the image;
Wf is the mapping from the low-frequency subband to the convolutional layer input;
F is the image description feature extracted by the convolutional layers;
the pooling layer of the model is composed of three pooling structures of 1 ring, 2 rings and 4 rings, X multiplied by Y feature maps obtained by the convolutional layer are pooled, and output features are connected together to form a final image description feature;
the three pooling structures of the adaptive pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to three pooling structures of 1 ring, 2 rings and 4 rings, and each pooling structure adapts itself to the size of the input feature map without limiting the image size. The three pooling structures are respectively:
1. a central 7 × 7 pooling structure;
2. a central 4 × 4 pooling structure plus a ring pooling structure with outer ring 7 × 7 and inner ring 4 × 4;
3. a central 2 × 2 pooling structure, a ring pooling structure with outer ring 4 × 4 and inner ring 2 × 2, a ring pooling structure with outer ring 6 × 6 and inner ring 4 × 4, and a ring pooling structure with outer ring 7 × 7 and inner ring 6 × 6.
Optionally, the generating of the image description text by using the pre-constructed Conv-C network model includes:
the Conv-C network model adopts an encoding and decoding network structure, and an encoder of the Conv-C network model extracts image characteristics F to be used for guiding generation of description texts by using a deep convolution network in a self-adaptation Net model; in the description generation part, a Conv-C network model adopts multilayer convolution network learning to construct a language model, and integrates text data characteristics of different layers in the process of introducing image characteristics for a prediction process by adopting an attention mechanism image characteristic input method;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description characteristics, namely expanding the image description characteristics and adaptive mapping of a decoder network by n times to form n groups of key value pairs for an attention mechanism;
2) in the process of generating the description text, the description text to be generated is defined as Sn and initialized to S = {<pad>, ..., <pad>, <s>}, where <pad> and <s> are the blank padding character and the sentence start character respectively, and the S vector is vectorized through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) and obtaining the feature representation of the text to be generated by utilizing two layers of convolution:
[formula given as an image in the original publication]
wherein:
ConvA and ConvB respectively represent two different convolution operations, whose convolution kernel sizes are 3 × 3 and 5 × 5 respectively;
4) the obtained convolutional text feature is associated with the image description feature through an attention mechanism to obtain the adapted image feature Fatt:
[formula given as an image in the original publication]
Wherein:
f is the obtained image description characteristics;
5) the multiple features are integrated and one layer of convolution is completed to obtain the output of the convolutional layer:
[formula given as an image in the original publication]
6) the L layers of convolution operations are stacked to obtain the final feature expression Fl, which is finally mapped to the probability distribution of the next word, and the input vocabulary of the next moment is generated by sampling:
xt+1 = argmax Softmax(Wv·Fl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t+1;
t is the current time;
Fl is the feature expression obtained after stacking L layers of convolution;
finally, the obtained xt+1 is filled into the description text Sn to be generated, and the iteration continues until the generated image content description text is obtained.
In addition, to achieve the above object, the present invention also provides an image content description system based on deep learning, the system including:
image acquisition means for receiving an image to be described;
the image processor is used for carrying out binarization processing on the image by using a threshold-based binarization method and refining the outline of the binarized image by using a non-maximum signal suppression method; simultaneously extracting spatial domain features in the image contour region, and performing multi-scale complex frequency domain transformation processing on the spatial domain fusion features of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
the image content description device inputs the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, extracts image description characteristics, inputs the extracted image description characteristics into a pre-constructed Conv-C network model, and generates an image description text.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon image content description instructions, which are executable by one or more processors to implement the steps of the implementation method of image content description based on deep learning as described above.
Compared with the prior art, the invention provides an image content description method based on deep learning, and the technology has the following advantages:
Firstly, existing image description models cannot adaptively accept the input image size, so the image size has to be modified before image content description; meanwhile, the existing adaptive pooling layer is composed of three pooling structures of 1 × 1, 2 × 2 and 4 × 4, pools every input feature map of size X × Y, and connects the output features together to form (1+4+16) parameters. The invention designs a special adaptive pooling layer composed of three pooling structures of 1 ring, 2 rings and 4 rings: the input feature maps of size X × Y are pooled and the output features are connected together to form (1+2+4) parameters, so fewer parameters are output than with the existing pooling layer. Meanwhile, the three pooling structures of this pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to a central 7 × 7 pooling, a central 4 × 4 pooling plus a ring pooling with outer ring 7 × 7 and inner ring 4 × 4, and a central 2 × 2 pooling plus ring poolings with outer ring 4 × 4 and inner ring 2 × 2, outer ring 6 × 6 and inner ring 4 × 4, and outer ring 7 × 7 and inner ring 6 × 6, so that the pooling layer yields an output of the same scale regardless of the size of the input feature map.
Secondly, most of the information of an image is concentrated in the low frequencies, while the high-frequency information still retains a large amount of redundancy, and the color and shape features of an image vary gradually; however, the classical wavelet transform can only provide information in three directions (horizontal, vertical and diagonal) when decomposing an image and has difficulty adapting to continuously changing directions. The complex frequency domain transform of the invention decomposes the spatial fusion features with a multi-scale function and a complex wavelet function, takes the imaginary parts of the decomposition as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands, and provides 8 selectable directions comprising 6 high-frequency sub-bands and 2 low-frequency sub-bands, so that the redundant information carried is very limited. Meanwhile, to ensure that low-frequency information is not lost, the invention takes the average of the two low-frequency sub-bands as the low-frequency information, preventing the information loss caused by selecting a single direction.
Finally, the invention provides a Conv-C network model for generating the image description text. A language model is constructed with multi-layer convolutional network learning, and an attention-based image feature input method integrates text data features of different levels into the prediction process while the image features are introduced. The model first expands and maps the image description features: the image description features are expanded n times by the adaptive mapping of the decoder network to form n groups of key-value pairs for the attention mechanism; by decoding multiple groups of adapted image features and processing them with the attention mechanism, a degree of selection over the image features is achieved, so the image content information contained in the image feature vector is better utilized and can be conveniently aligned during text generation. Meanwhile, the model generates the image content description text through multiple convolutions; within one convolution operation, the elements inside the receptive field are associated with equal operation distances, so the operation distance between different positions in the sequence is reduced by a factor of k, where k is the convolution kernel size, which to some extent alleviates the imbalance of operations when associating sequence elements at different positions.
Drawings
Fig. 1 is a schematic flowchart of an image content description method based on deep learning according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an image content description system based on deep learning according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The color features and histogram features of the image are respectively extracted in a spatial domain, multi-scale complex frequency domain transformation is carried out on the image by using multi-direction wavelet transformation, and the description of the image content is realized by utilizing an improved image description model. Referring to fig. 1, a schematic diagram of an image content description method based on deep learning according to an embodiment of the present invention is provided.
In this embodiment, the image content description method based on deep learning includes:
and S1, acquiring an image to be described, carrying out binarization processing on the image by using a threshold-based binarization method, and refining the outline of the binarized image by using a non-maximum signal suppression method.
Firstly, the invention obtains an image to be described, and carries out binarization processing on the image by using a threshold-based binarization method, wherein the threshold-based binarization method comprises the following steps:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold value which is set as 110 by the invention;
further, the invention calculates the angle value of each pixel point in the image, and the calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the contour in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and only if the central pixel is larger than the two adjacent pixels, reserving the pixel, thereby realizing the thinning of the contour of the binary image by using a non-maximum signal suppression method.
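As an illustration of step S1, the following sketch (Python with NumPy; the helper names and the mapping from angle regions to neighbor offsets are assumptions, not taken from the patent) shows the threshold binarization and a non-maximum suppression pass that keeps a center pixel only when its gradient magnitude exceeds both of its neighbors in its quantized angle region.

```python
import numpy as np

def binarize(gray, T=110):
    """Threshold-based binarization: pixels with gray value >= T become 255, others 0."""
    return np.where(gray >= T, 255, 0).astype(np.uint8)

def non_max_suppress(gx, gy, magnitude):
    """Keep a contour pixel only if its gradient magnitude exceeds both
    neighbors along its quantized angle region (assumed neighbor mapping)."""
    h, w = magnitude.shape
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0          # angle value alpha in [0, 180)
    # assumed neighbor offsets for the four regions 0-45, 45-90, 90-135, 135-180 degrees
    offsets = {0: ((0, 1), (0, -1)),
               1: ((-1, 1), (1, -1)),
               2: ((-1, 0), (1, 0)),
               3: ((-1, -1), (1, 1))}
    out = np.zeros_like(magnitude)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            (dy1, dx1), (dy2, dx2) = offsets[min(int(angle[y, x] // 45), 3)]
            if (magnitude[y, x] > magnitude[y + dy1, x + dx1] and
                    magnitude[y, x] > magnitude[y + dy2, x + dx2]):
                out[y, x] = magnitude[y, x]                  # retained only if larger than both neighbors
    return out
```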
And S2, extracting the space domain features in the image contour region, wherein the space domain features comprise color features and gradient histogram features of the image.
Further, the invention extracts the spatial domain feature in the image contour region, and the extraction process of the spatial domain feature is as follows:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
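A minimal sketch of step S2, assuming NumPy and an H×W×3 image in the HIS color space; the 8-bin layout of the gradient histogram is an assumption chosen so that 36 windows × 8 bins give the 288-dimensional descriptor mentioned above.

```python
import numpy as np

def color_moments(img):
    """First three color moments per channel: mean (M1), variance-based M2 and skew-based M3."""
    feats = []
    for c in range(img.shape[2]):
        p = img[:, :, c].astype(np.float64).ravel()
        m1 = p.mean()
        m2 = np.sqrt(np.mean((p - m1) ** 2))
        m3 = np.cbrt(np.mean((p - m1) ** 3))
        feats.extend([m1, m2, m3])
    return np.array(feats)

def gradient_histogram(gray, step=8, windows=36, bins=8):
    """288-dimensional gradient-histogram descriptor sketch: 36 scan windows with an
    8-pixel step, each summarised by an (assumed) 8-bin orientation histogram."""
    g = gray.astype(np.float64)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]       # Gx(x,y) = H(x+1,y) - H(x-1,y)
    gy[1:-1, :] = g[2:, :] - g[:-2, :]       # Gy(x,y) = H(x,y+1) - H(x,y-1)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    descriptor = []
    for w in range(windows):
        x0 = (w * step) % max(g.shape[1] - step, 1)
        hist, _ = np.histogram(ang[:, x0:x0 + step], bins=bins,
                               range=(0, 180), weights=mag[:, x0:x0 + step])
        descriptor.extend(hist)
    return np.array(descriptor)              # 36 windows x 8 bins = 288 dimensions
```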
And S3, calculating the importance of different features in the spatial domain features, giving the weights of the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given feature weights.
Further, the color features and gradient histogram features of the obtained image are concatenated in series to obtain a feature training sample set X = {xi, i = 1, ..., N} of the image to be described, where xi = {A1, ..., Ax} is the spatial domain concatenated feature extracted from the i-th image to be described;
further, the invention randomly selects a sample X from a feature training sample set XiThen from and xiFinding k adjacent samples x from data samples of the same typejSimultaneously from and xiFinding k neighbor samples x from heterogeneous data samplesl
Calculating weights of different features by using a feature weight calculation formula, wherein the features comprise color features and gradient histogram features of an image to be described, and the feature weight calculation formula is as follows:
W(A) = W(A) - Σ_{j=1..k} diff(A, xi, xj) / (m·k) + Σ_{C≠class(xi)} [ p(C) / (1 - p(class(xi))) ] · Σ_{l=1..k} diff(A, xi, xl) / (m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of the class that xi belongs to;
p(C) represents the probability of a class C different from that of xi;
a threshold of 0.01 is set: a feature is removed when its calculated weight is less than 0.01 and retained when its weight is greater than or equal to 0.01; finally, the retained features are re-concatenated in series according to their feature weights and fused to form a new feature set.
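The weighting in step S3 matches the classical ReliefF scheme, so the sketch below follows that algorithm (NumPy assumed; `relieff_weights` and `fuse` are hypothetical helper names, and the L1 nearest-neighbor search is a simplification).

```python
import numpy as np

def relieff_weights(X, y, m=50, k=5, rng=None):
    """ReliefF-style feature weights: near-hits of the same class lower a
    feature's weight, near-misses of other classes raise it."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12       # normalises diff(A, ., .)
    priors = {c: float(np.mean(y == c)) for c in np.unique(y)}
    w = np.zeros(d)
    for _ in range(m):
        i = int(rng.integers(n))
        order = np.argsort(np.abs(X - X[i]).sum(axis=1))
        hits = [j for j in order if y[j] == y[i] and j != i][:k]
        w -= np.abs(X[hits] - X[i]).sum(axis=0) / (span * m * k)
        for c, p in priors.items():
            if c == y[i]:
                continue
            misses = [j for j in order if y[j] == c][:k]
            w += (p / (1 - priors[y[i]])) * np.abs(X[misses] - X[i]).sum(axis=0) / (span * m * k)
    return w

def fuse(features, weights, threshold=0.01):
    """Drop features whose weight is below the 0.01 threshold and re-weight the rest."""
    keep = weights >= threshold
    return features[:, keep] * weights[keep]
```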
And S4, performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image.
Furthermore, the invention uses a multidirectional wavelet transform method to perform multi-scale complex frequency domain transform processing on the spatial domain fusion characteristics of the image, and the process of the multi-scale complex frequency domain transform processing is as follows:
1) respectively establishing a multi-scale function φ_{i,j}(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2) with real part Re and imaginary part Im (the explicit function definitions are given as formula images in the original publication);
Wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions;
re represents a real part obtained by decomposing the image;
im represents an imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion characteristics f (n) of the image by using a multi-scale function and a complex wavelet function:
f(n) = Σ_{k∈Z} c_{j,k} · φ_{j,k}(n) + Σ_{i=1..8} Σ_{k∈Z} d^i_{j,k} · ψ^i_{j,k}(n)
wherein:
z represents a natural number set;
c_{j,k} represents the scale factor;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions, including 6 high frequency subbands and 2 low frequency subbands;
3) the imaginary parts obtained by the decomposition are taken as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands; the average of the two low-frequency sub-bands is taken as the low-frequency information, ensuring that low-frequency information is not lost and preventing the information loss that would be caused by selecting a single direction.
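A small sketch of the sub-band bookkeeping in step S4. The multi-direction complex wavelet decomposition itself is not reproduced; `complex_subbands` is assumed to be its output, a list of 8 complex arrays with the 6 high-frequency directions first and the 2 low-frequency ones last.

```python
import numpy as np

def split_subbands(complex_subbands):
    """Sub-band handling for step S4: imaginary parts of the 6 directional
    sub-bands become the high-frequency sub-bands, real parts the low-frequency
    ones, and the two low-frequency sub-bands are averaged so that no
    low-frequency information is lost."""
    assert len(complex_subbands) == 8
    high = [np.imag(b) for b in complex_subbands[:6]]       # 6 high-frequency sub-bands
    low_pair = [np.real(b) for b in complex_subbands[6:]]   # 2 low-frequency sub-bands
    low = (low_pair[0] + low_pair[1]) / 2.0                 # averaged low-frequency information
    return high, low
```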
And S5, inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description features.
Further, inputting a low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics of the image to be described;
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency subband of the image;
Wf is the mapping from the low-frequency subband to the convolutional layer input;
F is the image description feature extracted by the convolutional layers;
compared with the existing model, the pooling layer of the model is composed of three pooling structures of 1 ring, 2 rings and 4 rings, X multiplied by Y feature maps obtained by the convolutional layer are pooled, and output features are connected together to form a final image description feature;
the three pooling structures of the adaptive pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to three pooling structures of 1 ring, 2 rings and 4 rings, and each pooling structure adapts itself to the size of the input feature map without limiting the image size. The three pooling structures are respectively:
1. a central 7 × 7 pooling structure;
2. a central 4 × 4 pooling structure plus a ring pooling structure with outer ring 7 × 7 and inner ring 4 × 4;
3. a central 2 × 2 pooling structure, a ring pooling structure with outer ring 4 × 4 and inner ring 2 × 2, a ring pooling structure with outer ring 6 × 6 and inner ring 4 × 4, and a ring pooling structure with outer ring 7 × 7 and inner ring 6 × 6.
For a feature map of the input adaptive pooling layer size of 20 × 20, the three pooling structures are:
1. a central 20 × 20 pooling structure;
2. a central 10 × 10 pooling structure plus a ring pooling structure with outer ring 20 × 20 and inner ring 10 × 10;
3. a central 5 × 5 pooling structure, a ring pooling structure with outer ring 10 × 10 and inner ring 5 × 5, a ring pooling structure with outer ring 15 × 15 and inner ring 10 × 10, and a ring pooling structure with outer ring 20 × 20 and inner ring 15 × 15.
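The ring pooling of the self-adaptation Net's adaptive pooling layer could be sketched as below (NumPy assumed; max pooling over each ring is an assumption, as the patent does not state which pooling operation is used, and the ring regions are taken as centred squares).

```python
import numpy as np

def ring_pool(feature_map, rings):
    """Pool over centred square rings; `rings` is a list of (outer, inner) sizes,
    inner == 0 meaning a solid central square. Max pooling is assumed."""
    h, w = feature_map.shape
    cy, cx = h // 2, w // 2
    pooled = []
    for outer, inner in rings:
        mask = np.zeros((h, w), dtype=bool)
        oy, ox = outer // 2, outer // 2
        mask[cy - oy:cy - oy + outer, cx - ox:cx - ox + outer] = True
        if inner > 0:
            iy = inner // 2
            mask[cy - iy:cy - iy + inner, cx - iy:cx - iy + inner] = False   # hollow out the inner square
        pooled.append(feature_map[mask].max())
    return np.array(pooled)

# The three structures described above for a 7 x 7 input feature map:
fm = np.random.rand(7, 7)
p1 = ring_pool(fm, [(7, 0)])                           # 1-ring structure
p2 = ring_pool(fm, [(4, 0), (7, 4)])                   # 2-ring structure
p3 = ring_pool(fm, [(2, 0), (4, 2), (6, 4), (7, 6)])   # 4-ring structure
descriptor = np.concatenate([p1, p2, p3])              # 1 + 2 + 4 = 7 outputs
```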
And S6, inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
Further, the extracted image description features are input into a pre-constructed Conv-C network model to generate an image description text;
the Conv-C network model adopts an encoding and decoding network structure, and an encoder of the Conv-C network model extracts image characteristics F to be used for guiding generation of description texts by using a deep convolution network in a self-adaptation Net model; in the description generation part, a Conv-C network model adopts multilayer convolution network learning to construct a language model, and integrates text data characteristics of different layers in the process of introducing image characteristics for a prediction process by adopting an attention mechanism image characteristic input method;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description characteristics, namely expanding the image description characteristics and adaptive mapping of a decoder network by n times to form n groups of key value pairs for an attention mechanism;
meanwhile, the image features are selected to a certain extent by carrying out multi-group adaptive decoding on the image features and processing by combining an attention mechanism, so that image content information contained in image feature vectors is better utilized, and the text generation process can be conveniently aligned;
2) in the process of generating the description text, the description text to be generated is defined as Sn and initialized to S = {<pad>, ..., <pad>, <s>}, where <pad> and <s> are the blank padding character and the sentence start character respectively, and the S vector is vectorized through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) and obtaining the feature representation of the text to be generated by utilizing two layers of convolution:
[formula given as an image in the original publication]
wherein:
ConvA and ConvB respectively represent two different convolution operations, whose convolution kernel sizes are 3 × 3 and 5 × 5 respectively;
4) the obtained convolutional text feature is associated with the image description feature through an attention mechanism to obtain the adapted image feature Fatt:
[formula given as an image in the original publication]
Wherein:
f is the obtained image description characteristics;
5) the multiple features are integrated and one layer of convolution is completed to obtain the output of the convolutional layer:
[formula given as an image in the original publication]
6) the L layers of convolution operations are stacked to obtain the final feature expression Fl, which is finally mapped to the probability distribution of the next word, and the input vocabulary of the next moment is generated by sampling:
xt+1=argmax Softmax(WvFl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t+1;
t is the current time;
Fl is the feature expression obtained after stacking L layers of convolution;
finally, the invention fills the obtained xt+1 into the description text Sn to be generated and iterates continuously until the generated image content description text is obtained.
The following describes embodiments of the invention through an algorithm experiment and tests of the processing method of the invention. The hardware test environment of the algorithm is deployed on the TensorFlow deep learning framework; the processor is an 8-core Intel(R) Core(TM) i5-7700 CPU, the graphics card is a GeForce GTX1040 with 8 GB of video memory, the development environment is Python 3.5, and the development tool is the Anaconda scientific computing library; the comparison algorithm models are the ResNet-F5 model, the VGG-F3 model and the CNN model.
In the algorithm experiment, the image description dataset released by MS COCO in 2014 is adopted; it contains 80,000 groups of training samples and 40,000 groups of validation samples in total, and each group of samples contains one image and 5 corresponding manually annotated English description sentences. In the experiment, the 80,000 groups of training data are used as the training set for training the model, and 500 samples are randomly selected from the validation set as the test set for evaluating the effect of the model.
In the aspect of data set processing, the original image size is kept in the input process for image data, and the image data is input into an image coding network after being normalized and standardized. For text data, a dictionary is constructed based on description texts in the entire training set, and words in which the occurrence frequency is greater than or equal to 5 are retained as a training dictionary, and the dictionary dimension used in the experiment is about 14500. During the process the words not contained in the dictionary are replaced with < UNK >, the text sequence is filled up to a fixed length with < PAD >, set to 25 in the experiment, resulting in the final training set. The image data in the data set are respectively input into the training model, the generated image content description text is compared with the description text in the training set, and the comparison result is the image content description accuracy of the algorithm model.
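A minimal sketch of this text preprocessing under the stated settings (minimum word frequency 5, fixed length 25); the function names are hypothetical.

```python
from collections import Counter

def build_vocab(captions, min_freq=5):
    """Keep words that occur at least `min_freq` times in the training captions."""
    counts = Counter(w for caption in captions for w in caption.split())
    return {w for w, c in counts.items() if c >= min_freq}

def encode(caption, vocab, max_len=25):
    """Replace out-of-vocabulary words with <UNK> and pad the sequence to a fixed length."""
    words = [w if w in vocab else "<UNK>" for w in caption.split()]
    words = words[:max_len]
    return words + ["<PAD>"] * (max_len - len(words))
```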
According to the experimental results, the image content description accuracy of the ResNet-F5 model is 81.93%, that of the VGG-F3 model is 76.25%, that of the CNN model is 86.78%, and that of the proposed algorithm is 89.21%; compared with the comparison algorithms, the image content description method based on deep learning provided by the invention achieves higher accuracy in describing image content.
The invention also provides an image content description system based on deep learning. Referring to fig. 2, a schematic diagram of an internal structure of an image content description system based on deep learning according to an embodiment of the present invention is provided.
In the present embodiment, the image content description system 1 based on deep learning includes at least an image acquisition device 11, an image processor 12, an image content description device 13, a communication bus 14, and a network interface 15.
The image capturing device 11 may be a PC (Personal Computer), a terminal device such as a smart phone, a tablet Computer, or a mobile Computer, or may be a server.
Image processor 12 includes at least one type of readable storage medium including flash memory, a hard disk, a multi-media card, a card-type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The image processor 12 may in some embodiments be an internal storage unit of the deep learning based image content description system 1, for example a hard disk of the deep learning based image content description system 1. The image processor 12 may also be an external storage device of the deep learning based image content description system 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the deep learning based image content description system 1. Further, the image processor 12 may also include both an internal storage unit and an external storage device of the depth learning based image content description system 1. The image processor 12 can be used not only to store application software installed in the deep learning-based image content description system 1 and various kinds of data, but also to temporarily store data that has been output or is to be output.
The image content description device 13 may be, in some embodiments, a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip for executing program codes stored in the image processor 12 or Processing data, such as image content description program instructions.
The communication bus 14 is used to enable connection communication between these components.
The network interface 15 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the system 1 and other electronic devices.
Optionally, the system 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the deep learning based image content description system 1 and for displaying a visualized user interface.
Fig. 2 only shows the image content description system 1 with the components 11-15 and based on deep learning, it will be understood by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the image content description system 1 based on deep learning, and may comprise fewer or more components than shown, or combine certain components, or a different arrangement of components.
In the embodiment of apparatus 1 shown in fig. 2, image processor 12 has stored therein program instructions describing the content of images based on deep learning; the steps of the image content description device 13 executing the image content description program instructions stored in the image processor 12 are the same as the implementation method of the image content description method based on the deep learning, and are not described here.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium having stored thereon image content description program instructions executable by one or more processors to implement the following operations:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An image content description method based on deep learning, characterized in that the method comprises:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
2. The image content description method based on deep learning as claimed in claim 1, wherein the binarization method based on threshold value is:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold, which is set to 110 by the present invention.
3. The image content description method based on deep learning of claim 2, wherein the refining of the contour of the binarized image based on the non-maximum signal suppression method comprises:
calculating an angle value of each pixel point in the image, wherein a calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the outline in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and reserving the pixel only if the central pixel is larger than the two adjacent pixels.
4. The image content description method based on deep learning of claim 3, wherein the extracting the spatial domain features in the image contour region comprises:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
5. The method as claimed in claim 4, wherein the calculating the importance of different features in the spatial domain features and giving weights to the different spatial domain features based on the importance of the features comprises:
randomly selecting a sample xi from the feature training sample set X, then finding k nearest-neighbor samples xj from the data samples of the same class as xi, and simultaneously finding k nearest-neighbor samples xl from the data samples of classes different from xi;
calculating the weights of the different features by using a feature weight calculation formula, wherein the features comprise the color features and the gradient histogram features of the image to be described, and the feature weight calculation formula is:
W(A) = W(A) - Σ(j=1..k) diff(A, xi, xj)/(m·k) + Σ(C ≠ class(xi)) [p(C)/(1 - p(class(xi)))] Σ(l=1..k) diff(A, xi, xl)/(m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected nearest-neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of being in the same class as xi;
p(C) represents the probability of a class different from that of xi;
setting the threshold to 0.01: when the calculated weight of a feature is less than 0.01 the feature is removed, and when its weight is greater than or equal to 0.01 it is retained; finally, the retained features are re-concatenated and fused in series according to their feature weights to form a new feature set.
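For illustration, a ReliefF-style weighting sketch for claim 5, assuming NumPy, numeric features scaled to [0, 1], and a Manhattan distance for the neighbor search; the sampling details are assumptions:

```python
import numpy as np

def relieff_weights(X: np.ndarray, y: np.ndarray, m: int = 50, k: int = 5) -> np.ndarray:
    """ReliefF-style weights: reward features that differ across classes and agree within a class."""
    n, d = X.shape
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    rng = np.random.default_rng(0)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)             # Manhattan distance to every sample
        dist[i] = np.inf                                 # exclude the sample itself
        same = np.where(y == y[i])[0]
        hits = same[np.argsort(dist[same])][:k]          # k nearest same-class neighbors
        w -= np.abs(X[hits] - X[i]).mean(axis=0) / m
        for c in classes:
            if c == y[i]:
                continue
            other = np.where(y == c)[0]
            misses = other[np.argsort(dist[other])][:k]  # k nearest neighbors of class c
            w += prior[c] / (1 - prior[y[i]]) * np.abs(X[misses] - X[i]).mean(axis=0) / m
    return w
```

Features whose resulting weight falls below the 0.01 threshold would then be dropped before the remaining features are re-concatenated by weight.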
6. The image content description method based on deep learning of claim 5, wherein performing the multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image by the multi-direction wavelet transform method comprises:
1) respectively establishing a multi-scale function φi,j(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2), each consisting of a real part Re and an imaginary part Im [defining formulas given as equation images FDA0002615235670000033 to FDA0002615235670000035 in the original];
wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i indexes the frequency subbands over the 8 directions;
Re represents the real part obtained by decomposing the image;
Im represents the imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion features f(n) of the image by using the multi-scale function and the complex wavelet function:
f(n) = Σ(k∈Z) cj,k φj,k(n) + Σ(i) Σ(j,k∈Z) d^i_{j,k} ψ^i_{j,k}(n)
wherein:
Z represents the natural number set;
cj,k represents the scale coefficient;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i indexes the frequency subbands over the 8 directions, which include 6 high-frequency subbands and 2 low-frequency subbands;
3) taking the imaginary parts obtained from the decomposition as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands of the image, and taking the average of the two low-frequency sub-bands as the low-frequency information, so that the low-frequency information is not lost and the information loss caused by selecting only a single direction is prevented.
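The multi-direction complex wavelet transform of claim 6 is specific to the invention; as a rough stand-in for experimentation, the sketch below splits a feature map into low-frequency and high-frequency sub-bands with a standard single-level 2-D DWT from PyWavelets (the real-valued 'db2' wavelet is an assumed simplification, not the claimed transform):

```python
import numpy as np
import pywt

def frequency_subbands(feature_map: np.ndarray, wavelet: str = "db2"):
    """Single-level 2-D DWT: approximation (low-frequency) and detail (high-frequency) subbands."""
    low, (horiz, vert, diag) = pywt.dwt2(feature_map, wavelet)
    high = [horiz, vert, diag]          # directional high-frequency subbands
    return low, high
```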
7. The image content description method based on deep learning of claim 6, wherein extracting image description features by using the pre-constructed self-adaptation Net model comprises:
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency sub-band of the image;
Wf is the mapping applied by the convolution layers to the low-frequency sub-band input;
F is the image description feature extracted by the convolution layers;
the pooling layer of the model is composed of three pooling structures with 1 ring, 2 rings and 4 rings; the X × Y feature maps obtained from the convolutional layers are pooled, and the output features are concatenated to form the final image description feature;
the three pooling structures of the adaptive pooling layer are not of fixed size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalently formed by the three pooling structures of 1 ring, 2 rings and 4 rings, which are respectively:
1. a pooling structure of size 7 × 7 at the center;
2. a pooling structure of size 4 × 4 at the center, plus an annular pooling structure with a 7 × 7 outer ring and a 4 × 4 inner ring;
3. a pooling structure of size 2 × 2 at the center, plus annular pooling structures with a 4 × 4 outer ring and a 2 × 2 inner ring, a 6 × 6 outer ring and a 4 × 4 inner ring, and a 7 × 7 outer ring and a 6 × 6 inner ring.
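For illustration, a sketch of the ring-style adaptive pooling of claim 7, assuming average pooling within each region, a square single-channel feature map, and a rounding rule for intermediate ring boundaries that only approximates the 7 × 7 example above:

```python
import numpy as np

def ring_pool(fmap: np.ndarray, rings=(1, 2, 4)) -> np.ndarray:
    """Average-pool concentric square regions of a square feature map and concatenate the results."""
    side = fmap.shape[0]
    feats = []
    for n_rings in rings:
        # side lengths of the nested squares, from the centre block out to the full map
        sizes = [round(side * (r + 1) / n_rings) for r in range(n_rings)]
        prev = 0
        for s in sizes:
            off = (side - s) // 2
            region = fmap[off:off + s, off:off + s].astype(float)
            if prev > 0:  # blank out the inner square already covered by the previous region
                inner = (s - prev) // 2
                region[inner:inner + prev, inner:inner + prev] = np.nan
            feats.append(np.nanmean(region))
            prev = s
    return np.array(feats)
```

In the full model, this pooling would be applied to each of the X × Y feature maps and the pooled values concatenated into the final image description feature.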
8. The image content description method based on deep learning of claim 7, wherein the generation of the image description text by using the pre-constructed Conv-C network model comprises:
the Conv-C network model adopts an encoder-decoder network structure; its encoder uses the deep convolutional network of the self-adaptation Net model to extract the image feature F used to guide the generation of the description text; in the description generation part, the Conv-C network model builds a language model with a multi-layer convolutional network and, by feeding the image feature in through an attention mechanism, integrates text features from different layers into the prediction process;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description features: the image description features are expanded n times and adaptively mapped to the decoder network to form n groups of key-value pairs for the attention mechanism;
2) in the process of generating the description text, defining the description text to be generated as Sn and initializing it as S = [<pad>, <pad>, <s>], where <pad> and <s> are the padding character and the sentence start character, respectively, and vectorizing S through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) obtaining the feature representation of the text to be generated by using two layers of convolution:
[formula given as equation image FDA0002615235670000051 in the original]
wherein:
ConvA and ConvB represent two different convolution operations, with convolution kernel sizes of 3 × 3 and 5 × 5, respectively;
4) correlating the obtained convolutional text features with the image description features through the attention mechanism to obtain the attended image feature Fatt:
[formula given as equation image FDA0002615235670000052 in the original]
wherein:
F is the obtained image description feature;
5) integrating the multiple features and completing one layer of convolution operation to obtain the output of the convolution layer:
[formula given as equation image FDA0002615235670000053 in the original]
6) superposing L layers of convolution operations to obtain the final feature expression Fl, mapping it to the probability distribution of the next word, and generating the input word for the next time step by sampling:
xt+1=argmax Softmax(WvFl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t + 1;
t is the current time;
Fl is the feature expression obtained after superposing L layers of convolution;
finally, filling the obtained xt+1 into the description text Sn to be generated, and obtaining the generated image content description text through continuous iteration.
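For illustration, a simplified greedy decoding loop in the spirit of claim 8, assuming PyTorch; TinyConvDecoder is a hypothetical stand-in (embedding, two convolutions, a crude additive substitute for the attention step, vocabulary projection), not the patented Conv-C architecture:

```python
import torch
import torch.nn as nn

class TinyConvDecoder(nn.Module):
    """Hypothetical stand-in decoder: embedding, two convolutions, vocabulary projection."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # word embedding mapping layer (We)
        self.conv_a = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.conv_b = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.out = nn.Linear(dim, vocab_size)               # maps features to dictionary words (Wv)

    def forward(self, tokens: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        v = self.embed(tokens).transpose(1, 2)              # Vs = We S, shape (batch, dim, length)
        h = self.conv_b(self.conv_a(v))                     # two-layer convolutional text features
        h = h + image_feat.unsqueeze(-1)                    # crude substitute for the attention step
        return self.out(h[:, :, -1])                        # logits for the next word

def greedy_decode(model, image_feat, start_id, pad_id, end_id, max_len=20):
    """xt+1 = argmax Softmax(Wv Fl), appended to S until the end token or length limit is reached."""
    tokens = torch.tensor([[pad_id, pad_id, start_id]])     # S = [<pad>, <pad>, <s>]
    for _ in range(max_len):
        logits = model(tokens, image_feat)
        next_id = logits.softmax(dim=-1).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == end_id:
            break
    return tokens
```

In the claimed model, the image feature would instead enter through the attention mechanism over the n key-value groups described in step 1).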
9. An image content description system based on deep learning, characterized in that the system comprises:
image acquisition means for receiving an image to be described;
an image processor for performing binarization processing on the image with a threshold-based binarization method and refining the contour of the binarized image with a non-maximum signal suppression method, and for extracting the spatial domain features in the image contour region and performing multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image with the multi-direction wavelet transform method to obtain the high-frequency sub-bands and the low-frequency sub-band of the image;
an image content description device for inputting the low-frequency sub-band of the image into the pre-constructed self-adaptation Net model to extract image description features, and inputting the extracted image description features into the pre-constructed Conv-C network model to generate the image description text.
10. A computer-readable storage medium having stored thereon image content description program instructions executable by one or more processors to implement the steps of the deep learning-based image content description method as claimed in any one of claims 1 to 8.
CN202010767475.9A 2020-08-03 2020-08-03 Image content description method and system based on deep learning Withdrawn CN111915542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767475.9A CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010767475.9A CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN111915542A true CN111915542A (en) 2020-11-10

Family

ID=73287061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010767475.9A Withdrawn CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN111915542A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926457A (en) * 2021-02-26 2021-06-08 中国电子科技集团公司第二十八研究所 SAR image recognition method based on fusion frequency domain and space domain network model
CN112926457B (en) * 2021-02-26 2022-09-06 中国电子科技集团公司第二十八研究所 SAR image recognition method based on fusion frequency domain and space domain network model
CN113128521A (en) * 2021-04-30 2021-07-16 西安微电子技术研究所 Method and system for extracting features of miniaturized artificial intelligence model, computer equipment and storage medium
CN113128521B (en) * 2021-04-30 2023-07-18 西安微电子技术研究所 Method, system, computer equipment and storage medium for extracting characteristics of miniaturized artificial intelligent model
CN113592743A (en) * 2021-08-11 2021-11-02 北华航天工业学院 Spectrum high-frequency information and low-frequency information separation and coupling method based on complex wavelet transformation
CN113592743B (en) * 2021-08-11 2024-01-23 北华航天工业学院 Spectral high-frequency information and low-frequency information separation and coupling method based on complex wavelet transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201110