CN111915542A - Image content description method and system based on deep learning

Image content description method and system based on deep learning

Info

Publication number
CN111915542A
CN111915542A
Authority
CN
China
Prior art keywords
image
features
description
feature
spatial domain
Prior art date
Legal status
Withdrawn
Application number
CN202010767475.9A
Other languages
Chinese (zh)
Inventor
汪礼君
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202010767475.9A
Publication of CN111915542A
Legal status: Withdrawn

Classifications

    • G06T 5/40 - Image enhancement or restoration by the use of histogram techniques
    • G06N 3/045 - Neural networks; Combinations of networks
    • G06N 3/08 - Neural networks; Learning methods
    • G06T 3/4038 - Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 7/40 - Image analysis; Analysis of texture
    • G06T 7/90 - Image analysis; Determination of colour characteristics
    • G06T 2207/20081 - Special algorithmic details; Training; Learning
    • G06T 2207/20084 - Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221 - Image combination; Image fusion; Image merging

Abstract

The invention relates to the technical field of image description, and discloses an image content description method based on deep learning, which comprises the following steps: carrying out binarization processing on an image to be described by using a threshold-based binarization method, and refining the outline of the binarized image by using a non-maximum signal suppression method; extracting spatial domain features in the image contour region; calculating the weight of different features in the spatial domain features, and fusing the spatial domain features according to the given feature weight; performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics by using a multi-direction wavelet transformation method to obtain low-frequency sub-bands of the image; inputting the low-frequency sub-band into a pre-constructed self-adaptation Net model, and extracting image description characteristics; and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text. The invention also provides an image content description system based on deep learning. The invention realizes the description of the image content.

Description

Image content description method and system based on deep learning
Technical Field
The invention relates to the technical field of image description, in particular to an image content description method and system based on deep learning.
Background
With the popularization of intelligent terminal devices and the explosive growth of multimedia applications, the generation and accumulation of the corresponding data increase day by day, and how to better utilize and process these data has become a general concern. Images and text are the most common forms of data in daily life and a major component of internet data, so related research increasingly centers on image and text data.
Image description combines computer vision with natural language processing, with the goal of enabling a computer to recognize image content and automatically generate natural language text describing it; it can be viewed as a translation process from image to text. Unlike image recognition, the text generated by image description reflects the image information more fully. Ideally, the descriptive text not only contains all target entities in the image, but also covers their features, positions and the actions between different entities, and may even perform scene inference from the image content and connect it with the background knowledge implied by the image.
For example, the GLA model combines global and local features of an image by utilizing Attention to describe a local target in the image more accurately, but the model cannot be trained end to end, so that each step in the model is independent, the result of each step affects the training result of the whole model, and meanwhile, the existing image description model cannot adaptively input the image size, so that the image size needs to be modified before image content description.
In view of this, how to accurately extract effective features of an image and improve an existing image description model to describe image content becomes a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides an image content description method based on deep learning, which is characterized in that color features and histogram features of an image are respectively extracted in a spatial domain, multi-scale complex frequency domain transformation is carried out on the image by using multi-direction wavelet transformation to obtain shape features and texture features of the image, and the description of the image content is realized by using an improved image description model.
In order to achieve the above object, the present invention provides an image content description method based on deep learning, including:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
Optionally, the threshold-based binarization method is as follows:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold, which is set to 110 by the present invention.
Optionally, the refining the contour of the binarized image based on the non-maximum signal suppression method includes:
calculating an angle value of each pixel point in the image, wherein a calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the outline in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and reserving the pixel only if the central pixel is larger than the two adjacent pixels.
Optionally, the extracting spatial domain features in the image contour region includes:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
Optionally, the calculating importance of different features in the spatial domain features and giving weights to the different spatial domain features based on the importance of the features includes:
randomly selecting a sample xi from the feature training sample set X, then finding k neighbor samples xj among the data samples of the same class as xi, and at the same time finding k neighbor samples xl among the data samples of classes different from xi;
Calculating weights of different features by using a feature weight calculation formula, wherein the features comprise color features and gradient histogram features of an image to be described, and the feature weight calculation formula is as follows:
W(A) = W(A) - Σ_{j=1..k} diff(A, xi, xj) / (m·k) + Σ_{C≠class(xi)} [ p(C) / (1 - p(class(xi))) ] · Σ_{l=1..k} diff(A, xi, xl) / (m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of the class that xi belongs to;
p(C) represents the probability of a class C different from that of xi;
a threshold of 0.01 is set: a feature is removed when its calculated weight is less than 0.01 and retained when its weight is greater than or equal to 0.01; finally, the retained features are re-concatenated in series according to their feature weights and fused to form a new feature set.
Optionally, the performing multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image by using a multi-direction wavelet transform method includes:
1) respectively establishing a multi-scale function φ_{i,j}(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2) with real part Re and imaginary part Im (the explicit function definitions are given as formula images in the original publication);
Wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions;
re represents a real part obtained by decomposing the image;
im represents an imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion characteristics f (n) of the image by using a multi-scale function and a complex wavelet function:
f(n) = Σ_{k∈Z} c_{j,k} · φ_{j,k}(n) + Σ_{i=1..8} Σ_{k∈Z} d^i_{j,k} · ψ^i_{j,k}(n)
wherein:
z represents a natural number set;
c_{j,k} represents the scale factor;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions, including 6 high frequency subbands and 2 low frequency subbands;
3) the imaginary parts obtained by the decomposition are taken as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands; the average of the two low-frequency sub-bands is taken as the low-frequency information, ensuring that low-frequency information is not lost and preventing the information loss that would be caused by selecting a single direction.
Optionally, the extracting image description features from the pre-constructed self-adaptation Net model includes:
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency subband of the image;
Wf is the mapping from the low-frequency subband to the convolutional layer input;
F is the image description feature extracted by the convolutional layers;
the pooling layer of the model is composed of three pooling structures of 1 ring, 2 rings and 4 rings, X multiplied by Y feature maps obtained by the convolutional layer are pooled, and output features are connected together to form a final image description feature;
the three pooling structures of the adaptive pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to three pooling structures of 1 ring, 2 rings and 4 rings, and each pooling structure adapts itself to the size of the input feature map without limiting the image size. The three pooling structures are respectively:
1. a central 7 × 7 pooling structure;
2. a central 4 × 4 pooling structure plus a ring pooling structure with outer ring 7 × 7 and inner ring 4 × 4;
3. a central 2 × 2 pooling structure, a ring pooling structure with outer ring 4 × 4 and inner ring 2 × 2, a ring pooling structure with outer ring 6 × 6 and inner ring 4 × 4, and a ring pooling structure with outer ring 7 × 7 and inner ring 6 × 6.
Optionally, the generating of the image description text by using the pre-constructed Conv-C network model includes:
the Conv-C network model adopts an encoding and decoding network structure, and an encoder of the Conv-C network model extracts image characteristics F to be used for guiding generation of description texts by using a deep convolution network in a self-adaptation Net model; in the description generation part, a Conv-C network model adopts multilayer convolution network learning to construct a language model, and integrates text data characteristics of different layers in the process of introducing image characteristics for a prediction process by adopting an attention mechanism image characteristic input method;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description characteristics, namely expanding the image description characteristics and adaptive mapping of a decoder network by n times to form n groups of key value pairs for an attention mechanism;
2) in the process of generating the description text, the description text to be generated is defined as Sn and initialized to S = {<pad>, ..., <pad>, <s>}, where <pad> and <s> are the blank padding character and the sentence start character respectively, and the S vector is vectorized through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) and obtaining the feature representation of the text to be generated by utilizing two layers of convolution:
[formula given as an image in the original publication]
wherein:
ConvA and ConvB respectively represent two different convolution operations, whose convolution kernel sizes are 3 × 3 and 5 × 5 respectively;
4) the obtained convolutional text feature is associated with the image description feature through an attention mechanism to obtain the adapted image feature Fatt:
[formula given as an image in the original publication]
Wherein:
f is the obtained image description characteristics;
5) the multiple features are integrated and one layer of convolution is completed to obtain the output of the convolutional layer:
[formula given as an image in the original publication]
6) the L layers of convolution operations are stacked to obtain the final feature expression Fl, which is finally mapped to the probability distribution of the next word, and the input vocabulary of the next moment is generated by sampling:
xt+1 = argmax Softmax(Wv·Fl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t+1;
t is the current time;
Fl is the feature expression obtained after stacking L layers of convolution;
finally, the obtained xt+1 is filled into the description text Sn to be generated, and the iteration continues until the generated image content description text is obtained.
In addition, to achieve the above object, the present invention also provides an image content description system based on deep learning, the system including:
image acquisition means for receiving an image to be described;
the image processor is used for carrying out binarization processing on the image by using a threshold-based binarization method and refining the outline of the binarized image by using a non-maximum signal suppression method; simultaneously extracting spatial domain features in the image contour region, and performing multi-scale complex frequency domain transformation processing on the spatial domain fusion features of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
the image content description device inputs the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, extracts image description characteristics, inputs the extracted image description characteristics into a pre-constructed Conv-C network model, and generates an image description text.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon image content description instructions, which are executable by one or more processors to implement the steps of the implementation method of image content description based on deep learning as described above.
Compared with the prior art, the invention provides an image content description method based on deep learning, and the technology has the following advantages:
Firstly, existing image description models cannot adaptively accept the input image size, so the image size has to be modified before image content description; meanwhile, the existing adaptive pooling layer is composed of three pooling structures of 1 × 1, 2 × 2 and 4 × 4, pools every input feature map of size X × Y, and connects the output features together to form (1+4+16) parameters. The invention designs a special adaptive pooling layer composed of three pooling structures of 1 ring, 2 rings and 4 rings: the input feature maps of size X × Y are pooled and the output features are connected together to form (1+2+4) parameters, so fewer parameters are output than with the existing pooling layer. Meanwhile, the three pooling structures of this pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to a central 7 × 7 pooling, a central 4 × 4 pooling plus a ring pooling with outer ring 7 × 7 and inner ring 4 × 4, and a central 2 × 2 pooling plus ring poolings with outer ring 4 × 4 and inner ring 2 × 2, outer ring 6 × 6 and inner ring 4 × 4, and outer ring 7 × 7 and inner ring 6 × 6, so that the pooling layer yields an output of the same scale regardless of the size of the input feature map.
Secondly, most of the information of an image is concentrated in the low frequencies, while the high-frequency information still retains a large amount of redundancy, and the color and shape features of an image vary gradually; however, the classical wavelet transform can only provide information in three directions (horizontal, vertical and diagonal) when decomposing an image and has difficulty adapting to continuously changing directions. The complex frequency domain transform of the invention decomposes the spatial fusion features with a multi-scale function and a complex wavelet function, takes the imaginary parts of the decomposition as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands, and provides 8 selectable directions comprising 6 high-frequency sub-bands and 2 low-frequency sub-bands, so that the redundant information carried is very limited. Meanwhile, to ensure that low-frequency information is not lost, the invention takes the average of the two low-frequency sub-bands as the low-frequency information, preventing the information loss caused by selecting a single direction.
Finally, the invention provides a Conv-C network model for generating the image description text. A language model is constructed with multi-layer convolutional network learning, and an attention-based image feature input method integrates text data features of different levels into the prediction process while the image features are introduced. The model first expands and maps the image description features: the image description features are expanded n times by the adaptive mapping of the decoder network to form n groups of key-value pairs for the attention mechanism; by decoding multiple groups of adapted image features and processing them with the attention mechanism, a degree of selection over the image features is achieved, so the image content information contained in the image feature vector is better utilized and can be conveniently aligned during text generation. Meanwhile, the model generates the image content description text through multiple convolutions; within one convolution operation, the elements inside the receptive field are associated with equal operation distances, so the operation distance between different positions in the sequence is reduced by a factor of k, where k is the convolution kernel size, which to some extent alleviates the imbalance of operations when associating sequence elements at different positions.
Drawings
Fig. 1 is a schematic flowchart of an image content description method based on deep learning according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an image content description system based on deep learning according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The color features and histogram features of the image are respectively extracted in a spatial domain, multi-scale complex frequency domain transformation is carried out on the image by using multi-direction wavelet transformation, and the description of the image content is realized by utilizing an improved image description model. Referring to fig. 1, a schematic diagram of an image content description method based on deep learning according to an embodiment of the present invention is provided.
In this embodiment, the image content description method based on deep learning includes:
and S1, acquiring an image to be described, carrying out binarization processing on the image by using a threshold-based binarization method, and refining the outline of the binarized image by using a non-maximum signal suppression method.
Firstly, the invention obtains an image to be described, and carries out binarization processing on the image by using a threshold-based binarization method, wherein the threshold-based binarization method comprises the following steps:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold value which is set as 110 by the invention;
further, the invention calculates the angle value of each pixel point in the image, and the calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the contour in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and only if the central pixel is larger than the two adjacent pixels, reserving the pixel, thereby realizing the thinning of the contour of the binary image by using a non-maximum signal suppression method.
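As an illustration of step S1, the following sketch (Python with NumPy; the helper names and the mapping from angle regions to neighbor offsets are assumptions, not taken from the patent) shows the threshold binarization and a non-maximum suppression pass that keeps a center pixel only when its gradient magnitude exceeds both of its neighbors in its quantized angle region.

```python
import numpy as np

def binarize(gray, T=110):
    """Threshold-based binarization: pixels with gray value >= T become 255, others 0."""
    return np.where(gray >= T, 255, 0).astype(np.uint8)

def non_max_suppress(gx, gy, magnitude):
    """Keep a contour pixel only if its gradient magnitude exceeds both
    neighbors along its quantized angle region (assumed neighbor mapping)."""
    h, w = magnitude.shape
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0          # angle value alpha in [0, 180)
    # assumed neighbor offsets for the four regions 0-45, 45-90, 90-135, 135-180 degrees
    offsets = {0: ((0, 1), (0, -1)),
               1: ((-1, 1), (1, -1)),
               2: ((-1, 0), (1, 0)),
               3: ((-1, -1), (1, 1))}
    out = np.zeros_like(magnitude)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            (dy1, dx1), (dy2, dx2) = offsets[min(int(angle[y, x] // 45), 3)]
            if (magnitude[y, x] > magnitude[y + dy1, x + dx1] and
                    magnitude[y, x] > magnitude[y + dy2, x + dx2]):
                out[y, x] = magnitude[y, x]                  # retained only if larger than both neighbors
    return out
```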
And S2, extracting the space domain features in the image contour region, wherein the space domain features comprise color features and gradient histogram features of the image.
Further, the invention extracts the spatial domain feature in the image contour region, and the extraction process of the spatial domain feature is as follows:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
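A minimal sketch of step S2, assuming NumPy and an H×W×3 image in the HIS color space; the 8-bin layout of the gradient histogram is an assumption chosen so that 36 windows × 8 bins give the 288-dimensional descriptor mentioned above.

```python
import numpy as np

def color_moments(img):
    """First three color moments per channel: mean (M1), variance-based M2 and skew-based M3."""
    feats = []
    for c in range(img.shape[2]):
        p = img[:, :, c].astype(np.float64).ravel()
        m1 = p.mean()
        m2 = np.sqrt(np.mean((p - m1) ** 2))
        m3 = np.cbrt(np.mean((p - m1) ** 3))
        feats.extend([m1, m2, m3])
    return np.array(feats)

def gradient_histogram(gray, step=8, windows=36, bins=8):
    """288-dimensional gradient-histogram descriptor sketch: 36 scan windows with an
    8-pixel step, each summarised by an (assumed) 8-bin orientation histogram."""
    g = gray.astype(np.float64)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]       # Gx(x,y) = H(x+1,y) - H(x-1,y)
    gy[1:-1, :] = g[2:, :] - g[:-2, :]       # Gy(x,y) = H(x,y+1) - H(x,y-1)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    descriptor = []
    for w in range(windows):
        x0 = (w * step) % max(g.shape[1] - step, 1)
        hist, _ = np.histogram(ang[:, x0:x0 + step], bins=bins,
                               range=(0, 180), weights=mag[:, x0:x0 + step])
        descriptor.extend(hist)
    return np.array(descriptor)              # 36 windows x 8 bins = 288 dimensions
```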
And S3, calculating the importance of different features in the spatial domain features, giving the weights of the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given feature weights.
Further, the color features and gradient histogram features of the obtained image are concatenated in series to obtain a feature training sample set X = {xi, i = 1, ..., N} of the image to be described, where xi = {A1, ..., Ax} is the spatial domain concatenated feature extracted from the i-th image to be described;
further, the invention randomly selects a sample X from a feature training sample set XiThen from and xiFinding k adjacent samples x from data samples of the same typejSimultaneously from and xiFinding k neighbor samples x from heterogeneous data samplesl
Calculating weights of different features by using a feature weight calculation formula, wherein the features comprise color features and gradient histogram features of an image to be described, and the feature weight calculation formula is as follows:
W(A) = W(A) - Σ_{j=1..k} diff(A, xi, xj) / (m·k) + Σ_{C≠class(xi)} [ p(C) / (1 - p(class(xi))) ] · Σ_{l=1..k} diff(A, xi, xl) / (m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of the class that xi belongs to;
p(C) represents the probability of a class C different from that of xi;
a threshold of 0.01 is set: a feature is removed when its calculated weight is less than 0.01 and retained when its weight is greater than or equal to 0.01; finally, the retained features are re-concatenated in series according to their feature weights and fused to form a new feature set.
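The weighting in step S3 matches the classical ReliefF scheme, so the sketch below follows that algorithm (NumPy assumed; `relieff_weights` and `fuse` are hypothetical helper names, and the L1 nearest-neighbor search is a simplification).

```python
import numpy as np

def relieff_weights(X, y, m=50, k=5, rng=None):
    """ReliefF-style feature weights: near-hits of the same class lower a
    feature's weight, near-misses of other classes raise it."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12       # normalises diff(A, ., .)
    priors = {c: float(np.mean(y == c)) for c in np.unique(y)}
    w = np.zeros(d)
    for _ in range(m):
        i = int(rng.integers(n))
        order = np.argsort(np.abs(X - X[i]).sum(axis=1))
        hits = [j for j in order if y[j] == y[i] and j != i][:k]
        w -= np.abs(X[hits] - X[i]).sum(axis=0) / (span * m * k)
        for c, p in priors.items():
            if c == y[i]:
                continue
            misses = [j for j in order if y[j] == c][:k]
            w += (p / (1 - priors[y[i]])) * np.abs(X[misses] - X[i]).sum(axis=0) / (span * m * k)
    return w

def fuse(features, weights, threshold=0.01):
    """Drop features whose weight is below the 0.01 threshold and re-weight the rest."""
    keep = weights >= threshold
    return features[:, keep] * weights[keep]
```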
And S4, performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image.
Furthermore, the invention uses a multidirectional wavelet transform method to perform multi-scale complex frequency domain transform processing on the spatial domain fusion characteristics of the image, and the process of the multi-scale complex frequency domain transform processing is as follows:
1) respectively establishing a multi-scale function φ_{i,j}(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2) with real part Re and imaginary part Im (the explicit function definitions are given as formula images in the original publication);
Wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions;
re represents a real part obtained by decomposing the image;
im represents an imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion characteristics f (n) of the image by using a multi-scale function and a complex wavelet function:
f(n) = Σ_{k∈Z} c_{j,k} · φ_{j,k}(n) + Σ_{i=1..8} Σ_{k∈Z} d^i_{j,k} · ψ^i_{j,k}(n)
wherein:
z represents a natural number set;
c_{j,k} represents the scale factor;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i denotes the number of frequency subbands in 8 directions, including 6 high frequency subbands and 2 low frequency subbands;
3) the imaginary parts obtained by the decomposition are taken as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands; the average of the two low-frequency sub-bands is taken as the low-frequency information, ensuring that low-frequency information is not lost and preventing the information loss that would be caused by selecting a single direction.
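A small sketch of the sub-band bookkeeping in step S4. The multi-direction complex wavelet decomposition itself is not reproduced; `complex_subbands` is assumed to be its output, a list of 8 complex arrays with the 6 high-frequency directions first and the 2 low-frequency ones last.

```python
import numpy as np

def split_subbands(complex_subbands):
    """Sub-band handling for step S4: imaginary parts of the 6 directional
    sub-bands become the high-frequency sub-bands, real parts the low-frequency
    ones, and the two low-frequency sub-bands are averaged so that no
    low-frequency information is lost."""
    assert len(complex_subbands) == 8
    high = [np.imag(b) for b in complex_subbands[:6]]       # 6 high-frequency sub-bands
    low_pair = [np.real(b) for b in complex_subbands[6:]]   # 2 low-frequency sub-bands
    low = (low_pair[0] + low_pair[1]) / 2.0                 # averaged low-frequency information
    return high, low
```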
And S5, inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description features.
Further, inputting a low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics of the image to be described;
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency subband of the image;
Wf is the mapping from the low-frequency subband to the convolutional layer input;
F is the image description feature extracted by the convolutional layers;
compared with the existing model, the pooling layer of the model is composed of three pooling structures of 1 ring, 2 rings and 4 rings, X multiplied by Y feature maps obtained by the convolutional layer are pooled, and output features are connected together to form a final image description feature;
the three pooling structures of the adaptive pooling layer are not fixed in size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalent to three pooling structures of 1 ring, 2 rings and 4 rings, and each pooling structure adapts itself to the size of the input feature map without limiting the image size. The three pooling structures are respectively:
1. a central 7 × 7 pooling structure;
2. a central 4 × 4 pooling structure plus a ring pooling structure with outer ring 7 × 7 and inner ring 4 × 4;
3. a central 2 × 2 pooling structure, a ring pooling structure with outer ring 4 × 4 and inner ring 2 × 2, a ring pooling structure with outer ring 6 × 6 and inner ring 4 × 4, and a ring pooling structure with outer ring 7 × 7 and inner ring 6 × 6.
For a feature map of the input adaptive pooling layer size of 20 × 20, the three pooling structures are:
1. a central 20 × 20 pooling structure;
2. a central 10 × 10 pooling structure plus a ring pooling structure with outer ring 20 × 20 and inner ring 10 × 10;
3. a central 5 × 5 pooling structure, a ring pooling structure with outer ring 10 × 10 and inner ring 5 × 5, a ring pooling structure with outer ring 15 × 15 and inner ring 10 × 10, and a ring pooling structure with outer ring 20 × 20 and inner ring 15 × 15.
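The ring pooling of the self-adaptation Net's adaptive pooling layer could be sketched as below (NumPy assumed; max pooling over each ring is an assumption, as the patent does not state which pooling operation is used, and the ring regions are taken as centred squares).

```python
import numpy as np

def ring_pool(feature_map, rings):
    """Pool over centred square rings; `rings` is a list of (outer, inner) sizes,
    inner == 0 meaning a solid central square. Max pooling is assumed."""
    h, w = feature_map.shape
    cy, cx = h // 2, w // 2
    pooled = []
    for outer, inner in rings:
        mask = np.zeros((h, w), dtype=bool)
        oy, ox = outer // 2, outer // 2
        mask[cy - oy:cy - oy + outer, cx - ox:cx - ox + outer] = True
        if inner > 0:
            iy = inner // 2
            mask[cy - iy:cy - iy + inner, cx - iy:cx - iy + inner] = False   # hollow out the inner square
        pooled.append(feature_map[mask].max())
    return np.array(pooled)

# The three structures described above for a 7 x 7 input feature map:
fm = np.random.rand(7, 7)
p1 = ring_pool(fm, [(7, 0)])                           # 1-ring structure
p2 = ring_pool(fm, [(4, 0), (7, 4)])                   # 2-ring structure
p3 = ring_pool(fm, [(2, 0), (4, 2), (6, 4), (7, 6)])   # 4-ring structure
descriptor = np.concatenate([p1, p2, p3])              # 1 + 2 + 4 = 7 outputs
```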
And S6, inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
Further, the extracted image description features are input into a pre-constructed Conv-C network model to generate an image description text;
the Conv-C network model adopts an encoding and decoding network structure, and an encoder of the Conv-C network model extracts image characteristics F to be used for guiding generation of description texts by using a deep convolution network in a self-adaptation Net model; in the description generation part, a Conv-C network model adopts multilayer convolution network learning to construct a language model, and integrates text data characteristics of different layers in the process of introducing image characteristics for a prediction process by adopting an attention mechanism image characteristic input method;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description characteristics, namely expanding the image description characteristics and adaptive mapping of a decoder network by n times to form n groups of key value pairs for an attention mechanism;
meanwhile, the image features are selected to a certain extent by carrying out multi-group adaptive decoding on the image features and processing by combining an attention mechanism, so that image content information contained in image feature vectors is better utilized, and the text generation process can be conveniently aligned;
2) in the process of generating the description text, the description text to be generated is defined as Sn and initialized to S = {<pad>, ..., <pad>, <s>}, where <pad> and <s> are the blank padding character and the sentence start character respectively, and the S vector is vectorized through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) and obtaining the feature representation of the text to be generated by utilizing two layers of convolution:
[formula given as an image in the original publication]
wherein:
ConvA and ConvB respectively represent two different convolution operations, whose convolution kernel sizes are 3 × 3 and 5 × 5 respectively;
4) the obtained convolutional text feature is associated with the image description feature through an attention mechanism to obtain the adapted image feature Fatt:
[formula given as an image in the original publication]
Wherein:
f is the obtained image description characteristics;
5) the multiple features are integrated and one layer of convolution is completed to obtain the output of the convolutional layer:
[formula given as an image in the original publication]
6) the L layers of convolution operations are stacked to obtain the final feature expression Fl, which is finally mapped to the probability distribution of the next word, and the input vocabulary of the next moment is generated by sampling:
xt+1=argmax Softmax(WvFl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t+1;
t is the current time;
Fl is the feature expression obtained after stacking L layers of convolution;
finally, the invention fills the obtained xt+1 into the description text Sn to be generated and iterates continuously until the generated image content description text is obtained.
The following describes embodiments of the invention through an algorithm experiment and tests of the processing method of the invention. The hardware test environment of the algorithm is deployed on the TensorFlow deep learning framework; the processor is an 8-core Intel(R) Core(TM) i5-7700 CPU, the graphics card is a GeForce GTX1040 with 8 GB of video memory, the development environment is Python 3.5, and the development tool is the Anaconda scientific computing library; the comparison algorithm models are the ResNet-F5 model, the VGG-F3 model and the CNN model.
In the algorithm experiment, the image description dataset released by MS COCO in 2014 is adopted; it contains 80,000 groups of training samples and 40,000 groups of validation samples in total, and each group of samples contains one image and 5 corresponding manually annotated English description sentences. In the experiment, the 80,000 groups of training data are used as the training set for training the model, and 500 samples are randomly selected from the validation set as the test set for evaluating the effect of the model.
In the aspect of data set processing, the original image size is kept in the input process for image data, and the image data is input into an image coding network after being normalized and standardized. For text data, a dictionary is constructed based on description texts in the entire training set, and words in which the occurrence frequency is greater than or equal to 5 are retained as a training dictionary, and the dictionary dimension used in the experiment is about 14500. During the process the words not contained in the dictionary are replaced with < UNK >, the text sequence is filled up to a fixed length with < PAD >, set to 25 in the experiment, resulting in the final training set. The image data in the data set are respectively input into the training model, the generated image content description text is compared with the description text in the training set, and the comparison result is the image content description accuracy of the algorithm model.
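A minimal sketch of this text preprocessing under the stated settings (minimum word frequency 5, fixed length 25); the function names are hypothetical.

```python
from collections import Counter

def build_vocab(captions, min_freq=5):
    """Keep words that occur at least `min_freq` times in the training captions."""
    counts = Counter(w for caption in captions for w in caption.split())
    return {w for w, c in counts.items() if c >= min_freq}

def encode(caption, vocab, max_len=25):
    """Replace out-of-vocabulary words with <UNK> and pad the sequence to a fixed length."""
    words = [w if w in vocab else "<UNK>" for w in caption.split()]
    words = words[:max_len]
    return words + ["<PAD>"] * (max_len - len(words))
```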
According to the experimental results, the image content description accuracy of the ResNet-F5 model is 81.93%, that of the VGG-F3 model is 76.25%, that of the CNN model is 86.78%, and that of the proposed algorithm is 89.21%; compared with the comparison algorithms, the image content description method based on deep learning provided by the invention achieves higher accuracy in describing image content.
The invention also provides an image content description system based on deep learning. Referring to fig. 2, a schematic diagram of an internal structure of an image content description system based on deep learning according to an embodiment of the present invention is provided.
In the present embodiment, the image content description system 1 based on deep learning includes at least an image acquisition device 11, an image processor 12, an image content description device 13, a communication bus 14, and a network interface 15.
The image capturing device 11 may be a PC (Personal Computer), a terminal device such as a smart phone, a tablet Computer, or a mobile Computer, or may be a server.
Image processor 12 includes at least one type of readable storage medium including flash memory, a hard disk, a multi-media card, a card-type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The image processor 12 may in some embodiments be an internal storage unit of the deep learning based image content description system 1, for example a hard disk of the deep learning based image content description system 1. The image processor 12 may also be an external storage device of the deep learning based image content description system 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the deep learning based image content description system 1. Further, the image processor 12 may also include both an internal storage unit and an external storage device of the depth learning based image content description system 1. The image processor 12 can be used not only to store application software installed in the deep learning-based image content description system 1 and various kinds of data, but also to temporarily store data that has been output or is to be output.
The image content description device 13 may be, in some embodiments, a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip for executing program codes stored in the image processor 12 or Processing data, such as image content description program instructions.
The communication bus 14 is used to enable connection communication between these components.
The network interface 15 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the system 1 and other electronic devices.
Optionally, the system 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the deep learning based image content description system 1 and for displaying a visualized user interface.
Fig. 2 only shows the image content description system 1 with the components 11-15 and based on deep learning, it will be understood by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the image content description system 1 based on deep learning, and may comprise fewer or more components than shown, or combine certain components, or a different arrangement of components.
In the embodiment of apparatus 1 shown in fig. 2, image processor 12 has stored therein program instructions describing the content of images based on deep learning; the steps of the image content description device 13 executing the image content description program instructions stored in the image processor 12 are the same as the implementation method of the image content description method based on the deep learning, and are not described here.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium having stored thereon image content description program instructions executable by one or more processors to implement the following operations:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An image content description method based on deep learning, characterized in that the method comprises:
acquiring an image to be described, and carrying out binarization processing on the image by using a threshold-based binarization method;
thinning the outline of the binary image by using a non-maximum signal suppression method;
extracting spatial domain features in the image contour region, wherein the spatial domain features comprise color features and gradient histogram features of an image;
calculating the importance of different features in the spatial domain features, giving weights to the different spatial domain features based on the importance of the features, and fusing the spatial domain features according to the given weights of the features;
performing multi-scale complex frequency domain transformation processing on the spatial domain fusion characteristics of the image by using a multi-direction wavelet transformation method to obtain a high-frequency sub-band and a low-frequency sub-band of the image;
inputting the low-frequency sub-band of the image into a pre-constructed self-adaptation Net model, and extracting image description characteristics;
and inputting the extracted image description features into a pre-constructed Conv-C network model, and generating an image description text.
2. The image content description method based on deep learning as claimed in claim 1, wherein the binarization method based on threshold value is:
g(x, y) = 255 when f(x, y) >= T, and g(x, y) = 0 when f(x, y) < T
wherein:
g (x, y) is the gray value of a pixel point with coordinates (x, y) in the binary image after binarization;
f (x, y) is the gray value of the pixel point with the coordinate (x, y) in the original image;
t is a binary threshold, which is set to 110 by the present invention.
3. The image content description method based on deep learning of claim 2, wherein the refining of the contour of the binarized image based on the non-maximum signal suppression method comprises:
calculating an angle value of each pixel point in the image, wherein a calculation formula of the angle value alpha is as follows:
alpha = arctan(gy / gx)
wherein:
gx is the gradient of the pixel point in the x direction;
gy is the gradient of the pixel point in the y direction;
dividing to obtain four angle regions which are respectively 0-45 degrees, 45-90 degrees, 90-135 degrees and 135-180 degrees, and classifying all pixel point angles into four discrete angle regions;
and comparing two adjacent pixels in the same angle area of the central pixel on the outline in the image, if the central pixel is smaller than any one of the two adjacent pixels, discarding the pixel, and reserving the pixel only if the central pixel is larger than the two adjacent pixels.
4. The image content description method based on deep learning of claim 3, wherein the extracting the spatial domain features in the image contour region comprises:
1) extracting color features of the image by using an HIS model:
M1 = (1/N) Σ_j Pij
M2 = [ (1/N) Σ_j (Pij - M1)^2 ]^(1/2)
M3 = [ (1/N) Σ_j (Pij - M1)^3 ]^(1/3)
wherein:
Pij is the component of the i-th color channel of the image with gray level j;
N is the total number of pixel points in the image;
M1 is the first moment of the image color feature and represents the mean of the image color feature;
M2 is the second moment of the image color feature and represents the variance of the image color feature;
M3 is the third moment of the image color feature and represents the skewness of the image color feature;
2) performing Gamma normalization processing on the image, wherein the Gamma compression formula is as follows:
I(x, y) = I(x, y)^gamma
wherein:
I(x, y) is the image to be described;
3) calculating the gradient value of each pixel point in the horizontal direction and the vertical direction:
Gx(x,y)=H(x+1,y)-H(x-1,y)
Gy(x,y)=H(x,y+1)-H(x,y-1)
wherein:
H(x, y) represents the pixel value at a given pixel point of the image to be described;
Gx(x, y) and Gy(x, y) are respectively the horizontal and vertical gradients at pixel point (x, y) in the input image;
4) the gradients of all pixel points in the image are sequentially connected end to end, the gradient value of the pixel of the whole image to be described is scanned by taking 8 pixels as step length, 36 scanning windows are arranged in the horizontal direction, and a 288-dimensional descriptor is formed and used as the gradient histogram feature of the image to be described.
5. The method as claimed in claim 4, wherein the calculating the importance of different features in the spatial domain features and giving weights to the different spatial domain features based on the importance of the features comprises:
randomly selecting a sample xi from the feature training sample set X, then finding k nearest-neighbor samples xj from the data samples of the same class as xi, and simultaneously finding k nearest-neighbor samples xl from the data samples of classes different from xi;
calculating the weights of the different features by using a feature weight calculation formula, wherein the features comprise the color features and the gradient histogram features of the image to be described, and the feature weight calculation formula is:
W(A) = W(A) - Σ(j=1..k) diff(A, xi, xj)/(m·k) + Σ(C ≠ class(xi)) [p(C)/(1 - p(class(xi)))] Σ(l=1..k) diff(A, xi, xl)/(m·k)
diff(A, xj, xl) = |xj(A) - xl(A)| / (max(A) - min(A))
wherein:
diff(A, xj, xl) represents the difference between samples xj and xl on feature A;
A is the spatial domain concatenated feature vector of the image to be described;
m is the number of sampling iterations;
k is the number of selected nearest-neighbor samples;
class(xi) denotes the class of xi, and p(class(xi)) represents the probability of being in the same class as xi;
p(C) represents the probability of a class different from that of xi;
setting the threshold to 0.01: when the calculated weight of a feature is less than 0.01 the feature is removed, and when its weight is greater than or equal to 0.01 it is retained; finally, the retained features are re-concatenated and fused in series according to their feature weights to form a new feature set.
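For illustration, a ReliefF-style weighting sketch for claim 5, assuming NumPy, numeric features scaled to [0, 1], and a Manhattan distance for the neighbor search; the sampling details are assumptions:

```python
import numpy as np

def relieff_weights(X: np.ndarray, y: np.ndarray, m: int = 50, k: int = 5) -> np.ndarray:
    """ReliefF-style weights: reward features that differ across classes and agree within a class."""
    n, d = X.shape
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    rng = np.random.default_rng(0)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)             # Manhattan distance to every sample
        dist[i] = np.inf                                 # exclude the sample itself
        same = np.where(y == y[i])[0]
        hits = same[np.argsort(dist[same])][:k]          # k nearest same-class neighbors
        w -= np.abs(X[hits] - X[i]).mean(axis=0) / m
        for c in classes:
            if c == y[i]:
                continue
            other = np.where(y == c)[0]
            misses = other[np.argsort(dist[other])][:k]  # k nearest neighbors of class c
            w += prior[c] / (1 - prior[y[i]]) * np.abs(X[misses] - X[i]).mean(axis=0) / m
    return w
```

Features whose resulting weight falls below the 0.01 threshold would then be dropped before the remaining features are re-concatenated by weight.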
6. The image content description method based on deep learning of claim 5, wherein performing the multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image by the multi-direction wavelet transform method comprises:
1) respectively establishing a multi-scale function φi,j(n1, n2) and a complex wavelet function ψ^i_{j,k}(n1, n2), each consisting of a real part Re and an imaginary part Im [defining formulas given as equation images FDA0002615235670000033 to FDA0002615235670000035 in the original];
wherein:
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i indexes the frequency subbands over the 8 directions;
Re represents the real part obtained by decomposing the image;
Im represents the imaginary part obtained by decomposing the image;
2) decomposing the spatial domain fusion features f(n) of the image by using the multi-scale function and the complex wavelet function:
f(n) = Σ(k∈Z) cj,k φj,k(n) + Σ(i) Σ(j,k∈Z) d^i_{j,k} ψ^i_{j,k}(n)
wherein:
Z represents the natural number set;
cj,k represents the scale coefficient;
d^i_{j,k} represents the complex wavelet coefficient in the i-th direction;
n1, n2 are any two feature vectors of the spatial domain fusion features;
j, k represent the expansion and translation indices of the image, respectively;
i indexes the frequency subbands over the 8 directions, which include 6 high-frequency subbands and 2 low-frequency subbands;
3) taking the imaginary parts obtained from the decomposition as the high-frequency sub-bands of the image and the real parts as the low-frequency sub-bands of the image, and taking the average of the two low-frequency sub-bands as the low-frequency information, so that the low-frequency information is not lost and the information loss caused by selecting only a single direction is prevented.
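The multi-direction complex wavelet transform of claim 6 is specific to the invention; as a rough stand-in for experimentation, the sketch below splits a feature map into low-frequency and high-frequency sub-bands with a standard single-level 2-D DWT from PyWavelets (the real-valued 'db2' wavelet is an assumed simplification, not the claimed transform):

```python
import numpy as np
import pywt

def frequency_subbands(feature_map: np.ndarray, wavelet: str = "db2"):
    """Single-level 2-D DWT: approximation (low-frequency) and detail (high-frequency) subbands."""
    low, (horiz, vert, diag) = pywt.dwt2(feature_map, wavelet)
    high = [horiz, vert, diag]          # directional high-frequency subbands
    return low, high
```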
7. The image content description method based on deep learning of claim 6, wherein extracting image description features by using the pre-constructed self-adaptation Net model comprises:
the self-adaptation Net model comprises a plurality of convolution layers and a pooling layer, wherein the formula for extracting features by utilizing the convolution layers is as follows:
F=Wf(I)
wherein:
I is the low-frequency sub-band of the image;
Wf is the mapping applied by the convolution layers to the low-frequency sub-band input;
F is the image description feature extracted by the convolution layers;
the pooling layer of the model is composed of three pooling structures with 1 ring, 2 rings and 4 rings; the X × Y feature maps obtained from the convolutional layers are pooled, and the output features are concatenated to form the final image description feature;
the three pooling structures of the adaptive pooling layer are not of fixed size but change with the input feature map; when the feature map input to the adaptive pooling layer is of size 7 × 7, the pooling layer is equivalently formed by the three pooling structures of 1 ring, 2 rings and 4 rings, which are respectively:
1. a pooling structure of size 7 × 7 at the center;
2. a pooling structure of size 4 × 4 at the center, plus an annular pooling structure with a 7 × 7 outer ring and a 4 × 4 inner ring;
3. a pooling structure of size 2 × 2 at the center, plus annular pooling structures with a 4 × 4 outer ring and a 2 × 2 inner ring, a 6 × 6 outer ring and a 4 × 4 inner ring, and a 7 × 7 outer ring and a 6 × 6 inner ring.
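For illustration, a sketch of the ring-style adaptive pooling of claim 7, assuming average pooling within each region, a square single-channel feature map, and a rounding rule for intermediate ring boundaries that only approximates the 7 × 7 example above:

```python
import numpy as np

def ring_pool(fmap: np.ndarray, rings=(1, 2, 4)) -> np.ndarray:
    """Average-pool concentric square regions of a square feature map and concatenate the results."""
    side = fmap.shape[0]
    feats = []
    for n_rings in rings:
        # side lengths of the nested squares, from the centre block out to the full map
        sizes = [round(side * (r + 1) / n_rings) for r in range(n_rings)]
        prev = 0
        for s in sizes:
            off = (side - s) // 2
            region = fmap[off:off + s, off:off + s].astype(float)
            if prev > 0:  # blank out the inner square already covered by the previous region
                inner = (s - prev) // 2
                region[inner:inner + prev, inner:inner + prev] = np.nan
            feats.append(np.nanmean(region))
            prev = s
    return np.array(feats)
```

In the full model, this pooling would be applied to each of the X × Y feature maps and the pooled values concatenated into the final image description feature.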
8. The image content description method based on deep learning of claim 7, wherein the generation of the image description text by using the pre-constructed Conv-C network model comprises:
the Conv-C network model adopts an encoder-decoder network structure; its encoder uses the deep convolutional network of the self-adaptation Net model to extract the image feature F used to guide the generation of the description text; in the description generation part, the Conv-C network model builds a language model with a multi-layer convolutional network and, by feeding the image feature in through an attention mechanism, integrates text features from different layers into the prediction process;
the process of generating the image description text by using the Conv-C network model comprises the following steps:
1) expanding and mapping the image description features: the image description features are expanded n times and adaptively mapped to the decoder network to form n groups of key-value pairs for the attention mechanism;
2) in the process of generating the description text, defining the description text to be generated as Sn and initializing it as S = [<pad>, <pad>, <s>], where <pad> and <s> are the padding character and the sentence start character, respectively, and vectorizing S through a word embedding mapping layer:
Vs=WeS
wherein:
We is the weight of the word embedding mapping layer;
Vs is the vectorized description text to be generated;
3) obtaining the feature representation of the text to be generated by using two layers of convolution:
[formula given as equation image FDA0002615235670000051 in the original]
wherein:
ConvA and ConvB represent two different convolution operations, with convolution kernel sizes of 3 × 3 and 5 × 5, respectively;
4) correlating the obtained convolutional text features with the image description features through the attention mechanism to obtain the attended image feature Fatt:
[formula given as equation image FDA0002615235670000052 in the original]
wherein:
F is the obtained image description feature;
5) integrating the multiple features and completing one layer of convolution operation to obtain the output of the convolution layer:
[formula given as equation image FDA0002615235670000053 in the original]
6) superposing L layers of convolution operations to obtain the final feature expression Fl, mapping it to the probability distribution of the next word, and generating the input word for the next time step by sampling:
xt+1=argmax Softmax(WvFl)
wherein:
Wv is the weight that maps the feature vector to each word in the dictionary;
xt+1 is the image content description word generated at time t + 1;
t is the current time;
Fl is the feature expression obtained after superposing L layers of convolution;
finally, filling the obtained xt+1 into the description text Sn to be generated, and obtaining the generated image content description text through continuous iteration.
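For illustration, a simplified greedy decoding loop in the spirit of claim 8, assuming PyTorch; TinyConvDecoder is a hypothetical stand-in (embedding, two convolutions, a crude additive substitute for the attention step, vocabulary projection), not the patented Conv-C architecture:

```python
import torch
import torch.nn as nn

class TinyConvDecoder(nn.Module):
    """Hypothetical stand-in decoder: embedding, two convolutions, vocabulary projection."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # word embedding mapping layer (We)
        self.conv_a = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.conv_b = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.out = nn.Linear(dim, vocab_size)               # maps features to dictionary words (Wv)

    def forward(self, tokens: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        v = self.embed(tokens).transpose(1, 2)              # Vs = We S, shape (batch, dim, length)
        h = self.conv_b(self.conv_a(v))                     # two-layer convolutional text features
        h = h + image_feat.unsqueeze(-1)                    # crude substitute for the attention step
        return self.out(h[:, :, -1])                        # logits for the next word

def greedy_decode(model, image_feat, start_id, pad_id, end_id, max_len=20):
    """xt+1 = argmax Softmax(Wv Fl), appended to S until the end token or length limit is reached."""
    tokens = torch.tensor([[pad_id, pad_id, start_id]])     # S = [<pad>, <pad>, <s>]
    for _ in range(max_len):
        logits = model(tokens, image_feat)
        next_id = logits.softmax(dim=-1).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == end_id:
            break
    return tokens
```

In the claimed model, the image feature would instead enter through the attention mechanism over the n key-value groups described in step 1).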
9. An image content description system based on deep learning, characterized in that the system comprises:
image acquisition means for receiving an image to be described;
an image processor for performing binarization processing on the image with a threshold-based binarization method and refining the contour of the binarized image with a non-maximum signal suppression method, and for extracting the spatial domain features in the image contour region and performing multi-scale complex frequency domain transform processing on the spatial domain fusion features of the image with the multi-direction wavelet transform method to obtain the high-frequency sub-bands and the low-frequency sub-band of the image;
an image content description device for inputting the low-frequency sub-band of the image into the pre-constructed self-adaptation Net model to extract image description features, and inputting the extracted image description features into the pre-constructed Conv-C network model to generate the image description text.
10. A computer-readable storage medium having stored thereon image content description program instructions executable by one or more processors to implement the steps of the deep learning-based image content description method as claimed in any one of claims 1 to 8.
CN202010767475.9A 2020-08-03 2020-08-03 Image content description method and system based on deep learning Withdrawn CN111915542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767475.9A CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010767475.9A CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN111915542A true CN111915542A (en) 2020-11-10

Family

ID=73287061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010767475.9A Withdrawn CN111915542A (en) 2020-08-03 2020-08-03 Image content description method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN111915542A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926457A (en) * 2021-02-26 2021-06-08 中国电子科技集团公司第二十八研究所 SAR image recognition method based on fusion frequency domain and space domain network model
CN112926457B (en) * 2021-02-26 2022-09-06 中国电子科技集团公司第二十八研究所 SAR image recognition method based on fusion frequency domain and space domain network model
CN113128521A (en) * 2021-04-30 2021-07-16 西安微电子技术研究所 Method and system for extracting features of miniaturized artificial intelligence model, computer equipment and storage medium
CN113128521B (en) * 2021-04-30 2023-07-18 西安微电子技术研究所 Method, system, computer equipment and storage medium for extracting characteristics of miniaturized artificial intelligent model
CN113592743A (en) * 2021-08-11 2021-11-02 北华航天工业学院 Spectrum high-frequency information and low-frequency information separation and coupling method based on complex wavelet transformation
CN113592743B (en) * 2021-08-11 2024-01-23 北华航天工业学院 Spectral high-frequency information and low-frequency information separation and coupling method based on complex wavelet transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201110