CN101706780A - Image semantic retrieving method based on visual attention model

Info

Publication number
CN101706780A
Authority
CN
China
Prior art keywords
image
salient
edge
region
Prior art date
Legal status
Pending
Application number
CN200910092164A
Other languages
Chinese (zh)
Inventor
冯松鹤
郎丛妍
须德
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN200910092164A
Publication of CN101706780A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an image semantic retrieval method based on a visual attention mechanism model. The method is completely data-driven, so it understands image semantics from the user's point of view as far as possible without increasing the user's interactive burden, and stays close to the user's perception so as to improve retrieval performance. The method has the following advantages: (1) the visual attention mechanism theory from visual cognition theory is introduced into image retrieval; (2) the method is a completely bottom-up retrieval mode, so it avoids the user burden brought by user feedback; and (3) the salient edge information and salient region information in the image are considered simultaneously, a fused retrieval mode is realized, and image retrieval performance is improved.

Description

Image semantic retrieval method based on visual attention model
Technical Field
The invention relates to an image recognition and retrieval technology, in particular to an image semantic retrieval method.
Background
With the rapid development of multimedia and Internet technology, digital images have become a widely used medium. The rapid spread of digital cameras and camera-equipped mobile devices in recent years has made digital images easier than ever to acquire; the number of images that people encounter and need to process every day grows geometrically, and their range of application keeps expanding. Faced with image resources on this scale, how to organize and search them efficiently and quickly has become a problem that urgently needs to be solved. Unlike text, which explains its own content, an image relies on a person's subjective understanding to convey its meaning, so image retrieval is more difficult than text query and matching. In the 1970s and 1980s, image retrieval was mainly text-based: text related to each image was annotated manually, which required a large labeling effort and depended on the annotator's subjective judgment. Content-Based Image Retrieval (CBIR) technology was first proposed in 1992 and, thanks to its rich content forms and broad application prospects, has been studied extensively; it has wide potential applications in many fields such as biomedicine, digital libraries, the military, education, commerce, and Internet search. In a content-based image retrieval system, images are described by their own visual information (high-dimensional feature vectors such as color, texture, and shape), and queries are performed using visual similarity measures between images. During retrieval, because it is difficult for a user to directly input the feature vector corresponding to the target image, the system asks the user to provide a representative sample image or a hand-drawn sketch; the system then uses the feature vector of this image to search the image database for images that are visually similar, ranks them by similarity, and returns the top-ranked images to the user as the retrieval result. Since the system automatically extracts and matches the visual content of images, CBIR overcomes the inefficiency and subjectivity of manual labeling.
In the early stages of CBIR, research focused on how to select appropriate global features (e.g., color histograms, edge direction histograms) or combinations of features to describe image content, and then perform image matching with appropriate similarity measures to improve retrieval accuracy. Because global image features only provide a coarse-grained semantic description and ignore the difference between foreground objects and background, they cannot reflect the rich, detailed semantic information of an image; such methods are generally only suitable for simple images or images with a single background. Early CBIR prototype systems such as QBIC (see document 1, Flickner M, Sawhney H, et al. Query by image and video content: the QBIC system. IEEE Computer, 1995, 28(9): 23-32), Photobook (see document 2, Pentland A, Picard R W, Sclaroff S. Photobook: tools for content-based manipulation of image databases. In: Proc. of SPIE, Vol. 2185 (1994): 34-47) and VisualSEEk (see document 3, Smith J R, Chang S F. VisualSEEk: a fully automated content-based image query system. In: Proc. of ACM Int. Conf. on Multimedia, 1996) were all based on this kind of global feature matching.
Region-feature-based image retrieval is one of the important ways of realizing image semantic retrieval. It can overcome the inability of global image features to satisfy a user's object-level retrieval requirements, and compared with global features, region-based features allow deeper understanding and analysis of the image. At the same time, region-based retrieval is closer to the user's retrieval intention: when retrieving images, a user generally wants to find an image set similar to an object contained in the query image. The image is first divided into several homogeneous regions using a classical image segmentation technique; low-level visual features such as color, texture and shape are then extracted for each region and assembled into a feature vector; finally, region-based feature matching is performed and the most similar image set is output.
The concept of image retrieval using segmented regions was first proposed in the Netra system (see document 4, Ma W Y, Manjunath B S. NeTra: a toolbox for navigating large image databases. In: Proc. of IEEE Int. Conf. on Image Processing (ICIP'97), Santa Barbara, USA, Oct. 1997: 568) developed at the University of California, Santa Barbara (UCSB), which segments images with the edge flow (EdgeFlow) segmentation method. Each segmented region is described by features such as color, texture, and the spatial relationships between regions, and each feature is then clustered with a vector quantization technique to form a visual dictionary (Visual Codebook). The on-line query process mainly relies on the user selecting, from the regions into which the image has been segmented, the region to be queried; the user may also indicate which features (shape, color, texture, etc.) to use for the query. Matching is then completed in the image library according to the information provided by the user, and finally the similar results are output. The Netra system laid the foundation for region-based retrieval methods, and much subsequent work follows the basic framework of this system. However, the system has an obvious disadvantage: the interaction it requires of the user is too complex, which makes it hard to popularize.
The Blobworld system (see document 5, Carson C, Belongie S, Greenspan H, Malik J. Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002, 24(8): 1026-1038), introduced by the University of California, Berkeley, employs an image segmentation algorithm based on the expectation-maximization algorithm. The algorithm models the joint distribution of the color and texture features of the image with a Gaussian mixture model and then divides the image into several regions of uniform color and texture. Since the system requires the user to specify the region of interest, a simple one-to-one matching mode can be adopted as the similarity matching strategy during retrieval. This system also suffers from the drawback of requiring too much user interaction.
Subsequently, the SIMPLIcity system was introduced by a research group at Stanford University (see document 6, Wang J Z, Li J, Wiederhold G. SIMPLIcity: semantics-sensitive integrated matching for picture libraries. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2001, 23(9): 947-963). The system first computes, for each 4x4 block of the image, color features (the means of the three components in the LUV color space) and texture features (the second moments of the three high-frequency channels after wavelet transformation), then performs clustering with an adaptive k-means algorithm, dividing the image into several regions of uniform color and texture. The biggest difference from the two systems above lies in the query process: the system does not require the user to provide information such as a region of interest or matching features, and directly matches all extracted regions. For image similarity measurement during matching, a "many-to-many" matching strategy, namely the Integrated Region Matching (IRM) algorithm, is adopted.
Although region-based image retrieval is closer to the user's query intent than global-feature-based retrieval, it still has problems. First, since image segmentation remains a very difficult topic in computer vision, existing segmentation techniques cannot guarantee accurate extraction of the objects in an image, so segmented regions do not correspond well to semantic objects. Second, image retrieval is by nature an ambiguous problem: the user is interested in only part of the regions in the image, and this part of interest represents the user's query intention, while most of the remaining regions are irrelevant to it. A retrieval strategy based on matching all regions therefore cannot reflect the user's retrieval purpose, and the irrelevant regions are often hard to match correctly, which reduces retrieval accuracy. On the other hand, asking the user to manually select the region of interest invisibly increases the user's workload, and users are not accustomed to this query mode.
Disclosure of Invention
If the user's psychological perception and behavior during retrieval can be taken into account, the regions the user is interested in can be extracted from the image effectively and automatically, weights can be assigned adaptively to each region, and the similarity matching algorithm can be designed around these weights, then the user's subjective retrieval requirements are better satisfied. This is one of the effective ways to improve the performance of region-based image retrieval algorithms.
Observation shows that during image retrieval a user is generally interested in only part of an image. Therefore, how to describe or infer the differences in the user's perception of the different image regions becomes an important means of narrowing the semantic gap and improving region-based image retrieval performance. The visual attention mechanism (Visual Attention Mechanism) can effectively simulate the user's perception of the salient parts of an image, and when applied to image retrieval it can alleviate the ambiguity problem of image retrieval well.
In order to overcome the defects of the prior art, the invention provides a method for extracting a salient edge and a salient region based on a visual attention mechanism, and based on the method, the image retrieval at a region level is realized. The method has the advantages that: (1) introducing a visual attention mechanism theory in a visual cognition theory into image retrieval; (2) the method is a complete bottom-up retrieval mode, and user burden brought by user feedback is not needed; (3) meanwhile, the salient edge information and the salient region information in the image are considered, the fusion retrieval mode is realized, and the image retrieval performance is improved.
The inherent ambiguity of the image retrieval problem dictates that image retrieval should be a localized (Localized) retrieval mode. The starting point of the invention is how to adopt a completely bottom-up mechanism that describes the differences in importance among the regions of an image directly from the image content, without any user feedback mechanism; the automatically extracted information of interest is used to represent the user's query purpose, and region-level image semantic retrieval close to the user's perception is realized through feature extraction and matching.
Observation also shows that in most cases the user's query concept (i.e., the region of the image the user perceives as interesting) is consistent with the salient information of the image. Aiming at the problems of existing global- and region-based image retrieval, the invention provides a completely data-driven image retrieval method based on a visual attention mechanism model. The invention understands image semantics from the user's point of view as far as possible without increasing the user's interaction burden, and stays close to the user's perception so as to improve retrieval performance.
The invention first proposes a saliency map generation algorithm based on a three-level Gaussian pyramid: a Gaussian pyramid model is constructed for the original image, a saliency map is computed for each pyramid level based on contrast theory, and the final saliency map is generated by inter-level fusion. This reduces the computational complexity of the saliency map while preserving its effectiveness. The saliency map reflects the distribution of salient information in the original image and simulates the user's understanding of the salient semantics of the image: the parts that attract the user's attention have higher saliency and appear brighter in the corresponding saliency map. Next, the Canny operator and the JSEG image segmentation algorithm are used to extract the edge map and the segmentation map corresponding to the original image. On the basis of the saliency map, the invention proposes a definition strategy for salient edges and salient regions, which effectively extracts the salient edge information and salient region information in the image; the salient edge extraction strategy jointly considers the edge length and the saliency strength of the neighborhood of each edge pixel. Then, for the extracted salient edges and salient regions, the concepts of the Salient Edge Direction Histogram and the Salient Region Adjacency Graph (salient region connectivity subgraph) are proposed respectively. The salient edge direction histogram is used as the global salient feature description of the image and the salient region connectivity subgraph as its local feature description; the salient edge information and region information are organically fused to realize multi-feature fused retrieval.
For the purpose of further illustrating the principles and features of the present invention, reference will now be made in detail to the present invention, examples of which are illustrated in the accompanying drawings.
Drawings
FIG. 1 is a flow diagram illustrating a salient edge and salient region extraction process based on a visual attention mechanism, in accordance with an embodiment of the present invention.
Fig. 2 is a schematic diagram of a saliency map generation result based on a three-layer gaussian pyramid according to an embodiment of the present invention.
FIG. 3 is a partial saliency map result representation in accordance with one embodiment of the present invention.
Fig. 4 is a diagram illustrating a partial saliency edge map extraction result according to an embodiment of the present invention.
Fig. 5 is an example of salient region extraction in an image according to an embodiment of the present invention.
FIG. 6 is an illustration of a MSRA salient object image library portion in accordance with an embodiment of the present invention.
FIG. 7 is a graphical illustration of a comparison of an extracted salient region to a reference set, in accordance with an embodiment of the present invention.
Fig. 8 is a salient region extraction precision result according to an embodiment of the invention.
FIG. 9 is a diagram of an exemplary image of a SIVAL image library, in accordance with one embodiment of the present invention.
FIG. 10 is a graph illustrating the comparison of SIVAL library-based search performance according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of an example of salient edge and salient region extraction in an image (the image is derived from a COREL library) according to an embodiment of the invention.
Fig. 12 is an example of image retrieval results based on the COREL library according to an embodiment of the present invention.
FIG. 13 is a comparison of average precision and average recall performance of different algorithms in accordance with an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The general framework of the algorithm of the present invention is shown in fig. 1. The generation of saliency maps using a visual attention mechanism computational model is the basis for the overall search framework, and the specific methods for implementing each module of the framework are described below.
The Canny edge operator and the JSEG image segmentation algorithm are used to extract the edge image and the segmentation image corresponding to the original input image, respectively. At the same time, the saliency map generation algorithm based on the visual attention mechanism proposed in this invention is used to extract the saliency map corresponding to the original image. The edge image and the segmentation image are then each organically fused with the extracted saliency map, i.e., the edge and segmentation results are superimposed on the saliency map, so as to obtain the salient edge map and the salient region map of the original input image.
As shown in fig. 1, step 1, processing the leftmost original input image by using a Canny edge operator to obtain a corresponding edge image;
step 2, processing an original input image by utilizing Saliency mapping (namely a Saliency map generation algorithm based on a visual attention mechanism provided by the invention) to obtain a corresponding Saliency image;
step 3, processing the original input image by utilizing Segmentation mapping (the mapping is the prior art, namely JSEG algorithm) to obtain a corresponding Segmentation image;
step 4, carrying out salient edge extraction by using the edge image obtained in the step 1 and the salient image obtained in the step 2 to obtain a salient edge image;
and 5, extracting a salient region by using the salient image obtained in the step 2 and the segmentation image obtained in the step 3 to obtain a salient region image.
Step 1: image saliency map generation algorithm based on visual attention mechanism model
Because the aim of the invention is to extract salient edge and salient region information from images using a visual attention mechanism model and realize semantic retrieval of images, an effective visual attention computation model with low computational complexity is required to meet the real-time requirement of image retrieval. The invention provides a visual attention computation model based on a three-level Gaussian pyramid and contrast theory. A strategy of simple linear fusion of the multi-scale contrast feature values is adopted to reduce computational complexity. In general, the algorithm is a saliency map computation model of "intra-scale center-surround contrast and inter-scale interpolation fusion". The algorithm not only considers the effectiveness of saliency map generation but also saves system overhead, which helps meet the real-time requirement of image retrieval.
Fig. 2 shows a schematic diagram of the saliency map calculation method, where (a) represents a multi-scale gaussian pyramid representation of a given image, (b) represents a corresponding saliency map at each scale, and (c) represents a saliency map obtained by multi-scale fusion. Fig. 2 is divided into 3 steps: (1) extracting visual features of color brightness; (2) extracting directional visual features; (3) the multi-feature fusion generates a saliency map.
According to fig. 2, step 1: visual features of color and brightness are extracted from the original image using a multi-scale, three-level Gaussian pyramid representation. "Multi-scale" means the image is represented at multiple resolutions: the original image is down-sampled so that the image size becomes 1/2, 1/4, etc. of the original. This step corresponds to (a) in fig. 2.
The input image is decomposed into a series of "channels" of brightness, red, blue, yellow, and orientation, based on features extracted from the visual cortex in early vision. The input image is first represented as a 3-level gaussian pyramid. Where layer 0 is the input image and layers 1 through 2 are formed by filtering and down-sampling the input image with a 9 x 9 gaussian filter, respectively, of sizes 1/2 through 1/4 of the input image, respectively. Assuming that I is the input image of the current scale, r, g, b respectively represent the three color channels of red, green, and blue of the image. The luminance map I (x) corresponding to I is calculated as follows:
I(x)=0.299r(x)+0.587g(x)+0.114b(x) (1)
where x represents a pixel in I.
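As an illustration of this step, the sketch below (not part of the patent text; the library choices and parameter values such as the filter sigma are assumptions) builds a three-level Gaussian pyramid and computes the luminance map of formula (1) in Python.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def luminance(img):
    """Formula (1): I(x) = 0.299 r(x) + 0.587 g(x) + 0.114 b(x) for an RGB image."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def gaussian_pyramid(img, levels=3):
    """Level 0 is the input image; each further level is Gaussian-smoothed and
    downsampled by 2, giving sizes 1/2 and 1/4 of the input (the sigma only
    approximates the 9 x 9 kernel mentioned in the text)."""
    pyramid = [img.astype(np.float64)]
    for _ in range(1, levels):
        smoothed = gaussian_filter(pyramid[-1], sigma=(1.5, 1.5, 0))
        pyramid.append(smoothed[::2, ::2, :])
    return pyramid
```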
Then, in order to separate the hue information from the luminance, r, g, b is divided by the luminance value to be normalized. This results in four color channels: r (red), G (green), B (blue) and Y (yellow). (if the channel value is negative, it is set to zero)
R(x)=r(x)-[g(x)+b(x)]/2,
G(x)=g(x)-[r(x)+b(x)]/2,
(2)
B(x)=b(x)-[r(x)+g(x)]/2,
Y(x)=[r(x)+g(x)]/2-|r(x)-g(x)|/2-b(x).
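The broadband channels of formula (2) could be computed as in the following minimal sketch, assuming r, g, b have already been normalized by the luminance as described above; negative responses are clamped to zero as stated.

```python
import numpy as np

def broadband_channels(r, g, b):
    """Formula (2): broadband R, G, B, Y channels; negative responses are set to zero."""
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    return tuple(np.maximum(c, 0.0) for c in (R, G, B, Y))
```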
According to visual perception theory, what attracts visual attention is the contrast of features rather than their absolute values. Therefore, to simulate the center-surround antagonism of receptive fields in the human visual system (i.e., the characteristic center-surround contrast effect of the visual system), an intra-level contrast scheme is adopted for each visual feature: a neighborhood window is opened around each pixel and treated as the receptive field, and the feature contrast between the current pixel and the other pixels in the neighborhood window is computed and used as the saliency value of that pixel at the current scale. In the human cerebral cortex, color features are represented in a system of so-called "double-opponent channels". Thus, the color feature RG is established in the model to characterize the double opponency of the red/green color pair, and the color feature BY to characterize the double opponency of the blue/yellow color pair:
RG(x,y)=|(R(x)-G(x))-(R(y)-G(y))|/2,
(3)
BY(x,y)=|(B(x)-Y(x))-(B(y)-Y(y))|/2
where x denotes a given pixel in the image at the current scale and y ∈ Θ_x denotes a neighboring pixel in the neighborhood window centered on x, the size of the window being 3 × 3. The color contrast difference between x and y is defined as:

$$\Delta C(x,y)=\sqrt{\eta_{RG}^{2}\,RG^{2}(x,y)+\eta_{BY}^{2}\,BY^{2}(x,y)}\qquad(4)$$

where η_RG and η_BY denote the weights of the two color features. Here,

$$\eta_{RG}=\frac{R(x)+R(y)+G(x)+G(y)}{R(x)+R(y)+G(x)+G(y)+B(x)+B(y)+Y(x)+Y(y)},\qquad
\eta_{BY}=\frac{2\sqrt{B(x)^{2}+B(y)^{2}+Y(x)^{2}+Y(y)^{2}}}{3\times 255}\qquad(5)$$
the brightness contrast difference between the two pixels x and y is simply defined as:
ΔI(x,y)=|I(x)-I(y)| (6)
combining the above feature contrasts, the color-brightness contrast difference between x and y can be obtained as:
$$S_{CI}(x,y)=\sqrt{\Delta C(x,y)^{2}+\Delta I(x,y)^{2}}\qquad(7)$$
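For clarity, a hedged sketch of formulas (3) through (7) for a single pixel pair is given below; in practice the computation would be vectorized over the 3 × 3 neighborhood of every pixel, and the dictionary-based pixel representation is only illustrative.

```python
import math

def color_luminance_contrast(px, py):
    """px, py: dicts holding the broadband channels R, G, B, Y and luminance I of two pixels."""
    RG = abs((px["R"] - px["G"]) - (py["R"] - py["G"])) / 2.0            # formula (3)
    BY = abs((px["B"] - px["Y"]) - (py["B"] - py["Y"])) / 2.0
    denom = sum(px[c] + py[c] for c in ("R", "G", "B", "Y"))
    eta_rg = (px["R"] + py["R"] + px["G"] + py["G"]) / denom if denom else 0.0   # formula (5)
    eta_by = 2.0 * math.sqrt(px["B"] ** 2 + py["B"] ** 2 + px["Y"] ** 2 + py["Y"] ** 2) / (3 * 255)
    delta_c = math.sqrt((eta_rg * RG) ** 2 + (eta_by * BY) ** 2)         # formula (4)
    delta_i = abs(px["I"] - py["I"])                                     # formula (6)
    return math.sqrt(delta_c ** 2 + delta_i ** 2)                        # formula (7)
```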
and 2, respectively calculating a saliency map corresponding to the image under each scale level through directional visual feature extraction. This step corresponds to (b) in fig. 2.
The directional features of the image can be regarded as shape information of an object in the image, and the image with a given scale is filtered by adopting Gabor filters (0, pi/4, pi/2 and 3 pi/4) in four directions, so that directional feature mapping maps in 4 directions can be obtained. Here, the direction characteristic difference between the pixel points x and y is defined as:
$$S_{O}(x,y)=\sum_{\theta\in\{0,\,\pi/4,\,\pi/2,\,3\pi/4\}}\Delta O(x,y,\theta)\qquad(8)$$

where θ denotes the filtering direction, θ ∈ {0, π/4, π/2, 3π/4}, and ΔO(x, y, θ) denotes the feature difference between pixels x and y in direction θ; S_O(x, y) is the sum of the feature differences over the four directions.
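A possible realization of the orientation term of formula (8) is sketched below, assuming the four-direction Gabor responses are obtained with scikit-image's gabor filter and that ΔO is taken as the absolute difference of the Gabor magnitude responses; the frequency parameter is an assumption.

```python
import numpy as np
from skimage.filters import gabor

THETAS = (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)

def orientation_maps(gray, frequency=0.25):
    """One Gabor magnitude map per direction; the frequency value is an assumed parameter."""
    maps = []
    for theta in THETAS:
        real, imag = gabor(gray, frequency=frequency, theta=theta)
        maps.append(np.hypot(real, imag))
    return maps

def orientation_contrast(maps, x, y):
    """Formula (8): sum over the four directions of the response difference between
    pixels x and y (each given as a (row, col) tuple)."""
    return sum(abs(m[x] - m[y]) for m in maps)
```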
Step 3: through multi-feature fusion, the three levels of saliency maps are fused back to the original scale to generate the final saliency map. This step corresponds to (c) in fig. 2.
Given color brightness characteristics and direction characteristics definitions, the saliency value of any pixel point x in an image is calculated as follows:
$$SP(x)=\sum_{l=1}^{L}\sum_{y\in\Theta_{x}}\left(\gamma_{CI}\,S_{CI}^{\,l}(x,y)+\gamma_{O}\,S_{O}^{\,l}(x,y)\right)\qquad(9)$$

where l = 1, ..., L indexes the levels of the Gaussian pyramid, and taking L = 3 means a three-level Gaussian pyramid is built; y ∈ Θ_x denotes a neighboring pixel in the neighborhood Θ_x of pixel x, with Θ_x taken as the 3 × 3 neighborhood centered on x, serving as the receptive field; γ_CI and γ_O denote the weights of the color-brightness contrast value and the orientation contrast value respectively, and for simplicity γ_CI = γ_O = 1.
Step 4: to remove noise points in the saliency map, the saliency map obtained in step 3 is filtered with a Gaussian filter G(x) of standard deviation σ_G = 1, i.e. the saliency map is convolved with G(x).
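Putting the pieces together, the following sketch outlines formula (9) and the smoothing of step 4: a generic 3 × 3 center-surround contrast helper (the inner sum over y ∈ Θ_x), inter-scale interpolation back to the original resolution, summation over the levels, and a final Gaussian filter with σ_G = 1. The per-level contrast maps S_CI and S_O are assumed to have been computed already, e.g. with the helpers sketched earlier.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def center_surround_contrast(feature_map):
    """Inner sum of formula (9) for one feature map: accumulate |f(x) - f(y)| over the
    eight neighbours y of every pixel x (the 3 x 3 receptive field)."""
    contrast = np.zeros_like(feature_map)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            shifted = np.roll(feature_map, shift=(dr, dc), axis=(0, 1))
            contrast += np.abs(feature_map - shifted)
    return contrast

def fuse_levels(level_maps, out_shape, gamma_ci=1.0, gamma_o=1.0, sigma_g=1.0):
    """Formula (9) plus step 4: level_maps is a list of (S_CI, S_O) contrast-map pairs,
    one pair per pyramid level; coarse levels are interpolated back to the original
    size, summed, and the result is smoothed with a Gaussian of sigma_G = 1."""
    sp = np.zeros(out_shape)
    for s_ci, s_o in level_maps:
        level = gamma_ci * s_ci + gamma_o * s_o
        factors = (out_shape[0] / level.shape[0], out_shape[1] / level.shape[1])
        sp += zoom(level, factors, order=1)[:out_shape[0], :out_shape[1]]
    return gaussian_filter(sp, sigma=sigma_g)
```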
Fig. 3 shows two saliency map examples: the left column is the original image and the right column is the saliency map obtained with the computational model proposed by the invention; the brighter a point in fig. 3, the more salient it is. As can be seen from fig. 3, the visual attention computation model proposed by the invention reflects the degree of saliency of each region in the image well.
Step 2: salient edge extraction algorithm
The edge information is a basic feature of the image and is also a place on the image where the gray scale changes most intensely. It contains rich intrinsic information (such as direction, step property, shape and the like) and is one of the important characteristics of image recognition. Since the human visual system is sensitive to the edge of the image, the extraction of the edge features is simple, and the amount of calculation is small, the edge information has been considered as one of the effective means for describing the image. Although edges may describe shapes better, this does not mean that all edges are beneficial for shape description and image retrieval. Experiments have shown that a person, when observing the edges of objects in an image, tends to remember the prominent edges, i.e. the longer edges, among them, while ignoring the detailed edges, i.e. the short edges. However, the image edges extracted by the edge detection operator such as the Canny operator often include a large number of short and small fragmentary edges, which is not favorable for representing the semantic information of the image.
How to extract the significant edge with high human perception degree from the initial edge image is an important research idea for understanding the image semantics. The saliency map provides clues to locate salient portions in the image. Since the human visual system is sensitive to salient portions in an image, it is of course sensitive to edge information around salient regions. The existing method mostly considers the use of a saliency map to extract a salient region in an image, while edge information is one of important features for describing semantic content of the image, and currently, no work is available for extracting a salient edge in the image by using the saliency map. The invention provides an effective salient edge extraction algorithm based on a salient map.
First, for a given image, an initial edge map is obtained with the Canny edge detector. The purpose of edge detection is to find the pixels where the gray level changes abruptly; because the Canny operator has good localization and thinning properties, it is used to extract the initial edge information of the image. Let E = {e_1, e_2, ..., e_N} denote the set of all edges in the initial edge map.
Secondly, combining the saliency map with human visual habits, two basic elements of a salient edge are defined: edge length and average edge saliency. For any edge e_i ∈ E, its saliency is defined as:
SE(e_i) = λ_L L(e_i) + λ_S SP(e_i),  i = 1, ..., N    (10)
where L(e_i) denotes the length of edge e_i and SP(e_i) is the average saliency value of edge e_i based on the saliency map; N is the initial total number of edges in the given image. L(e_i) and SP(e_i) are each normalized to the interval [0, 1]. λ_L and λ_S are the weights of the edge length and the average edge saliency in the edge saliency definition, and are set to 0.3 and 0.7 respectively.
SP(e_i) is defined as follows: the average saliency of edge e_i is described indirectly through the saliency of the neighborhood sub-windows of the pixels that make up the edge. To eliminate the influence of singular points, a 3 × 3 neighborhood of each pixel on edge e_i is considered, and the saliency of this sub-window is used to measure the saliency of the pixel:

$$SP(e_i)=\frac{1}{L(e_i)}\sum_{n=1}^{L(e_i)}\;\sum_{x\in w_{p_n^i}} SP(x)\qquad(11)$$

where w_{p_n^i} denotes the 3 × 3 sub-window centered on pixel p_n^i, and SP(x) denotes the saliency value of pixel x on the saliency map.
After the saliency of each edge in the initial edge map is defined, an empirical threshold T_E = max(SE(e_i))/4 is introduced to extract the set of salient edges from the initial edge map. The final set of salient edges is defined as:

Θ_SE = {e_i | SE(e_i) > T_E, i = 1, ..., N}    (12)
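The salient edge selection of formulas (10) to (12) might be implemented as in the sketch below, assuming the Canny edges are available as lists of pixel coordinates; the weights 0.3/0.7 and the threshold max(SE)/4 follow the text, while the normalization details are assumptions.

```python
import numpy as np

def salient_edges(edges, sal_map, lam_l=0.3, lam_s=0.7):
    """edges: list of edges, each a list of (row, col) pixels from the Canny map;
    sal_map: saliency map SP. Returns the salient edge set of formula (12)."""
    lengths = np.array([len(e) for e in edges], dtype=float)
    mean_sal = []
    for e in edges:
        window_sums = []
        for r, c in e:                                   # formula (11): 3 x 3 window per edge pixel
            win = sal_map[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            window_sums.append(win.sum())
        mean_sal.append(np.sum(window_sums) / len(e))
    mean_sal = np.array(mean_sal)
    norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-12)  # normalize to [0, 1]
    se = lam_l * norm(lengths) + lam_s * norm(mean_sal)   # formula (10)
    t_e = se.max() / 4.0                                  # empirical threshold T_E
    return [e for e, s in zip(edges, se) if s > t_e]      # formula (12)
```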
fig. 4 gives an example of the partial significant edge extraction result. As can be seen from fig. 4, the salient edge extraction algorithm based on the saliency map can more effectively extract salient edge information reflecting semantic objects in the image. Fig. 4 first calculates a saliency map and a Canny edge map of an original image, then fuses the saliency map and the Canny edge map, and extracts a saliency edge map corresponding to the original image according to a saliency edge selection strategy formulated by the present invention.
As shown in fig. 4, part (a) on the left side is an original image, the original image is subjected to saliency map processing, and an initial saliency map is obtained by using the image saliency map generation algorithm based on the visual attention mechanism model, namely part (b);
then, carrying out edge processing on the original image, and obtaining an initial edge image by using a Canny edge detection operator, namely part (c);
next, a significant edge set is extracted from both (b) and (c) using an empirical threshold according to the defined edge length and the edge average significance, thereby obtaining a significant edge map, i.e., (d).
And step 3: salient region extraction algorithm
The saliency analysis of regions in an image is a problem that region-based image retrieval urgently needs to solve. Here the saliency of image regions is measured adaptively by fusing the saliency map with the image segmentation result map. The advantage of this fusion is that coherent region information can be obtained with the help of the segmentation map, so the salient region information does not have to be obtained by seed-region growing; moreover, since the saliency map itself is a blurred grayscale map, it is difficult to obtain regions with good physical meaning by growing regions on it directly. By segmenting the original image, related information such as color and texture can be fully exploited to obtain a better region division, and the segmented regions are closer to the semantic objects in the image.
a. Image segmentation based on JSEG algorithm
Image segmentation is often used as an initialization step for high-level applications such as object recognition and image retrieval. Here the JSEG segmentation algorithm is used (see document 7, Deng Y, Manjunath B S. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2001, 23(8): 800-810). The JSEG algorithm first quantizes the image colors adaptively; each pixel is represented by its corresponding quantized color, yielding a class map. Let Z be the set of all points in the class map. An element z ∈ Z can be written as z = (x, y), where x and y are the coordinates of the corresponding pixel in the image, and let m be the mean of all elements in Z. Suppose the colors in the image are quantized to C levels, i.e., Z is divided into C classes Z_i, i = 1, ..., C. Let m_i be the mean of the elements belonging to class Z_i. Define:
$$S_T=\sum_{z\in Z}\|z-m\|^{2}\qquad(13)$$
$$S_W=\sum_{i=1}^{C}\sum_{z\in Z_i}\|z-m_i\|^{2}$$
the JSEG algorithm hereby gives a definition criterion under ideal segmentation conditions:
J = S_B / S_W = (S_T - S_W) / S_W    (14)
the J value defined by the above equation uses the ratio of the inter-class distance to the intra-class distance in the class map as the uniformity of the region. The uniformity metric can reflect both uniformity across colors and uniformity across textures. Calculating the J value for a local window centered on each pixel point in the class map results in a so-called "J-map". In the J diagram, the area edge and the area center in the original image correspond to points with a large J value and a small J value, respectively. Finally, region growing and merging are carried out on the J diagram to obtain the final segmentation result.
b. Salient region extraction based on maximum entropy algorithm
After image segmentation with the JSEG algorithm, each image consists of a group of homogeneous regions. How to fuse the saliency map with the obtained segmentation map and extract the salient regions is an important step in completing semantic retrieval of the image. Because the semantic content of images is rich and varied, adaptively choosing the number of salient regions in an image remains a difficult problem. A robust salient region extraction algorithm based on the maximum entropy principle is therefore used.
Given an image I = {r_1, r_2, ..., r_|I|} segmented by the JSEG segmentation algorithm, where r_i denotes the i-th region in image I and |I| denotes the total number of regions, the saliency value SR(r_i) and the average saliency ASR(r_i) of region r_i are defined respectively as:
$$SR(r_i)=\sum_{x\in r_i}SP(x)\qquad(15)$$
$$ASR(r_i)=SR(r_i)/Area(r_i)$$
first, a threshold t is introducedRThe regions with lower average saliency values are filtered out. If ASR (r)i)<tRThen r will beiAnd removing the candidate salient region queue. Here get
Figure G2009100921640D0000155
The advantage of using pre-filtering is that background regions with larger partial region areas but lower average saliency values can be removed from the candidate saliency regions.
Secondly, according to the maximum entropy theory, the value ranges of the significant values of all the regions in the image are set to be [0, M ], and then the threshold of the significant region can be calculated as follows:
$$T_R=\arg\max_{T_R}\left(-\sum_{u=1}^{T_R}\frac{N_u}{\sum_{v=1}^{T_R}N_v}\log\frac{N_u}{\sum_{v=1}^{T_R}N_v}\;-\;\sum_{u=T_R+1}^{M}\frac{N_u}{\sum_{v=T_R+1}^{M}N_v}\log\frac{N_u}{\sum_{v=T_R+1}^{M}N_v}\right)\qquad(16)$$

where N_u is the number of regions whose saliency value equals u, and M is the maximum region saliency value in the image; the T_R that maximizes the above expression is the threshold sought. That is, a region whose saliency value is greater than T_R can be considered a salient region. Thus, the set of salient regions in a given image is defined as:
Θ_SR = {r_i | SR(r_i) ≥ T_R and ASR(r_i) ≥ t_R}    (17)
This definition prevents large-area background regions from being selected as salient regions, and also prevents small regions with high average saliency caused by over-segmentation from being selected.
Fig. 5 gives an example of salient region extraction in several images. Firstly, a JSEG segmentation algorithm is adopted for an original image to obtain a segmentation image, then the segmentation image is fused with a salient image obtained based on a visual attention mechanism in one embodiment of the invention, and the most salient region in the image is extracted. As can be seen from fig. 5, the algorithm can extract the salient regions that most express the semantic content of the image.
As shown in fig. 5, the left part (a) is an original image, and the image segmentation processing based on the JSEG algorithm is performed on the original image to obtain a part (b);
extracting a salient region of the JSEG segmentation map (namely part (b)) by using the maximum entropy algorithm to obtain a salient map (namely part (c));
then, a salient region is extracted from both the parts (b) and (c) according to a defined threshold, thereby obtaining a salient region map, i.e., part (d).
And 4, step 4: image retrieval algorithm fusing salient region and edge information
After the salient edge and the salient region of the image are extracted, the salient edge and the salient region are respectively subjected to feature description, and similarity matching of the image is respectively carried out by utilizing the extracted feature vectors. And then, fusing the matching results of the two images to obtain a final image retrieval result.
For the salient edge and the salient region, respectively, salient edge direction histograms (SEHDs) and salient region connected Subgraphs (SRAGs) are proposed to describe the semantic content of the image. It should be noted that the salient edge direction histogram, as a global feature description, in combination with the local features such as the salient region connected graph, will effectively improve the performance of image retrieval.
a. Salient edge histogram feature description
An edge direction histogram descriptor (EHD) can represent a natural image effectively; although it is sensitive to deformations of objects and scenes, it describes the spatial distribution of edges well. The EHD is computed as follows: a given image is divided evenly into 4 × 4 sub-images, and a local edge direction histogram is computed within each sub-image. Edge directions are roughly divided into five categories: horizontal, vertical, 45°, 135°, and other. Each local histogram is therefore 5-dimensional, one dimension per direction. The whole image is described by 16 such 5-dimensional local histograms, which after normalization form an 80-dimensional edge direction histogram feature vector. The distance between two images I_A and I_B based on the salient edge feature is defined as:

$$Dis_E(I_A,I_B)=\sqrt{\sum_{i=1}^{N}\left(EH_{I_A}^{\,i}-EH_{I_B}^{\,i}\right)^{2}}\qquad(18)$$

where EH_{I_A}^i and EH_{I_B}^i denote the i-th bin of the salient edge direction histograms of images I_A and I_B respectively, and N = 80 denotes the dimension of the edge histogram.
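A sketch of the 80-dimensional salient edge direction histogram and the distance of formula (18); the assignment of an edge pixel to one of the five direction categories is an assumption, since the text only names the categories, and the "other" bin is left unused here.

```python
import numpy as np

def salient_edge_direction_histogram(edge_dirs, edge_mask):
    """edge_dirs: per-pixel edge direction in degrees; edge_mask: boolean map of salient
    edge pixels. Returns the normalized 4 x 4 x 5 = 80-dimensional SEHD. Bin order:
    horizontal, vertical, 45 deg, 135 deg, other (the 'other' bin stays empty here)."""
    h, w = edge_mask.shape
    hist = np.zeros((4, 4, 5))
    for r, c in zip(*np.nonzero(edge_mask)):
        i, j = min(4 * r // h, 3), min(4 * c // w, 3)     # which of the 4 x 4 sub-images
        ang = float(edge_dirs[r, c]) % 180.0
        dists = [min(abs(ang - a), 180.0 - abs(ang - a)) for a in (0.0, 90.0, 45.0, 135.0)]
        hist[i, j, int(np.argmin(dists))] += 1
    flat = hist.ravel()
    return flat / (flat.sum() + 1e-12)

def sehd_distance(h_a, h_b):
    """Formula (18): Euclidean distance between two 80-dimensional SEHD vectors."""
    return float(np.sqrt(np.sum((np.asarray(h_a) - np.asarray(h_b)) ** 2)))
```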
b. Salient region connectivity graph characterization
As an important component of region-based image retrieval, the definition of image similarity deserves full attention; almost every region-based image retrieval system has its own definition of image similarity. Early systems defined the similarity between images through the distances between individual regions. However, this kind of definition transfers too much burden to the user and can leave the user at a loss, so such systems are hard to put into practical use. Later systems improved the definition of image similarity: they not only simplified the user interface and reduced the user's burden, but also made comprehensive use of the information of multiple regions, which makes the system robust to poor segmentation results. Here the spatial distribution of the image regions is represented by a Region Adjacency Graph (RAG). The advantage of retrieving with a region adjacency graph is that the relative spatial relationships between regions are taken into account, which is essentially different from conventional region matching algorithms such as the Integrated Region Matching (IRM) strategy, because IRM adopts a full-region matching algorithm and does not consider the relative spatial relationships between regions. The region adjacency graph representation incorporates the relative spatial position relationships between regions. After the region adjacency graph of each image is obtained, a sub-graph isomorphism algorithm is used to measure the similarity between two RAGs. According to an embodiment of the invention, combined with the salient regions selected by the visual attention mechanism, only salient regions are taken as root nodes, and a salient primitive RAG (Salient RAG, SRAG) is constructed as the basic unit for sub-graph matching between images. Since the number of salient regions in an image is usually much smaller than the number of segmented regions, this significantly reduces the number of primitive subgraphs. Specifically, the SRAG is constructed by taking each salient region in the image as a root node and the regions directly adjacent to it as leaf nodes. Suppose images I_A and I_B have m_A and m_B salient regions respectively; then the distance between I_A and I_B based on the salient region feature is defined as:
$$Dis_R(I_A,I_B)=\frac{1}{m_A}\sum_{i=1}^{m_A}\;\min_{j=1,\ldots,m_B}\left\{w_r\times dis_R\!\left(R_{I_A}^{ir},R_{I_B}^{jr}\right)+w_b\times dis_R\!\left(R_{I_A}^{ib},R_{I_B}^{jb}\right)\right\}\qquad(19)$$

where dis_R(R_{I_A}^{ir}, R_{I_B}^{jr}) denotes the distance between salient (root) regions of the two images I_A and I_B, and dis_R(R_{I_A}^{ib}, R_{I_B}^{jb}) denotes the distance between the leaf nodes attached to the two salient regions taken as root nodes. Since the numbers of leaf nodes are not equal, the minimum over the two sets of leaf nodes is used as the distance measure. w_r and w_b are the weights of the root-node similarity and the leaf-node similarity respectively; in one embodiment of the invention w_r = 0.8 and w_b = 0.2. It should be noted that in one embodiment of the invention both the similarity between the salient regions of the two images and the similarity between the regions adjacent to those salient regions are considered. In essence, this matching mode takes the context of the salient region into account and improves retrieval robustness, because a user-specified query concept is often difficult to represent with a single region or object: for example, a flower is a concept consisting of a set of petals and green leaves. Incorporating the neighbor information of the region therefore effectively improves retrieval performance.
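The subgraph matching of formula (19) could look like the sketch below; the region-level distance dis_R and the handling of unequal leaf sets (closest-leaf averaging over the smaller set) are assumptions, as the text only states that the minimum over the two leaf sets is used.

```python
import numpy as np

def region_distance(f_a, f_b):
    """Assumed region-level distance dis_R: Euclidean distance between two region feature vectors."""
    return float(np.linalg.norm(np.asarray(f_a) - np.asarray(f_b)))

def srag_distance(srags_a, srags_b, w_r=0.8, w_b=0.2):
    """Formula (19). Each SRAG is a dict {'root': feature_vector, 'leaves': [feature_vector, ...]}
    built from one salient region and its adjacent regions."""
    total = 0.0
    for ga in srags_a:
        best = np.inf
        for gb in srags_b:
            d_root = region_distance(ga["root"], gb["root"])
            if ga["leaves"] and gb["leaves"]:
                # leaf term: average, over the smaller leaf set, of the closest leaf in the other set
                small, large = sorted((ga["leaves"], gb["leaves"]), key=len)
                d_leaf = float(np.mean([min(region_distance(l, m) for m in large) for l in small]))
            else:
                d_leaf = 0.0
            best = min(best, w_r * d_root + w_b * d_leaf)
        total += best
    return total / max(len(srags_a), 1)
```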
Given the distance measures between two images I_A and I_B based on the salient edge feature and on the salient region feature, feature fusion is adopted, and the final distance between I_A and I_B is defined as:

Dis(I_A, I_B) = λ Dis_R(I_A, I_B) + (1 - λ) Dis_E(I_A, I_B)    (20)

where λ is the weight assigned to region-based retrieval and 1 - λ is the weight of edge-based retrieval. In image retrieval the edge feature is generally an important complement to the region feature, so λ = 0.7, i.e., the region-feature-based similarity measure is given the larger proportion.
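The fusion of formula (20) then reduces to a weighted sum, with λ = 0.7 as in the text; the ranking remark in the comment is only an illustrative usage note.

```python
def fused_distance(dis_r, dis_e, lam=0.7):
    """Formula (20): weighted fusion of the region-based and edge-based distances;
    smaller fused distance means higher similarity, so database images would be
    ranked by this value in ascending order."""
    return lam * dis_r + (1.0 - lam) * dis_e
```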
The invention was validated on three different image databases: the MSRA salient object image library (see document 8, Liu T, Sun J, Zheng N, Tang X, Shum H. Learning to detect a salient object. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR'07), 2007), the SIVAL image library, and the COREL image library. The effectiveness of the salient region extraction strategy proposed by the algorithm is demonstrated on the MSRA image library; verification on the SIVAL image library shows that the performance of this bottom-up retrieval algorithm approaches that of top-down retrieval algorithms based on multi-instance learning; finally, the performance is compared with existing retrieval algorithms on the COREL image library.
A large number of experimental results show that, compared with most existing global- and region-based image retrieval methods, the proposed image semantic retrieval method based on the visual attention mechanism achieves a considerable improvement both in the degree of automation and in objective measures of retrieval performance.
Significant region evaluation results
The extraction of salient regions is first evaluated on the salient object image library provided by Microsoft Research Asia (MSRA) (downloadable at http://research.microsoft.com/~jiansun/salientobject/salient_object.htm), which consists of 5000 images, each containing at least one salient object; for each image, 9 volunteers manually marked the position of the salient region with a rectangular frame. To evaluate the salient region extraction algorithm proposed by the invention effectively, a region-based precision metric is used:
<math><mrow><mi>Precision</mi><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><munderover><mi>&Sigma;</mi><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>f</mi><mrow><mo>(</mo><mfrac><msubsup><mi>I</mi><mi>i</mi><mi>c</mi></msubsup><msubsup><mi>I</mi><mi>i</mi><mi>g</mi></msubsup></mfrac><mo>-</mo><mi>&xi;</mi><mo>)</mo></mrow><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>21</mn><mo>)</mo></mrow></mrow></math>
where N is the number of images in the image library, $I_i^g$ is the salient region given by the reference set, $I_i^c$ denotes the intersection of the salient region detected in image $I_i$ with $I_i^g$, and ξ is a predetermined threshold. $f(\cdot)$ is an indicator function defined as:
$$f(x) = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases} \qquad (22)$$
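A minimal sketch of the region-based precision of equations (21) and (22), under the assumption that each detected salient region and each reference region are available as boolean masks of the same image size; the mask representation and the function name are illustrative, not part of the patent:

```python
import numpy as np

def region_precision(detected_masks, reference_masks, xi=0.5):
    """Fraction of images whose detected salient region overlaps the
    reference salient region by more than the threshold xi (eq. 21)."""
    hits = 0
    for det, ref in zip(detected_masks, reference_masks):
        inter = np.logical_and(det, ref).sum()   # |I_i^c|: detected ∩ reference
        ref_area = ref.sum()                     # |I_i^g|: reference region area
        if ref_area > 0 and inter / ref_area - xi > 0:   # indicator f(x), eq. 22
            hits += 1
    return hits / len(reference_masks)
```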
Figure 7 shows part of the experimental results and their comparison with the reference set. The first and third columns in fig. 7 show the salient regions in the reference data set, delimited by manually drawn rectangular boxes, while the second and fourth columns show the salient regions automatically extracted by the algorithm of the present invention. The comparison with the reference-set salient regions shows that fusing the saliency map with the image segmentation map can effectively extract the salient region information in the image.
Figure 8 shows how the precision varies with the choice of the threshold. The abscissa of fig. 8 is the threshold value and the ordinate is the precision of the salient regions produced by the algorithm of the present invention with respect to the reference set. As can be seen from fig. 8, even when ξ = 0.9, nearly 45% of the images yield an automatically detected salient region with a 90% match rate against the reference set; when ξ = 0.5, more than 70% of the images yield a salient region with a 50% match rate. The results on the MSRA data set therefore show that the proposed salient region extraction algorithm can extract salient regions close to the user's subjective perception, providing effective sample data for the subsequent image semantic retrieval.
Retrieval performance results
To test the performance of the multi-feature-fusion image retrieval algorithm proposed above, experiments were conducted on the SIVAL image library of Washington University in St. Louis and on the COREL image library, respectively.
(1) SIVAL image library-based search performance comparison
The SIVAL (Spatially Independent, Variable Area, and Lighting) library (http://www.cs.wustl.edu/~sg/accio/SIVAL.html) was constructed by a research team at Washington University that studies multi-instance learning and its applications. As the name indicates, the images in this library are spatially independent and vary in illumination and viewing angle. The library was built to test the performance of localized image retrieval based on multi-instance learning. It comprises 25 image sets with different semantics, each semantic category containing 60 images, for a total of 1500 images. The library focuses on verifying the idea of localized image retrieval (Localized CBIR), that is, how to learn the user's query concept through a multi-instance learning algorithm and use it for retrieval. Specifically, 25 different objects (e.g., apple, banana, Coke can, etc.) are each photographed 6 times under different viewing angles and lighting conditions in 10 different scenes, for a total of 1500 images.
FIG. 9 shows example images from the SIVAL image library. Fig. 9(a) shows examples of the 25 classes of different objects in the SIVAL image library, and fig. 9(b) shows examples of one object class (AjaxOrange) in 10 different scenes.
Since the image retrieval algorithm based on the visual attention mechanism is also a localized image retrieval method, its validity is verified by comparing it with the ACCIO image retrieval algorithm based on multi-instance learning proposed in document 9. The ACCIO algorithm relies on a relevance feedback strategy: the target semantic concept is learned from the positive and negative example images submitted by the user using a diverse density algorithm, and retrieval matching is then performed with this concept. The proposed algorithm, in contrast, is based on a computational model of visual attention and performs similarity matching with the salient edge and salient region information extracted from the images. Although the two algorithms follow a top-down and a bottom-up retrieval mode respectively, both essentially attempt to achieve a more semantically focused localized image retrieval by analysing the user's query concept. To compare the retrieval performance of the two algorithms objectively, the SIVAL image library is first segmented with the JSEG algorithm, and each segmented region is characterised by a 30-dimensional visual feature vector (including color moment features, Gabor texture features, region shape and area). Each algorithm runs 5 rounds of experiments; the precision of each round (i.e. the number of positive example images among the first 60 images returned by the system) is computed, and the average precision over the 5 rounds is used as the evaluation index. In each round, for the ACCIO algorithm, 8 positive and 8 negative example images are randomly selected for each semantic category, and the diverse density algorithm learns a target concept feature vector for the category from these 16 images; for the proposed algorithm, one positive example image is randomly drawn from each semantic category for retrieval matching. The comparison of the two is shown in fig. 10.
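For clarity, a sketch of this evaluation protocol only: precision@60 averaged over 5 rounds, with one randomly drawn positive example per round. The `retrieve` function standing in for the actual retrieval system, and both function names, are hypothetical.

```python
import random

def precision_at_k(ranked_labels, k=60):
    """Fraction of positive-class images among the first k retrieved results;
    ranked_labels is the 0/1 relevance list returned for one query."""
    return sum(ranked_labels[:k]) / float(k)

def category_precision(category_images, retrieve, rounds=5, k=60):
    """Average precision@k over several rounds; in each round one positive
    example image of the category is drawn at random as the query.
    `retrieve(query)` is a hypothetical function returning the 0/1 relevance
    labels of the ranked database images for that query."""
    scores = []
    for _ in range(rounds):
        query = random.choice(category_images)
        scores.append(precision_at_k(retrieve(query), k))
    return sum(scores) / len(scores)
```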
The abscissa of fig. 10 represents the semantic categories (25 different object classes in total); the class names and their descriptions for the 25 object classes in fig. 10 are as follows:
ajaxorange: orange-scented detergent
apple: apple
banana: banana
bluescrunge: blue dish-washing cloth
candlewithholder: candle with holder
cardboardbox: cardboard box
checkeredscarf: checkered scarf
cokecan: Coke pop-top can
dataminingbook: data mining book
dirtyrunningshoe: dirty running shoe
dirtyworkgloves: dirty work gloves
fabricsoftenerbox: fabric softener box
feltflowerrug: felt flower rug
glazedwoodpot: glazed wooden pot
goldmedal: gold medal
greenteabox: green tea box
juliespot: Julie's pot
largespoon: large (soup) spoon
rapbook: rap book
smileyfacedoll: smiley-face doll
spritecan: Sprite pop-top can
stripednotebook: striped notebook
translucentbowl: translucent bowl
wd40can: WD-40 can (universal lubricant)
woodrollingpin: wooden rolling pin
The ordinate represents the average precision of each algorithm, in percent. The experimental comparison shows that the retrieval precision of the proposed algorithm is close to that of the ACCIO algorithm, and for some categories it is even better. Overall, compared with the ACCIO algorithm, the proposed algorithm needs no user feedback: it uses a completely bottom-up mechanism to extract the salient information in the image as a representation of the user's query concept and uses it for retrieval matching.
(2) COREL library-based search performance comparison
The COREL image library is the most commonly used benchmark set in the field of image retrieval. Its content covers categories such as people, natural landscapes, animals, plants and buildings. Experiment (2) uses a subset of the COREL library containing 50 semantic categories with 100 images each, i.e. 5000 images in total, for retrieval. During retrieval, a result belonging to the same semantic category as the sample image is considered correct, otherwise it is considered wrong. The experiment uses query-by-example: the images in the library are ranked by increasing distance to the query image. As the query set, 500 images are randomly selected from the 50 categories. The software environment of the algorithm is MATLAB 6.5 under Windows XP; the hardware is a Pentium 4 3.0 GHz computer with 1 GB of memory. The evaluation indices are the recall and precision commonly used in image retrieval, defined as follows:
The recall is defined as: $\mathrm{Recall}(N) = C_N / M$  (23)

The precision is defined as: $\mathrm{Precision}(N) = C_N / N$
where N is the number of images returned by the current retrieval; $C_N$ is the number of images among the N retrieved images that belong to the same semantic category as the query image; and M is the number of images in the library that belong to that semantic category. In the given COREL image library, M = 100 because each semantic category contains 100 images.
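A minimal sketch of these two measures, assuming the system returns a ranked 0/1 relevance list for the query (the function name is illustrative; M = 100 for the COREL subset used here):

```python
def recall_and_precision(ranked_labels, n, m=100):
    """Recall(N) = C_N / M and Precision(N) = C_N / N, where C_N is the number
    of images of the query's semantic class among the first N results and M is
    the class size in the library (100 images per class in this COREL subset)."""
    c_n = sum(ranked_labels[:n])
    return c_n / float(m), c_n / float(n)
```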
To verify the effectiveness of the invention, the proposed method is compared with several algorithms: (1) an algorithm based on global color and texture features (here, a 72-dimensional HSV color histogram), denoted "HSV 72-bin"; (2) the classic integrated region matching algorithm, denoted "IRM"; (3) retrieval using only the salient regions, denoted "SR Only"; (4) retrieval using only the salient edges, denoted "SE Only"; (5) the retrieval algorithm fusing salient edges and salient regions, denoted "Proposed Fusion".
Fig. 11 gives examples of salient edge and salient region extraction on the COREL library. In fig. 11, each group of 6 images shows, from top left to bottom right: the original image, the edge map, the region segmentation map, the salient edge map and the salient region map.
Fig. 12 shows some example retrieval results: the first column is the query image, and the remaining four columns are the 4 images retrieved from the library that are most similar to it. In the figure, "Query Image" denotes the image to be retrieved and "Top 4 matches" denotes the 4 best-matching images.
FIG. 13 compares the average precision and the average recall of the different algorithms; the left graph compares the average precision of each retrieval algorithm and the right graph compares the average recall. It is evident from fig. 13 that the retrieval performance obtained by fusing salient edges and salient regions is better than both the global-feature-based retrieval algorithm and the region-based retrieval algorithm using the IRM matching strategy.
In fig. 13:
Number of Retrievals: the number of retrieved images;
Average Precision: the average retrieval precision.
The curves are denoted as:
Proposed Fusion: the algorithm proposed by the invention;
SR Only: the retrieval algorithm based only on salient regions;
IRM: the region-based retrieval algorithm using the IRM matching strategy;
HSV-72: the retrieval algorithm using a 72-dimensional global color histogram in HSV color space;
SE Only: the retrieval algorithm based only on salient edges.
Although specific embodiments of the present invention have been described above, it will be understood by those skilled in the art that these specific embodiments are merely illustrative, and that various omissions, substitutions and changes in the form and details of the methods and systems described above may be made without departing from the spirit and scope of the invention. For example, combining the steps of the above-described methods so as to perform substantially the same function in substantially the same way to achieve substantially the same result falls within the scope of the present invention. Accordingly, the scope of the invention is to be limited only by the appended claims.

Claims (9)

1. A visual attention model-based image semantic retrieval method comprises the following steps:
step 1: inputting an original image;
step 2: generating a saliency map, an edge map and a region segmentation map corresponding to the original image;
step 3: generating a salient edge map by using the saliency map and the edge map corresponding to the original image; generating a salient region map by using the saliency map and the region segmentation map corresponding to the original image;
step 4: generating a salient edge feature and a salient region feature by using the salient edge map and the salient region map;
step 5: fusing the salient edge features and the salient region features to perform image retrieval.
2. The method of claim 1, wherein the step 2 of generating the saliency map corresponding to the original image comprises:
step 2-1: extracting color and brightness visual features using a three-layer Gaussian pyramid representation;
step 2-2: respectively calculating a saliency map corresponding to the image under each scale level through directional visual feature extraction;
step 2-3: generating a saliency map corresponding to the original image through multi-feature fusion;
step 2-4: and filtering the saliency map obtained in the step 2-3 by using a Gaussian filter.
3. The method of claim 1, wherein the step 2 of generating the edge map corresponding to the original image comprises: and processing the original image by using an edge detection operator to generate an edge image.
4. The method of claim 1, wherein the step 2 of generating the segmentation map corresponding to the region of the original image comprises: the original image is subjected to image segmentation processing to generate a region segmentation image.
5. The method according to any one of claims 1 to 4, wherein the step 3 of generating the salient edge map using the saliency map and the edge map corresponding to the original image comprises:
step 5-1: extracting a salient edge set from the edge map according to the defined edge length, the average edge saliency and a first empirical threshold;
step 5-2: generating a salient edge map corresponding to the original image according to the salient edge set in the initial edge map and the saliency map corresponding to the original image.
6. Method according to one of claims 1 to 4, characterized in that the edge detection operator is a Canny operator.
7. The method according to any one of claims 1 to 4, wherein the step 3 of generating the salient region map using the saliency map and the region segmentation map corresponding to the original image comprises:
step 7-1: extracting a salient region of the region segmentation image by using a maximum entropy algorithm;
step 7-2: extracting the salient region according to a second empirical threshold, the saliency map corresponding to the original image and the region segmentation image, so as to obtain the salient region map.
8. The method according to claim 4, characterized in that the image segmentation is performed using the JSEG algorithm.
9. The method of claim 1, wherein the salient edge features in step 4 are salient edge direction histograms and the salient region features are salient region connected subgraphs.
CN200910092164A 2009-09-03 2009-09-03 Image semantic retrieving method based on visual attention model Pending CN101706780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910092164A CN101706780A (en) 2009-09-03 2009-09-03 Image semantic retrieving method based on visual attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910092164A CN101706780A (en) 2009-09-03 2009-09-03 Image semantic retrieving method based on visual attention model

Publications (1)

Publication Number Publication Date
CN101706780A true CN101706780A (en) 2010-05-12

Family

ID=42377006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910092164A Pending CN101706780A (en) 2009-09-03 2009-09-03 Image semantic retrieving method based on visual attention model

Country Status (1)

Country Link
CN (1) CN101706780A (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102005057A (en) * 2010-11-17 2011-04-06 中国科学院声学研究所 Method for detecting region of interest of color image
CN102054178A (en) * 2011-01-20 2011-05-11 北京联合大学 Chinese painting image identifying method based on local semantic concept
CN102521592A (en) * 2011-11-30 2012-06-27 苏州大学 Multi-feature fusion salient region extracting method based on non-clear region inhibition
CN102999924A (en) * 2011-09-09 2013-03-27 富士施乐株式会社 Image processing apparatus and image processing method
CN103778240A (en) * 2014-02-10 2014-05-07 中国人民解放军信息工程大学 Image retrieval method based on functional magnetic resonance imaging and image dictionary sparse decomposition
CN103838864A (en) * 2014-03-20 2014-06-04 北京工业大学 Visual saliency and visual phrase combined image retrieval method
CN103985130A (en) * 2014-05-27 2014-08-13 华东理工大学 Image significance analysis method for complex texture images
CN104376105A (en) * 2014-11-26 2015-02-25 北京航空航天大学 Feature fusing system and method for low-level visual features and text description information of images in social media
CN105118063A (en) * 2015-09-07 2015-12-02 同济大学 Image and microwave remote sensing technology based salient object detection method
CN105550685A (en) * 2015-12-11 2016-05-04 哈尔滨工业大学 Visual attention mechanism based region-of-interest extraction method for large-format remote sensing image
CN106326902A (en) * 2016-08-30 2017-01-11 刘广海 Image retrieval method based on significance structure histogram
CN107016682A (en) * 2017-04-11 2017-08-04 四川大学 A kind of notable object self-adapting division method of natural image
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN108229504A (en) * 2018-01-29 2018-06-29 深圳市商汤科技有限公司 Method for analyzing image and device
CN108537235A (en) * 2018-03-27 2018-09-14 北京大学 A kind of method of low complex degree scale pyramid extraction characteristics of image
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN108595422A (en) * 2018-04-13 2018-09-28 卓望信息技术(北京)有限公司 A method of the bad multimedia message of filtering
CN108711160A (en) * 2018-05-18 2018-10-26 西南石油大学 A kind of Target Segmentation method based on HSI enhancement models
CN109191495A (en) * 2018-07-17 2019-01-11 东南大学 Black smoke vehicle detection method based on self-organizing background subtraction model and multiple features fusion
CN109256184A (en) * 2018-07-30 2019-01-22 邓建晖 One kind is based on cognition and memory identification and restoration methods and system
CN109658455A (en) * 2017-10-11 2019-04-19 阿里巴巴集团控股有限公司 Image processing method and processing equipment
CN109711441A (en) * 2018-12-13 2019-05-03 泰康保险集团股份有限公司 Image classification method, device, storage medium and electronic equipment
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110544217A (en) * 2019-08-30 2019-12-06 深圳市商汤科技有限公司 image processing method and device, electronic equipment and storage medium
CN110691548A (en) * 2017-07-28 2020-01-14 谷歌有限责任公司 System and method for predicting and summarizing medical events from electronic health records
CN110991617A (en) * 2019-12-02 2020-04-10 华东师范大学 Construction method of kaleidoscope convolution network
CN111324758A (en) * 2020-02-14 2020-06-23 北京工业大学 Image description method based on divergence-convergence attention
CN111369624A (en) * 2020-02-28 2020-07-03 北京百度网讯科技有限公司 Positioning method and device
CN111507141A (en) * 2019-01-31 2020-08-07 阿里巴巴集团控股有限公司 Picture identification method, service interface display method, system and equipment
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
WO2020244287A1 (en) * 2019-06-03 2020-12-10 中国矿业大学 Method for generating image semantic description
CN112214620A (en) * 2020-09-25 2021-01-12 北京百度网讯科技有限公司 Information query method and device, chart processing method and electronic equipment
CN112685591A (en) * 2020-12-31 2021-04-20 荆门汇易佳信息科技有限公司 Accurate picture retrieval method for user interest area and feedback guidance
CN112927659A (en) * 2021-02-08 2021-06-08 捷开通讯(深圳)有限公司 Method, device and storage medium for adjusting basic three-primary-color data of display screen
CN112965968A (en) * 2021-03-04 2021-06-15 湖南大学 Attention mechanism-based heterogeneous data pattern matching method
CN113673635A (en) * 2020-05-15 2021-11-19 复旦大学 Self-supervision learning task-based hand-drawn sketch understanding deep learning method
CN115797340A (en) * 2023-02-03 2023-03-14 西南石油大学 Industrial surface defect detection method based on multi-instance learning

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102005057B (en) * 2010-11-17 2012-07-25 中国科学院声学研究所 Method for detecting region of interest of color image
CN102005057A (en) * 2010-11-17 2011-04-06 中国科学院声学研究所 Method for detecting region of interest of color image
CN102054178B (en) * 2011-01-20 2016-08-17 北京联合大学 A kind of image of Chinese Painting recognition methods based on local semantic concept
CN102054178A (en) * 2011-01-20 2011-05-11 北京联合大学 Chinese painting image identifying method based on local semantic concept
CN102999924A (en) * 2011-09-09 2013-03-27 富士施乐株式会社 Image processing apparatus and image processing method
CN102521592A (en) * 2011-11-30 2012-06-27 苏州大学 Multi-feature fusion salient region extracting method based on non-clear region inhibition
CN102521592B (en) * 2011-11-30 2013-06-12 苏州大学 Multi-feature fusion salient region extracting method based on non-clear region inhibition
CN103778240B (en) * 2014-02-10 2017-04-26 中国人民解放军信息工程大学 Image retrieval method based on functional magnetic resonance imaging and image dictionary sparse decomposition
CN103778240A (en) * 2014-02-10 2014-05-07 中国人民解放军信息工程大学 Image retrieval method based on functional magnetic resonance imaging and image dictionary sparse decomposition
CN103838864A (en) * 2014-03-20 2014-06-04 北京工业大学 Visual saliency and visual phrase combined image retrieval method
CN103838864B (en) * 2014-03-20 2017-02-22 北京工业大学 Visual saliency and visual phrase combined image retrieval method
CN103985130A (en) * 2014-05-27 2014-08-13 华东理工大学 Image significance analysis method for complex texture images
CN103985130B (en) * 2014-05-27 2017-07-11 华东理工大学 A kind of saliency analysis method for complex texture image
CN104376105B (en) * 2014-11-26 2017-08-25 北京航空航天大学 The Fusion Features system and method for image low-level visual feature and text description information in a kind of Social Media
CN104376105A (en) * 2014-11-26 2015-02-25 北京航空航天大学 Feature fusing system and method for low-level visual features and text description information of images in social media
CN105118063A (en) * 2015-09-07 2015-12-02 同济大学 Image and microwave remote sensing technology based salient object detection method
CN105118063B (en) * 2015-09-07 2018-01-05 同济大学 A kind of obvious object detection method based on image and microwave remote sensing technique
CN105550685A (en) * 2015-12-11 2016-05-04 哈尔滨工业大学 Visual attention mechanism based region-of-interest extraction method for large-format remote sensing image
CN105550685B (en) * 2015-12-11 2019-01-08 哈尔滨工业大学 The large format remote sensing image area-of-interest exacting method of view-based access control model attention mechanism
CN106326902B (en) * 2016-08-30 2019-05-14 广西师范大学 Image search method based on conspicuousness structure histogram
CN106326902A (en) * 2016-08-30 2017-01-11 刘广海 Image retrieval method based on significance structure histogram
CN107016682A (en) * 2017-04-11 2017-08-04 四川大学 A kind of notable object self-adapting division method of natural image
CN110691548A (en) * 2017-07-28 2020-01-14 谷歌有限责任公司 System and method for predicting and summarizing medical events from electronic health records
CN109658455A (en) * 2017-10-11 2019-04-19 阿里巴巴集团控股有限公司 Image processing method and processing equipment
CN109658455B (en) * 2017-10-11 2023-04-18 阿里巴巴集团控股有限公司 Image processing method and processing apparatus
CN108010514B (en) * 2017-11-20 2021-09-10 四川大学 Voice classification method based on deep neural network
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN108229504A (en) * 2018-01-29 2018-06-29 深圳市商汤科技有限公司 Method for analyzing image and device
CN108229504B (en) * 2018-01-29 2020-09-08 深圳市商汤科技有限公司 Image analysis method and device
CN108549658B (en) * 2018-03-12 2021-11-30 浙江大学 Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN108537235A (en) * 2018-03-27 2018-09-14 北京大学 A kind of method of low complex degree scale pyramid extraction characteristics of image
CN108537235B (en) * 2018-03-27 2020-09-08 北京大学 Method for extracting image features by low-complexity scale pyramid
CN108595422A (en) * 2018-04-13 2018-09-28 卓望信息技术(北京)有限公司 A method of the bad multimedia message of filtering
CN108711160A (en) * 2018-05-18 2018-10-26 西南石油大学 A kind of Target Segmentation method based on HSI enhancement models
CN109191495A (en) * 2018-07-17 2019-01-11 东南大学 Black smoke vehicle detection method based on self-organizing background subtraction model and multiple features fusion
CN109256184A (en) * 2018-07-30 2019-01-22 邓建晖 One kind is based on cognition and memory identification and restoration methods and system
CN109711441A (en) * 2018-12-13 2019-05-03 泰康保险集团股份有限公司 Image classification method, device, storage medium and electronic equipment
CN111507141B (en) * 2019-01-31 2023-04-18 阿里巴巴集团控股有限公司 Picture identification method, service interface display method, system and equipment
CN111507141A (en) * 2019-01-31 2020-08-07 阿里巴巴集团控股有限公司 Picture identification method, service interface display method, system and equipment
CN111862985B (en) * 2019-05-17 2024-05-31 北京嘀嘀无限科技发展有限公司 Speech recognition device, method, electronic equipment and storage medium
CN111862985A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition device, method, electronic equipment and storage medium
WO2020244287A1 (en) * 2019-06-03 2020-12-10 中国矿业大学 Method for generating image semantic description
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
CN110544217A (en) * 2019-08-30 2019-12-06 深圳市商汤科技有限公司 image processing method and device, electronic equipment and storage medium
CN110991617B (en) * 2019-12-02 2020-12-01 华东师范大学 Construction method of kaleidoscope convolution network
CN110991617A (en) * 2019-12-02 2020-04-10 华东师范大学 Construction method of kaleidoscope convolution network
CN111324758A (en) * 2020-02-14 2020-06-23 北京工业大学 Image description method based on divergence-convergence attention
CN111324758B (en) * 2020-02-14 2022-05-17 北京工业大学 Image description method based on divergence-convergence attention
CN111369624A (en) * 2020-02-28 2020-07-03 北京百度网讯科技有限公司 Positioning method and device
CN111369624B (en) * 2020-02-28 2023-07-25 北京百度网讯科技有限公司 Positioning method and device
CN113673635A (en) * 2020-05-15 2021-11-19 复旦大学 Self-supervision learning task-based hand-drawn sketch understanding deep learning method
CN113673635B (en) * 2020-05-15 2023-09-01 复旦大学 Hand-drawn sketch understanding deep learning method based on self-supervision learning task
CN112214620A (en) * 2020-09-25 2021-01-12 北京百度网讯科技有限公司 Information query method and device, chart processing method and electronic equipment
CN112685591A (en) * 2020-12-31 2021-04-20 荆门汇易佳信息科技有限公司 Accurate picture retrieval method for user interest area and feedback guidance
CN112927659A (en) * 2021-02-08 2021-06-08 捷开通讯(深圳)有限公司 Method, device and storage medium for adjusting basic three-primary-color data of display screen
CN112965968B (en) * 2021-03-04 2023-10-24 湖南大学 Heterogeneous data pattern matching method based on attention mechanism
CN112965968A (en) * 2021-03-04 2021-06-15 湖南大学 Attention mechanism-based heterogeneous data pattern matching method
CN115797340A (en) * 2023-02-03 2023-03-14 西南石油大学 Industrial surface defect detection method based on multi-instance learning

Similar Documents

Publication Publication Date Title
CN101706780A (en) Image semantic retrieving method based on visual attention model
Feng et al. Attention-driven salient edge (s) and region (s) extraction with application to CBIR
Ahmad Deep image retrieval using artificial neural network interpolation and indexing based on similarity measurement
Walia et al. Fusion framework for effective color image retrieval
Payne et al. Indoor vs. outdoor scene classification in digital photographs
Pietikäinen et al. View-based recognition of real-world textures
Papushoy et al. Image retrieval based on query by saliency content
Tian et al. Feature integration of EODH and Color-SIFT: Application to image retrieval based on codebook
CN108595636A (en) The image search method of cartographical sketching based on depth cross-module state correlation study
CN107679250A (en) A kind of multitask layered image search method based on depth own coding convolutional neural networks
Jiang et al. An effective method to detect and categorize digitized traditional Chinese paintings
Xie et al. Combination of dominant color descriptor and Hu moments in consistent zone for content based image retrieval
CN101551823A (en) Comprehensive multi-feature image retrieval method
Walia et al. An effective and fast hybrid framework for color image retrieval
Song et al. Taking advantage of multi-regions-based diagonal texture structure descriptor for image retrieval
Mishra et al. Medical image retrieval using self-organising map on texture features
Zhang et al. Local feature extracted by the improved bag of features method for person re-identification
Shaikh et al. A contemporary approach for object recognition based on spatial layout and low level features’ integration
Li et al. Performance comparison of saliency detection
Wang et al. Contextual dominant color name extraction for web image search
Huang et al. Automatic image annotation using multi-object identification
Devesh et al. Retrieval of monuments images through ACO optimization approach
Navickas et al. CLASS: Contemplative landscape automated scoring system
Al-Jubouri Multi Evidence Fusion Scheme for Content-Based Image Retrieval by Clustering Localised Colour and Texture
Cui et al. Textile image retrieval using joint local PCA-based feature descriptor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100512