CN111652309A - Visual word and phrase co-driven bag-of-words model picture classification method - Google Patents

Visual word and phrase co-driven bag-of-words model picture classification method

Info

Publication number: CN111652309A
Application number: CN202010478642.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: visual, picture, feature, word, words
Inventors: 刘秀萍, 李蕊男
Original assignee: Individual
Current assignee: Individual (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Application filed by: Individual
Priority claimed from: CN202010478642.8A

Classifications

    • G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 — Pattern recognition; clustering techniques; non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06F18/24 — Pattern recognition; classification techniques
    • G06V10/462 — Extraction of image or video features; descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; salient features, e.g. scale invariant feature transforms [SIFT]


Abstract

In the visual word and phrase co-driven bag-of-words model picture classification method, a picture is divided into a foreground region and a background region by visual saliency analysis; features are extracted from each region to build separate visual feature dictionaries, and the word histograms of the two parts are aggregated with appropriate weights to represent the picture, enriching the semantics of the feature descriptors. By introducing hierarchical cluster analysis, the clustering of visual words and phrases is organized as a tree structure, which avoids the series of problems caused by poor initial-center selection in mean clustering and better matches the whole-to-part analysis of a picture. An adaptive hypersphere soft-assignment method is proposed and the spatial arrangement of visual words and phrases in the picture is introduced, which cleverly removes the ambiguity of single words; the classification expressiveness is greatly improved, the picture classification is deeper, the accuracy higher, the classification faster, and the method more robust.

Description

Visual word and phrase co-driven bag-of-words model picture classification method
Technical Field
The invention relates to a visual picture classification method, in particular to a bag-of-words model picture classification method driven by visual words and phrases, and belongs to the technical field of picture classification.
Background
With the popularity of mobile social networking and the growing portability of intelligent terminal devices, the number of digital pictures has grown explosively. On internet platforms, gigabytes and even terabytes of pictures are generated and shared every day; a well-known photo-sharing social application alone shares more than 85.5 million pictures per day. From satellite remote-sensing pictures to medical microscopy pictures, from traffic management to video surveillance, enormous numbers of pictures and videos are generated constantly. Techniques for efficient picture content analysis, retrieval and classification are needed for such large picture libraries.
Picture classification can not only determine the category a picture belongs to but also provide the sub-content it contains, even its emotional expression, raising the understanding of pictures to the semantic level. The technology lets a computer see its surroundings and analyse and understand them as a person does, giving it automatic recognition capability and a wide range of applications, mainly including: first, search engines — picture and video retrieval systems at home and abroad currently have strong demand for it, e-commerce websites need picture classification to visually retrieve identical or similar products, and a digital library can use picture classification to find books with similar covers when searching book resources; second, photo album classification — with the spread of camera-equipped mobile devices, ordinary users accumulate thousands of pictures, and without classified management their albums become chaotic; third, picture analysis — in remote-sensing image analysis, classification enables early-warning monitoring and is also of great use for ocean resource development and mineral resource exploration, while diagnostic equipment in the medical field produces large numbers of pathological pictures and picture classification helps doctors efficiently find and access the pictures they need; fourth, machine vision — picture classification matters for route planning of driverless vehicles and robots, helping an intelligent vehicle or robot judge its environment, avoid obstacles and take the best route, and classification on unmanned aerial vehicles supports geological exploration, farmland damage surveys and the like.
Picture classification is a sub-problem of machine learning and pattern recognition; it draws on knowledge from pattern recognition, computer vision, statistics, biology and other disciplines, provides a real-world setting for joint research and concrete applications of these disciplines, and enriches progress in both theory and practice. Picture classification is therefore extremely important from both the application and the theoretical research perspective.
Judging from the research and application progress of the prior art, the general approach to picture classification is: first, represent the picture with picture features; second, quantize the picture content according to these features; third, design a suitable classifier for picture classification. According to the hierarchical expression of feature semantics, picture classification semantic models can be summarized into the following two types:
First, picture classification based on low-level semantic models. In early prior-art picture classification and retrieval, picture content was expressed directly by attributes such as colour, shape and texture of the picture: pictures were represented by vectors combining colour moments and edge directions and tourist pictures classified with a naive Bayes classifier; football video shots were classified by aggregating colour and texture features; or the picture was divided into sub-blocks, spatial colour features and texture features were computed from the sub-blocks, and a K-nearest-neighbour classifier was chosen after aggregating the two kinds of features. These methods map low-level features directly to high-level semantics; lacking an intermediate semantic transition, their generalization ability is unsatisfactory, their accuracy on samples outside the training set is low, and their practical value is very limited.
Second, picture classification based on mid-level semantic models, which draws on human visual cognition to realize the transition from low-level features to high-level semantics. According to the mid-level semantics used, it divides into two types: the first maps local features in the picture to local concepts of a semantic layer and achieves classification by building a mid-level semantic model for these features, absorbing the bag-of-words idea from text classification; the second designs global concepts for the picture content, such as roughness, ruggedness, openness and smoothness, according to human visual perception attributes — these global statistical characteristics are obtained by counting low-level features in a prescribed way, and because the rules are set subjectively, the generalization ability is clearly worse than that of the first type. The invention belongs to the first type of mid-level semantic model and classifies pictures using a feature bag of words and a support vector machine as the core structure.
Although prior-art picture classification methods have achieved certain results, they remain far from ideal intelligent classification, and the major difficulties are as follows. The first is interfering variation factors within a class: a well-performing picture classification method must overcome the interference of such factors as far as possible and find the most essential semantic features of the objects; properly introducing the information carried by these factors into the visual feature dictionary makes the discriminative power of visual words and phrases finer and gives the computer the discriminating ability of the human eye and brain. The second is the similarity of pictures between classes: distinguishing pictures of different classes with similar content is the greatest difficulty in picture classification; some pictures that differ in class can exhibit similar visual features, yet semantic understanding in reality does not allow the classification results to be confused, so feature extraction, semantic word generation and semantic discrimination in picture classification must all learn to recognize the various objects. To solve these problems, the constructed visual feature dictionary must consist of words with accurate discriminative power and real semantic expression.
In summary, in view of the shortcomings of the prior art, the present invention aims to solve the following problems:
First, in the prior-art generation of visual words and phrases, the K-means clustering algorithm partitions and clusters the distribution of low-level features; the similarity between feature points is determined by simple positional distribution and carries no accurate semantics. Precise matching of features cannot be achieved in clustering, and the discriminative power of the feature dictionary is very weak.
Second, in the prior art, hard assignment during feature mapping does not easily reflect the ambiguity of features, while soft assignment does not easily control the mapping range between features and visual words and phrases; the ambiguity of mapping features to visual words and phrases cannot be flexibly controlled, visual stop words contributing little to class relevance cannot be discarded, and the resulting feature dictionary has a poor recognition rate.
Third, the prior-art bag-of-words model ignores the semantic correlation between visual words and phrases when expressing picture content; the word set of a picture is unordered, so the model can neither express local features nor depict the contextual information among features, and an accurate classification effect cannot be obtained.
Fourth, the prior-art bag-of-words model represents pictures with SIFT vectors alone, so its picture features come from a single source and lack weight information for expressing different content regions, even though pictures usually express content through background and subject together. The prior art does not examine the structural factors of picture content expression in combination with a biological mechanism, and it obviously lacks the approach of splitting a picture into foreground and background, constructing a bag-of-words model for each, and then aggregating them for classification. That approach better matches the way the human brain views pictures, and dynamically assigning different weights to the features of different parts makes the generated visual feature dictionary targeted.
Fifth, the prior art cannot avoid the series of problems caused by poor selection of the initial centers in mean clustering, and the whole-to-part analysis of the picture is inappropriate. In the assignment stage, the problems of ambiguous mapping of visual words and phrases in the feature dictionary and of redundancy caused by visual words and phrases containing useless information remain unsolved; the classification results obtained by hard assignment are not ideal, and picture classification is very slow.
Sixth, the prior art does not integrate the spatial structure information and contextual semantics of visual words and phrases, cannot remove the ambiguity of single words, and cannot select the expression weights of the two parts according to the spatial structure of the picture; the classification expressiveness of the model is low, the picture classification depth small, the accuracy low, the classification speed slow, and the robustness of the method poor.
Disclosure of Invention
Aiming at the defects of the prior art, the invention divides the picture into a foreground region and a background region by visual saliency analysis, extracts features from each region to build separate visual feature dictionaries, and aggregates the word histograms of the two parts with appropriate weights to represent the picture, thereby enriching the semantics of the feature descriptors. By introducing hierarchical cluster analysis, the clustering of visual words and phrases is cleverly organized as a tree structure, which better avoids the series of problems caused by poor initial-center selection in mean clustering and better matches the whole-to-part analysis of the picture. In the assignment stage, to solve the redundancy caused by ambiguous mapping of visual words and phrases and by visual words and phrases containing useless information in the feature dictionary, a certain number of visual stop words are removed, the spatial relation between picture features and visual word and phrase centers is considered comprehensively, and an adaptive hypersphere soft-assignment method is proposed; the spatial arrangement of visual words and phrases in the picture is introduced so that the words forming a phrase provide each other with contextual information on the picture, cleverly removing the ambiguity of single words. The classification expressiveness of the model is greatly improved, the picture classification is deeper, the accuracy higher, the classification faster, and the method more robust.
In order to achieve the technical effects, the technical scheme adopted by the invention is as follows:
a visual word and phrase co-driven bag-of-words model picture classification method comprises the steps of regarding a picture as an element set, enabling elements in the element set to be discrete visual word and phrase combinations, respectively counting probabilities of different visual words and phrases appearing in the set to obtain corresponding frequency histogram vectors, enabling the frequency histogram vectors to be equivalent representations of the picture in a bag-of-words model angle, and finally introducing the frequency histogram vectors into a classifier for training and classification; the method comprises the following specific steps:
Firstly, foreground-and-background aggregated picture feature extraction: the foreground-and-background aggregated picture feature extraction and expression method is based on the human visual attention mechanism and divides a picture into a visually salient region and a non-salient region; the salient region is the foreground and the non-salient region is the background, the foreground containing the prominent content of the picture and the background its environmental factors;
secondly, aggregated expression of the visual feature bag of words: the multidimensional space vectors are clustered by a clustering algorithm; each cluster center is an independent visual word or phrase, and their combination forms the visual feature dictionary used for subsequent feature mapping and lookup;
thirdly, generation and mapping of visual words and phrases: the picture features are assigned to the corresponding words and phrases of the visual feature dictionary; the visual word or phrase closest to each feature vector is found in the vector space and the feature is assigned to that word, so that each picture is represented as a K-dimensional word-and-phrase vector, K being the previously set number of cluster centers;
fourthly, classifier training and classification: the obtained K-dimensional vectors are used as the input of a classifier, which is trained and then used for picture classification (a minimal pipeline sketch is given below).
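For orientation only, the four steps above can be prototyped with common open-source tools. The sketch below is a minimal, hypothetical illustration of the generic bag-of-words pipeline (OpenCV SIFT descriptors, scikit-learn K-means and an SVM classifier are assumed); it does not implement the invention's foreground/background split, hierarchical clustering or soft assignment, which are described later.

```python
# Minimal bag-of-words picture classification sketch (assumes OpenCV and scikit-learn).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def extract_sift(gray_img):
    """Extract SIFT descriptors from a grayscale picture."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_dictionary(train_imgs, k=200):
    """Cluster all training descriptors into K visual words (cluster centers)."""
    all_desc = np.vstack([extract_sift(img) for img in train_imgs])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

def word_histogram(img, kmeans):
    """Map each feature to its nearest visual word and count word frequencies."""
    desc = extract_sift(img)
    hist = np.zeros(kmeans.cluster_centers_.shape[0], np.float32)
    if len(desc):
        for w in kmeans.predict(desc):
            hist[w] += 1
        hist /= hist.sum()  # normalized K-dimensional frequency histogram
    return hist

def train_classifier(train_imgs, labels, kmeans):
    """Train the classifier on the K-dimensional histogram vectors (step four)."""
    X = np.array([word_histogram(img, kmeans) for img in train_imgs])
    return SVC(kernel="rbf").fit(X, labels)
```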
The visual word and phrase co-driven bag-of-words model picture classification method further comprises, in the first step, a picture feature extraction method based on the visual attention mechanism for the foreground-and-background aggregated feature extraction, in which the visually salient region is extracted as follows: first, a 9-level Gaussian pyramid of the picture is built from the three dimensions of orientation, colour and brightness; second, orientation, colour and brightness features are extracted from each pyramid level to form a feature pyramid; third, scale-by-scale differences are taken in the multi-scale space to obtain a feature distribution map centred on the salient target; fourth, a Markov chain of the two-dimensional picture is built with a Markov random field to obtain the final visually salient region map of the picture.
The visual word and phrase co-driven bag-of-words model picture classification method further comprises, in the second step, the aggregated expression of the visual feature bag of words, which aggregates foreground features and background features to express the picture content: the visual feature dictionary is divided into a foreground feature dictionary generated from foreground SIFT features and a background feature dictionary generated from dense background SIFT features, and the histograms obtained by mapping onto the two dictionaries are weighted and aggregated for the final classification decision. The procedure specifically comprises: dense SIFT descriptor sampling, foreground feature dictionary generation, background feature dictionary generation, and aggregated feature generation.
In the visual word and phrase co-driven bag-of-words model picture classification method, the dense SIFT descriptors are further sampled uniformly: a pixel spacing is set to control the sampling density and features are extracted from the picture window by window;
after feature points are extracted at fixed intervals, the same scale C is set for all of them; the picture is aligned to 0° horizontally, a circle is drawn around each feature point with the set scale C as radius, the pixels falling inside the circle are evenly divided into 4×4 non-overlapping sub-regions, the angular coordinates within each sub-region are divided at 45° intervals, the orientation histogram of every sub-region is then counted, and the generated feature descriptor is represented by a 128-dimensional vector;
since dense SIFT extracts feature points uniformly, multi-scale extraction is used to recover scale invariance, with large scales expressing the overall outline of the picture and small scales capturing local details.
In the visual word and phrase co-driven bag-of-words model picture classification method, the foreground feature dictionary is further generated as follows:
Step 1: extract SIFT features from the foreground region of the picture and obtain the visual feature dictionary of the foreground by the clustering method, denoted A_q;
Step 2: extract SIFT descriptors from the foreground content of the picture to be classified and denote the set of all generated SIFT features B_q;
Step 3: map every feature point in B_q to its closest word in A_q by hard assignment; after mapping, each picture has a corresponding set of visual words and phrases;
Step 4: count the occurrences of each visual word and phrase in every picture to obtain the corresponding frequency histogram, denoted D_q; no normalization is performed.
In the visual word and phrase co-driven bag-of-words model picture classification method, dense SIFT further uses grid division: the content is divided into I×J grid sub-blocks and each sub-block is expressed with SIFT. The background feature dictionary is generated as follows:
Step one: divide the background region of every picture into an I×J grid to obtain background sub-regions;
Step two: extract the dense SIFT descriptor set from the background sub-regions and cluster it with the K-means clustering algorithm, setting L2 cluster centers denoted S_1, S_2, S_3, …, S_L2; the set of all centers is the visual feature dictionary of the background content for SIFT, denoted A_h;
Step three: partition the background content of the picture to be classified into grid blocks and extract SIFT descriptors from the sub-blocks, denoted B_h;
Step four: in the feature quantization of B_h, map the features to the corresponding words in A_h according to the mapping method;
Step five: count the occurrences of each visual word and phrase in every picture to obtain the corresponding frequency histogram, denoted D_h; no normalization is performed.
Further, when features are aggregated, the foreground region accounts for a larger share of the expression of the whole picture content; when the features of the foreground and background regions are aggregated, the weight assignment emphasizes the foreground region and relatively weakens the background weight. The feature aggregation weight assignment function is given by Equation 1:
E_q = C × F^a    (Equation 1)
where C is a scaling parameter and a is an adjustment parameter, which jointly adjust the weight assignment; F is the foreground feature ratio coefficient, with C = 1 and a = 0.4. F is given by Equation 2 (as implied by the definitions below; the original formula image is not reproduced):
F = J_q / (J_q + J_h)    (Equation 2)
where J_q is the number of SIFT feature points detected in the foreground region and J_h is the number of dense SIFT feature points in the background region, so 0 < F < 1. E_q, the weight of the foreground-region SIFT features in the aggregated features, is raised by the power-function adjustment of this assignment function.
The foreground weight E_q is determined by the number of foreground features: when the number of foreground feature points approaches zero the number of background feature points is relatively close to infinity and E_q tends to zero; as the number of foreground feature points grows and the number of background feature points falls, F gradually increases towards 1. Let E_h be the background weight, then E_h = 1 − E_q, with E_h in the range [0, 1]. The final aggregated feature expression is given by Equation 3:
D_qh = (E_q × D_q) >> (E_h × D_h)    (Equation 3)
where >> is the vector concatenation symbol that aggregates the foreground and background word representation vectors. Because the two vectors differ in length, normalization is required; the normalized aggregated vector is given by Equation 4 (formula image in the original filing).
The feature bag-of-words aggregation combines the foreground and background parts through a flexible weight assignment method and represents the picture jointly with visual words and phrases.
In the visual word and phrase co-driven bag-of-words model picture classification method, in the third step, the generation and mapping of visual words and phrases, the specific process of visual word and phrase hierarchical clustering is as follows:
step 1, setting features extracted from i picture samples for visual word and phrase clustering, wherein the dimension of each feature is j dimension, and the i j dimension vectors form a j row and i column matrix;
step 2, calculating the Euclidean distance between the feature points, finding out the corresponding feature point when the distance is minimum, and combining the feature points to form a cluster;
step 3, calculating the average value of the two characteristic points in the step 2 to form a new cluster center;
step 4, repeating step 2 and step 3 until the preset hierarchical-analysis proportion E of the whole algorithm is satisfied, ending the hierarchical clustering, and taking the obtained cluster centers as the initial visual words and phrases;
step 5, for the characteristic points which are not calculated in the step 2, calculating the distance between the characteristic points and each visual word phrase, finding out the minimum distance, and assigning the characteristic points to the cluster closest to the characteristic points;
step 6, updating the center of each cluster and the number of characteristic points in the cluster, and updating the clustering characteristic value of the cluster;
step 7, repeating step 5 and step 6 until the data within the clusters converge and no longer change;
and 8, finishing hierarchical analysis to obtain initial clustering information including the number of clusters and the central position of the visual word phrase, and refining on the basis of hierarchical clustering by using a K-means clustering algorithm.
The visual word and phrase co-driven bag-of-words model picture classification method provided by the invention further provides a fast adaptive soft-assignment coding method based on the hypersphere model: a chi-square model is introduced to identify the relevance between each visual word and phrase and the classification result, visual words and phrases with low relevance are removed, feature points are adaptively assigned to several neighbouring cluster centers, the addition of new redundant information is avoided, and features are mapped to visual words and phrases through appropriate semantic-level association;
in the soft-assignment method, the mapping between each local feature and the words of the feature dictionary is configured flexibly, and different coding coefficients express the correlation between features and words;
the soft-assignment method maps each SIFT feature point to the J visual words and phrases closest to it, as shown in Equation 5 (formula image in the original filing); the formula assigns the feature point m_b to the visual words and phrases s_a nearest to it, with L the total number of feature points.
With the soft-assignment method, the visual word and phrase frequency histogram is transformed into Equation 6 (formula image in the original filing), whose weighting factor is based on sim(m_b, s), the similarity between the feature point m_b and the visual word or phrase s; the parameters are estimated with a Gaussian distribution, as shown in Equation 7 (formula image in the original filing), where d is the variance of the Gaussian distribution function.
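The following sketch gives one concrete reading of Equations 5–7: each feature is assigned to its J nearest visual words with Gaussian similarity weights that are normalized over those J neighbours. The exact form of the Gaussian (exp(−dist²/2d)) and the values of J and d are assumptions for illustration, since the original formula images are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def soft_assign_histogram(features, centers, j_nearest=5, d=0.1):
    """Soft-assign each feature to its J nearest visual words (Eqs. 5-7, assumed form).

    features: (L, 128) SIFT descriptors of one picture
    centers : (K, 128) visual word centers
    Returns a K-dimensional weighted word histogram.
    """
    k = centers.shape[0]
    hist = np.zeros(k)
    if len(features) == 0:
        return hist
    dist = cdist(features, centers)                   # (L, K) Euclidean distances
    for row in dist:
        nearest = np.argsort(row)[:j_nearest]         # the J closest visual words
        sim = np.exp(-row[nearest] ** 2 / (2.0 * d))  # Gaussian similarity, assumed form
        hist[nearest] += sim / sim.sum()              # normalized weighting factors
    return hist / len(features)
```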
In the visual word and phrase co-driven bag-of-words model picture classification method, the filtering of useless stop words in text classification is borrowed for picture classification: some noisy visual words and phrases are regarded as visual stop words, and assigning features to such words is useless. The invention introduces a chi-square model to mine and analyse the visual feature dictionary, finds the visual stop words least correlated with the classification categories, removes them and optimizes the dictionary to obtain a word set with a higher recognition rate and improved assignment efficiency;
the chi-square model analyses the independence of two random variables in data modelling; in picture classification, each visual word or phrase and the classification category are taken as the two random variables and fed into the chi-square model to evaluate their degree of correlation: a larger chi-square value means the corresponding word contributes more to the classification result, and a smaller chi-square value means it contributes less; removing visual stop words from the high-dimensional visual feature dictionary also has the effect of reducing its dimensionality;
the chi-square model estimation formula for the feature dictionary optimization of picture classification is Equation 8 (formula image in the original filing), where R is the total number of pictures in the training library, j_{+f} is the total number of pictures in class S_f, j_{1f} is the number of pictures in class S_f containing the word s_b, j_{2f} is the number of pictures in class S_f not containing the word s_b, j_{1+} is the number of pictures in the training library containing the word s_b, j_{2+} is the number of pictures in the training library not containing the word s_b, and L is the total number of feature points;
the chi-square value of each visual word and phrase is computed according to Equation 8 and the values are sorted; a threshold number H of visual stop words to remove is set, the useless words corresponding to the H smallest chi-square values are removed to obtain a new visual feature dictionary, and feature mapping and assignment then follow the new feature dictionary.
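A sketch of this screening is shown below, using scikit-learn's chi2 feature-selection statistic over the word-occurrence histograms as an assumed stand-in for the contingency-table form of Equation 8; the H words with the smallest chi-square scores are treated as visual stop words and dropped.

```python
import numpy as np
from sklearn.feature_selection import chi2

def prune_visual_stopwords(histograms, labels, h=50):
    """Keep the visual words most correlated with the class labels.

    histograms: (N, K) non-negative word-frequency histograms of training pictures
    labels    : (N,)   class labels
    Returns the indices of the retained visual words.
    """
    scores, _ = chi2(np.asarray(histograms), np.asarray(labels))
    order = np.argsort(scores)         # ascending: smallest chi-square first
    removed = set(order[:h].tolist())  # the H visual "stop words"
    return [i for i in range(len(scores)) if i not in removed]

# Usage (hypothetical): keep = prune_visual_stopwords(train_hists, train_labels, h=50)
#                       reduced_hists = train_hists[:, keep]
```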
Compared with the prior art, the advantages and innovations of the invention are as follows:
the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention aims at the problem that in the generation process of the visual words and phrases in the prior art, a K-means clustering algorithm is obtained by segmenting and clustering the distribution of bottom-layer features, the similarity of the feature points and the feature points is determined according to the simple position distribution before the feature points and the feature points, and the similarity is not accurate in semanteme. The method is characterized in that hierarchical clustering analysis is introduced in clustering, picture contents are regarded as small blocks of contents and combined layer by layer, a tree-shaped visual feature dictionary is constructed, the matching process of features from coarse to fine is realized, position information among the content blocks is aggregated, and the distinguishing capability of the feature dictionary is enhanced.
Second, the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention addresses the prior-art difficulties that hard assignment does not easily reflect the ambiguity of features during feature mapping and soft assignment does not easily control the mapping range between features and visual words and phrases. Adaptive hypersphere soft assignment is proposed to flexibly preserve the ambiguity of mapping features to visual words and phrases; combining the notion of function words in text grammar, a chi-square model is introduced to discard visual stop words that contribute little to class relevance, reducing the size of the word set and obtaining a feature dictionary with a better recognition rate.
Third, the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention addresses the prior-art bag-of-words model's neglect of the semantic correlation between visual words and phrases when expressing picture content. To remedy the lack of semantic relevance, association rules are used in the unordered word set of a picture to mine the co-occurrence of visual words and phrases, words with high co-occurrence are combined into visual phrases, and the picture content is finally represented by both words and phrases. The improved bag-of-words model can express local features and also depict the contextual information among features, so a more accurate classification effect is obtained.
Fourth, the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention addresses the obvious defects of the prior-art bag-of-words model — representing pictures with SIFT vectors alone, drawing picture features from a single source, and lacking weight information to express different content regions — by proposing the aggregation of background and foreground features in the feature extraction and expression stage. The picture is divided into foreground and background regions by visual saliency analysis; combining the visual attention mechanism in biology, features are extracted from each region to build separate visual feature dictionaries, and the word histograms of the two parts are aggregated with weights to represent the picture, enriching the semantics of the feature descriptors, matching the way the human brain views pictures, and dynamically assigning different weights to the features of different parts so that the generated visual feature dictionary is targeted. Experimental results show that this method classifies pictures whose content comprises both foreground and background more effectively.
Fifth, the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention cleverly turns the clustering of visual words and phrases into a tree structure by introducing hierarchical cluster analysis, better avoiding the series of problems caused by poor initial-center selection in mean clustering and better matching the whole-to-part analysis of a picture. In the assignment stage, to solve the redundancy caused by ambiguous mapping of visual words and phrases and by visual words and phrases containing useless information in the feature dictionary, a certain number of visual stop words are removed, the spatial relation between picture features and the visual word and phrase centers is considered comprehensively, and the adaptive hypersphere soft-assignment method is proposed. Compared with the prior-art mean clustering and hard assignment, both parts obtain more ideal classification results and the picture classification speed is clearly faster.
Sixth, the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention benefits classification performance by integrating the spatial structure information and contextual semantics of visual words and phrases. On the basis of absorbing and fusing this improvement idea, the spatial arrangement of visual words and phrases in the picture is introduced through association rules, the bag-of-words features are presented with words and phrases simultaneously, and the words forming a phrase provide each other with contextual information on the picture, cleverly eliminating the ambiguity of single words. The improved bag-of-words model expresses a picture with visual words and visual phrases together, and the expression weights of the two parts are chosen according to the spatial structure of the picture. Experimental results show that the improved method greatly improves the classification expressiveness of the bag-of-words model, with deeper picture classification, higher accuracy, faster classification, and better robustness.
Drawings
FIG. 1 is a schematic flow diagram of feature bag aggregation in accordance with the present invention.
Fig. 2 is a schematic diagram of dense SIFT picture feature extraction according to the present invention.
FIG. 3 is a diagram illustrating a feature aggregation weight assignment function according to the present invention.
FIG. 4 is a schematic diagram of hypersphere soft-assignment sample allocation according to the present invention.
FIG. 5 is a diagram of the search neighborhood for visual phrases in accordance with the present invention.
Detailed Description
The technical scheme of the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention is further described with reference to the drawings, so that those skilled in the art can better understand and implement it.
In the visual word and phrase co-driven bag-of-words model picture classification method provided by the invention, a picture is regarded as a set of elements, but the elements are no longer pixels: they are discrete combinations of visual words and phrases. The occurrence probabilities of the different visual words and phrases in the set are then counted to obtain the corresponding frequency histogram vector, which is the equivalent representation of the picture from the bag-of-words perspective, and finally the frequency histogram vector is fed into a classifier for training and classification.
The method comprises the following specific steps:
Firstly, foreground-and-background aggregated picture feature extraction: feature extraction is the basic operation of the picture classification process and the starting point of the subsequent method, so the success or failure of a picture algorithm or application is often strongly tied to the initially selected features. The foreground-and-background aggregated picture feature extraction and expression method is based on the human visual attention mechanism and divides a picture into a visually salient region and a non-salient region; the salient region is the foreground and the non-salient region is the background, the foreground containing the prominent content of the picture and the background its environmental factors;
secondly, aggregated expression of the visual feature bag of words: the multidimensional space vectors are clustered by a clustering algorithm; each cluster center is an independent visual word or phrase, and their combination forms the visual feature dictionary used for subsequent feature mapping and lookup;
thirdly, generation and mapping of visual words and phrases: the picture features are assigned to the corresponding words and phrases of the visual feature dictionary; the visual word or phrase closest to each feature vector is found in the vector space and the feature is assigned to that word, so that each picture is represented as a K-dimensional word-and-phrase vector, K being the previously set number of cluster centers;
fourthly, classifier training and classification: the obtained K-dimensional vectors are used as the input of a classifier, which is trained and then used for picture classification.
First, image feature extraction of foreground and background aggregation
In the prior-art bag-of-words model, picture feature extraction extracts a single kind of feature from the whole picture. Such hand-designed single features are imperfect: the visual words and phrases representing the target object and those representing the background are mixed in the same bag of words representing the picture, and the representation, formed by combining foreground content and background, neglects the structure of the picture content. The invention proposes a foreground-and-background aggregated picture feature extraction and expression method for accurate picture classification. It expands and improves the feature extraction process of the prior-art bag-of-words model: based on the human visual attention mechanism, the picture is divided into a visually salient region and a non-salient region, the salient region being the foreground and the non-salient region the background, with the foreground containing the prominent content of the picture and the background its environmental factors.
Picture feature extraction based on visual attention mechanism
When browsing a scene, the human visual system quickly locates targets of interest and ignores unimportant areas; the regions that attract the eye are the salient regions. The salient region helps people quickly understand the picture's main subject and separate primary from secondary content, after which different processing methods are applied to the different regions.
The visually salient region is extracted as follows: first, a 9-level Gaussian pyramid of the picture is built from the three dimensions of orientation, colour and brightness; second, orientation, colour and brightness features are extracted from each pyramid level to form a feature pyramid; third, scale-by-scale differences are taken in the multi-scale space to obtain a feature distribution map centred on the salient target; fourth, a Markov chain of the two-dimensional picture is built with a Markov random field to obtain the final visually salient region map of the picture.
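The Itti-style pyramid and Markov-chain map above are not reproduced here; as a stand-in sketch, the fragment below uses OpenCV's spectral-residual static saliency (available in opencv-contrib) and a simple threshold to obtain approximate foreground/background masks. The threshold value and the choice of saliency operator are assumptions for illustration only.

```python
import cv2
import numpy as np

def split_foreground_background(bgr_img, thresh=0.5):
    """Approximate split into a visually salient (foreground) and non-salient
    (background) region, as a stand-in for the pyramid + Markov-chain map."""
    sal = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, sal_map = sal.computeSaliency(bgr_img)
    if not ok:
        raise RuntimeError("saliency computation failed")
    sal_map = (sal_map - sal_map.min()) / (sal_map.max() - sal_map.min() + 1e-9)
    foreground = sal_map >= thresh   # salient region
    return foreground, ~foreground   # boolean masks
```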
Second, aggregated expression of the visual feature bag of words
The visually salient region extracted by this method is the foreground region and the remaining part is the background region. The foreground-and-background aggregated feature extraction and expression method not only extracts effective picture information from the foreground region but also extracts dense SIFT features from the background region to assist in expressing the picture; especially when the foreground region's features are not rich and generate little feature information, the dense SIFT features of the background can provide a large amount of feature information. The invention proposes aggregating foreground features and background features to express the picture content: the visual feature dictionary is divided into a foreground feature dictionary generated from foreground SIFT features and a background feature dictionary generated from dense background SIFT features, and the histograms obtained by mapping onto the two dictionaries are weighted and aggregated for the final classification decision. The procedure specifically comprises: dense SIFT descriptor sampling, foreground feature dictionary generation, background feature dictionary generation, and aggregated feature generation.
The process of feature bag aggregation is shown in fig. 1.
(I) Dense SIFT descriptor sampling
The dense SIFT descriptors are sampled uniformly: a pixel spacing is set to control the sampling density and features are extracted from the picture window by window. The extraction process becomes relatively simple, denser feature points are obtained, and richer picture information is retained.
After feature points are extracted at fixed intervals, the same scale C is set for all of them; to preserve the rotation invariance of the SIFT descriptor, the picture is aligned to 0° horizontally, a circle is drawn around each feature point with the set scale C as radius, the pixels falling inside the circle are evenly divided into 4×4 non-overlapping sub-regions, the angular coordinates within each sub-region are divided at 45° intervals, the orientation histogram of every sub-region is then counted, and the generated feature descriptor is represented by a 128-dimensional vector.
Because dense SIFT extracts feature points uniformly, scale invariance is lost; multi-scale extraction is therefore used to recover it, with large scales expressing the overall outline of the picture and small scales capturing local details. A schematic of dense SIFT picture feature extraction is shown in FIG. 2: the left picture shows uniform-interval sampling at a single scale and the right picture shows multi-scale sampling at four scales.
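A minimal reading of the dense sampling described above is sketched below: keypoints are placed on a uniform pixel grid at several fixed scales and 128-dimensional SIFT descriptors are computed at those locations with OpenCV. The step size and the four scale values are illustrative assumptions, not values prescribed by the text.

```python
import cv2
import numpy as np

def dense_sift(gray_img, step=8, scales=(8, 16, 24, 32)):
    """Uniform-grid ("dense") SIFT sampling at several fixed scales.

    step   : pixel spacing between sampling points (controls density)
    scales : keypoint diameters; several scales compensate for the scale
             invariance lost by uniform sampling
    Returns an (N, 128) descriptor array.
    """
    sift = cv2.SIFT_create()
    h, w = gray_img.shape[:2]
    keypoints = [cv2.KeyPoint(float(x), float(y), float(s))
                 for s in scales
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    _, desc = sift.compute(gray_img, keypoints)
    return desc if desc is not None else np.empty((0, 128), np.float32)
```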
(II) Foreground feature dictionary Generation
The specific steps of foreground feature dictionary generation are as follows (a compact sketch is given after the steps):
Step 1: extract SIFT features from the foreground region of the picture and obtain the visual feature dictionary of the foreground by the clustering method, denoted A_q;
Step 2: extract SIFT descriptors from the foreground content of the picture to be classified and denote the set of all generated SIFT features B_q;
Step 3: map every feature point in B_q to its closest word in A_q by hard assignment; after mapping, each picture has a corresponding set of visual words and phrases;
Step 4: count the occurrences of each visual word and phrase in every picture to obtain the corresponding frequency histogram, denoted D_q; no normalization is performed.
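A compact sketch of steps 1–4 follows, using plain K-means as the clustering method (the hierarchical initialization described later is omitted here); the names A_q, B_q and D_q follow the text, while the dictionary size L1 is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_foreground_dictionary(fg_descriptor_sets, l1=200):
    """Step 1: cluster all foreground SIFT features into the dictionary A_q."""
    all_desc = np.vstack(fg_descriptor_sets)
    return KMeans(n_clusters=l1, n_init=10, random_state=0).fit(all_desc)

def foreground_histogram(b_q, a_q):
    """Steps 2-4: hard-assign the picture's foreground features B_q to the
    nearest words of A_q and count word occurrences (D_q, not normalized)."""
    d_q = np.zeros(a_q.cluster_centers_.shape[0])
    if len(b_q):
        for word in a_q.predict(b_q):
            d_q[word] += 1
    return d_q
```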
(III) background feature dictionary Generation
Dense SIFT is fast to generate and can effectively express content blocks with flat grey-level variation, so it is well suited to expressing the background and yields good results in picture classification.
Dense SIFT uses grid division: the content is divided into I×J grid sub-blocks and each sub-block is expressed with SIFT. The background feature dictionary is generated as follows (a sketch of the grid division follows the steps):
Step one: divide the background region of every picture into an I×J grid to obtain background sub-regions;
Step two: extract the dense SIFT descriptor set from the background sub-regions and cluster it with the K-means clustering algorithm, setting L2 cluster centers denoted S_1, S_2, S_3, …, S_L2; the set of all centers is the visual feature dictionary of the background content for SIFT, denoted A_h;
Step three: partition the background content of the picture to be classified into grid blocks and extract SIFT descriptors from the sub-blocks, denoted B_h;
Step four: in the feature quantization of B_h, map the features to the corresponding words in A_h according to the mapping method;
Step five: count the occurrences of each visual word and phrase in every picture to obtain the corresponding frequency histogram, denoted D_h; no normalization is performed.
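The only part of steps one to five not already covered by the foreground sketch is the I×J grid division of the background region; a minimal version is given below, with the grid dimensions as assumed example values.

```python
import numpy as np

def grid_blocks(background_region, i_rows=4, j_cols=4):
    """Step one: divide a background region into an I x J grid of sub-blocks."""
    h, w = background_region.shape[:2]
    ys = np.linspace(0, h, i_rows + 1, dtype=int)
    xs = np.linspace(0, w, j_cols + 1, dtype=int)
    return [background_region[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
            for r in range(i_rows) for c in range(j_cols)]

# Steps two to five then mirror the foreground case: dense SIFT descriptors are
# extracted from every sub-block, clustered into L2 centers S_1 ... S_L2 (the
# dictionary A_h), and each picture's block descriptors B_h are hard-assigned
# to A_h to give the unnormalized background histogram D_h.
```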
(IV) Aggregated feature generation
The foreground region accounts for a larger share of the expression of the whole picture content, so when the features of the foreground and background regions are aggregated the weight assignment emphasizes the foreground region and relatively weakens the background weight. The feature aggregation weight assignment function is shown in FIG. 3 and Equation 1:
E_q = C × F^a    (Equation 1)
where C is a scaling parameter and a is an adjustment parameter, which jointly adjust the weight assignment; F is the foreground feature ratio coefficient, with C = 1 and a = 0.4. F is given by Equation 2 (as implied by the definitions below; the original formula image is not reproduced):
F = J_q / (J_q + J_h)    (Equation 2)
where J_q is the number of SIFT feature points detected in the foreground region and J_h is the number of dense SIFT feature points in the background region, so 0 < F < 1. E_q, the weight of the foreground-region SIFT features in the aggregated features, is raised by the power-function adjustment of this assignment function.
The foreground weight E_q is determined by the number of foreground features: when the number of foreground feature points approaches zero the number of background feature points is relatively close to infinity and E_q tends to zero; as the number of foreground feature points grows and the number of background feature points falls, F gradually increases towards 1. Let E_h be the background weight, then E_h = 1 − E_q, with E_h in the range [0, 1]. The final aggregated feature expression is given by Equation 3:
D_qh = (E_q × D_q) >> (E_h × D_h)    (Equation 3)
where >> is the vector concatenation symbol that aggregates the foreground and background word representation vectors. Because the two vectors differ in length, normalization is required; the normalized aggregated vector is given by Equation 4 (formula image in the original filing).
Feature bag-of-words aggregation improves the feature extraction process of the bag-of-words model: on the basis of prior-art SIFT feature extraction, the foreground and background regions of the picture are obtained by visual saliency analysis, SIFT features are extracted from the foreground region to construct the foreground visual feature dictionary, dense SIFT features are extracted from the background region to construct the background visual feature dictionary, the two are aggregated by a flexible weight assignment method, and the picture is represented jointly with visual words and phrases. Experimental analysis shows that, for pictures whose content is determined jointly by foreground and background, the improved aggregated-feature method has a clear advantage in classification effect over the prior-art single SIFT feature extraction.
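As a worked reading of Equations 1–4 under the stated parameters C = 1 and a = 0.4, the sketch below computes the foreground and background weights and concatenates the weighted histograms; the ratio F = J_q / (J_q + J_h) and the final L2 normalization are assumptions consistent with the definitions above, not formulas quoted from the original images.

```python
import numpy as np

def aggregate_histograms(d_q, d_h, j_q, j_h, c=1.0, a=0.4):
    """Weighted aggregation of the foreground (D_q) and background (D_h) histograms.

    Assumes F = J_q / (J_q + J_h) (Eq. 2), E_q = C * F**a (Eq. 1), E_h = 1 - E_q,
    and that the final picture vector is the L2-normalized concatenation of the
    weighted histograms (Eqs. 3-4).
    """
    f = j_q / float(j_q + j_h)                 # foreground feature ratio, 0 < F < 1
    e_q = c * f ** a                           # foreground weight
    e_h = 1.0 - e_q                            # background weight
    d_qh = np.concatenate([e_q * np.asarray(d_q), e_h * np.asarray(d_h)])
    norm = np.linalg.norm(d_qh)
    return d_qh / norm if norm > 0 else d_qh

# Example: with J_q = 150 foreground and J_h = 350 background feature points,
# F = 0.3 and E_q = 0.3 ** 0.4 ≈ 0.62, so the foreground histogram is emphasized.
```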
Third, visual word phrase generation mapping
(I) Hierarchical clustering for picture classification
(1) Deficiency of K-means clustering algorithm
Most prior-art bag-of-words models use the K-means clustering algorithm to cluster the extracted SIFT feature set and generate visual words and phrases. In K-means, however, all sample points are clustered in a mutually exclusive way: among the picture feature points, a feature point can belong to only one visual word or phrase cluster and cannot be assigned to any other cluster. If the feature point of some region of the picture is similar to the features of its surrounding regions, the semantic information it contains cannot be expressed in the model. The clustering rule is single and rigid, it ignores whether the mapping between picture features and visual words and phrases is correct, and important semantic information is lost.
Besides neglecting semantic relationships, the K-means clustering algorithm has the following shortcomings, which also cause problems in the generation of visual words and phrases:
First, initial visual word and phrase selection: the centers of the K initial visual words and phrases are chosen randomly among the picture sample features; once the initial distribution is poor, important local features are lost in the iteration process, the algorithm falls into local-minimum convergence, and an unbalanced clustering result is obtained;
Second, handling of isolated feature points: isolated sample points in the K-means clustering algorithm reduce the clustering effect and shift the originally correct word and phrase cluster centers;
Third, very high computational complexity: because K-means is an iterative optimization process, all sample features take part in the computation when the visual words and phrases are determined, so the computational complexity is very high;
Fourth, K-means does not support expansion: when new samples are introduced, clustering must be redone in sample iteration; there is no dimensionality-reduction step, so the curse of dimensionality easily arises in a high-dimensional sample space.
(2) Visual word phrase hierarchical clustering
The visual word phrase hierarchical clustering specifically comprises the following steps:
step 1, setting the features extracted from i picture samples for visual word and phrase clustering, wherein each feature has dimension j, and the i j-dimensional vectors form a matrix with j rows and i columns;
step 2, calculating the Euclidean distance between the feature points, finding out the corresponding feature point when the distance is minimum, and combining the feature points to form a cluster;
step 3, calculating the average value of the two characteristic points in the step 2 to form a new cluster center;
step 4, repeating step 2 and step 3 until the preset hierarchical-analysis proportion E of the whole algorithm is satisfied, ending the hierarchical clustering, and taking the obtained cluster centers as the initial visual word phrases;
step 5, for the characteristic points which are not calculated in the step 2, calculating the distance between the characteristic points and each visual word phrase, finding out the minimum distance, and assigning the characteristic points to the cluster closest to the characteristic points;
step 6, updating the center of each cluster and the number of characteristic points in the cluster, and updating the clustering characteristic value of the cluster;
step 7, repeating the step 5 and the step 6 until the data convergence in the cluster is unchanged;
and step 8, finishing the hierarchical analysis to obtain the initial clustering information, including the number of clusters and the center positions of the visual word phrases, and refining on the basis of the hierarchical clustering with the K-means clustering algorithm.
Combining hierarchical clustering analysis with K-means clustering lets the hierarchical analysis extract approximate classification information that K-means clustering then refines, while avoiding the deficiencies of the prior-art K-means clustering algorithm, such as local-minimum convergence caused by a poor choice of initial cluster centers, the need to set the number of clusters by experience, and tedious iterative computation.
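A minimal sketch of this combination, assuming SciPy and scikit-learn are available and using an average-linkage tree cut as a stand-in for the preset hierarchical-analysis proportion E (names and the cut rule are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hierarchical_then_kmeans(features, merge_ratio=0.1):
    """Illustrative sketch: coarse agglomerative clustering supplies the initial
    visual word phrase centers, which K-means then refines."""
    features = np.asarray(features, dtype=float)
    Z = linkage(features, method="average", metric="euclidean")
    k = max(2, int(len(features) * merge_ratio))          # coarse number of clusters (assumed rule)
    labels = fcluster(Z, t=k, criterion="maxclust")
    centers = np.array([features[labels == c].mean(axis=0) for c in np.unique(labels)])
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(features)
    return km.cluster_centers_, km.labels_
```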
(II) Adaptive hyper-sphere soft assignment method
The invention proposes a fast adaptive soft-assignment coding method based on a hyper-sphere model: a chi-square model is introduced to identify the relevance between each visual word phrase and the classification result, visual words and phrases with little relevance are removed, and feature points are adaptively assigned to several neighboring cluster centers, which avoids adding new redundant information and maps features onto visual words and phrases through appropriate semantic-level association.
1. Soft assignment method
The soft assignment method provided by the invention flexibly configures the mapping between each local feature and the words in the feature dictionary, and adopts different coding coefficients to express the correlation between the features and the words.
The soft assignment method maps each SIFT feature point to J visual words and phrases that are closest to it, as shown in equation 5:
[Formula 5: soft assignment of each SIFT feature point to its J nearest visual words and phrases, rendered as an image in the original publication]
the above formula indicates that feature point m_b is assigned to the visual word phrase s_a nearest to it, and L is the total number of feature points,
the visual word and phrase frequency histogram using the soft-assignment method is transformed into:
[Formula 6: the visual word and phrase frequency histogram under soft assignment, rendered as an image in the original publication]
where the weighting factor sim(m_b, s) represents the similarity between feature point m_b and the visual word phrase s, estimated with a Gaussian distribution as shown in formula 7:
sim(m_b, s) = exp(−‖m_b − s‖² / (2d))   Formula 7
where d is the variance of the Gaussian distribution function.
Compared with hard assignment, soft assignment changes the number of assigned words from a fixed single word to several nearest-neighbor cluster centers, which avoids the drawback of hard assignment.
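A minimal sketch of soft assignment under these definitions (the exact forms of formulas 5 to 7 are rendered as images in the original, so the Gaussian weighting below is an assumption; names are hypothetical):

```python
import numpy as np

def soft_assign_histogram(features, words, J=3, d=1.0):
    """Illustrative sketch: each feature votes for its J nearest visual word
    phrases with Gaussian similarity weights instead of a single hard vote."""
    features = np.asarray(features, dtype=float)
    words = np.asarray(words, dtype=float)
    hist = np.zeros(len(words))
    for m in features:
        dists = np.linalg.norm(words - m, axis=1)
        nearest = np.argsort(dists)[:J]                   # J nearest visual word phrases
        sims = np.exp(-dists[nearest] ** 2 / (2 * d))     # Gaussian similarity (assumed form)
        hist[nearest] += sims / (sims.sum() + 1e-12)      # normalized soft votes
    return hist / max(len(features), 1)
```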
2. Mining and removing visual null words
The invention adapts the method of filtering useless function words in text classification to picture classification: noisy visual word phrases are treated as visual null words, and assigning features to these words is useless. The invention introduces a chi-square model to mine and analyze the visual feature dictionary, finds the visual null words least correlated with the classification categories, and removes them to obtain an optimized word set with a higher recognition rate and improved assignment efficiency.
In data modeling, the chi-square model analyzes the independence of two random variables; in picture classification, each visual word phrase and the classification category are taken as the two random variables, and the chi-square model is introduced to evaluate the degree of correlation between them. The larger the chi-square value, the greater the contribution of the corresponding word to the classification result; the smaller the chi-square value, the smaller the contribution. In addition, removing visual null words from the high-dimensional visual feature dictionary also reduces its dimensionality, which correspondingly lowers the algorithm's complexity.
The chi-square estimation formula for feature dictionary optimization in picture classification is shown in formula 8:
[Formula 8: the chi-square estimation formula for feature dictionary optimization, rendered as an image in the original publication]
wherein R is the total number of pictures in the training library, j_+f is the total number of pictures in class S_f, j_1f is the number of pictures in class S_f that contain the word s_b, j_2f is the number of pictures in class S_f that do not contain the word s_b, j_1+ is the number of pictures in the training library that contain the word s_b, j_2+ is the number of pictures in the training library that do not contain the word s_b, and L is the total number of feature points.
The chi-square value corresponding to each visual word and phrase is calculated according to formula 8 and sorted; a threshold number H of visual null words to remove is set, the useless words corresponding to the H smallest chi-square values are removed to obtain a new visual feature dictionary, and the feature mapping assignment is then performed on the new feature dictionary.
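A sketch of this null-word removal step, using scikit-learn's chi-square scoring as a stand-in for formula 8 (the statistic may differ in detail from the one shown as an image above; names are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import chi2

def remove_visual_null_words(hist_matrix, labels, H):
    """Illustrative sketch: score each visual word phrase against the picture
    classes with a chi-square statistic and drop the H least relevant words.
    hist_matrix is a (pictures x words) non-negative count matrix."""
    scores, _ = chi2(hist_matrix, labels)
    scores = np.nan_to_num(scores)
    keep = np.sort(np.argsort(scores)[H:])                # indices of retained word phrases
    return hist_matrix[:, keep], keep
```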
3. Fast adaptive soft assignment based on the hyper-sphere model
The new visual feature dictionary optimized by the chi-square model no longer contains redundant word phrases, and the simplified word phrase set represents the picture content better. Because the image content of a picture is two-dimensional while the picture features are multi-dimensional, the invention constructs, in the multi-dimensional feature space, a hyper-sphere centered on each new visual word phrase whose radius is the distance from the center to the farthest point of its cluster; a feature point is assigned to the centers of all hyper-spheres that contain it, achieving adaptive soft assignment.
(1) Specific assignment procedure
If the feature space contains L clustering centers, the assignment vector of each feature over the clusters of the visual word phrases is r_b = (r_b1, ..., r_bL), where r_bL indicates whether feature point m_b is assigned to visual word phrase s_L; r_ba = 0 means the feature is not similar to visual word phrase s_a and is not assigned to it, and r_ba = 1 means the feature is similar to s_a and is assigned to it. The criterion for deciding whether to assign is whether the distance between the feature point and the visual word phrase is less than or equal to the radius of the hyper-sphere in which that visual word phrase lies.
The hyper-sphere soft assignment process is illustrated in two-dimensional space in Fig. 4: i_1, i_2 and i_3 are three feature points to be assigned from the pictures to be classified, and a, b, c and d are the determined visual word and phrase centers. Fig. 4 indicates the sphere range constructed around each cluster center, and the feature points are assigned to the corresponding visual words and phrases according to whether they fall inside a given circle: feature point i_1 lies only inside circle a and is assigned only to a; feature point i_2 lies in the intersection of circles b and c and is assigned to the two visual word phrases b and c; feature point i_3 lies in the intersection of circles b, c and d and is assigned to the three visual word phrases b, c and d.
For a feature such as i_1 located in a single circle, the number of assignments is 1, which is equivalent to hard assignment; for features located in the intersection of several circles, soft assignment is used, so every feature point is assigned to one or more nearest visual words and phrases. This assignment method requires no manual intervention and adapts well to different pictures.
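A minimal sketch of this membership test, assuming the cluster centers and hyper-sphere radii have already been computed (the fallback for features lying in no sphere is an assumption; names are hypothetical):

```python
import numpy as np

def hypersphere_assign(features, centers, radii):
    """Illustrative sketch: a feature is assigned to every visual word phrase
    whose hyper-sphere (center, radius) contains it."""
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    r = (dists <= radii[None, :]).astype(int)        # r[b, a] = 1 if feature b lies inside sphere a
    empty = r.sum(axis=1) == 0                        # features contained in no sphere
    r[empty, np.argmin(dists[empty], axis=1)] = 1     # assumed fallback: nearest center
    return r, dists
```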
(2) Weight assignment
The hyper-sphere radius of each visual word phrase is different, and a feature lying in the intersection of several spheres may be nearer to or farther from each sphere center, so the weight with which the feature is assigned to each visual word and phrase should be related to this distance. As in soft assignment, the weight of a feature's assignment to each visual word and phrase is preset to follow a Gaussian mixture distribution of the distance between the feature point and the clustering center. The assignment weight coefficient E_ba denotes the weight with which feature i_b is assigned to visual word phrase s_a, and is calculated according to formula 9,
E_ba = r_ba · exp(−g_ba² / (2d²))   Formula 9
In formula 9, r_ba indicates whether feature point i_b is assigned to visual word phrase s_a, g_ba is the distance from feature point i_b to visual word phrase s_a, and d is the standard deviation of the Gaussian mixture distribution, whose optimal value is determined empirically.
To normalize the weight vector for final processing, formula 9 is modified into formula 10,
E_ba = r_ba · exp(−g_ba² / (2d²)) / Σ_a r_ba · exp(−g_ba² / (2d²))   Formula 10
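Continuing the sketch above, the Gaussian weighting and normalization of formulas 9 and 10 could look as follows (the exact formulas are rendered as images in the original, so this form is an assumption):

```python
import numpy as np

def assignment_weights(r, dists, d=1.0):
    """Illustrative sketch: Gaussian weights over the assigned (in-sphere)
    centers, normalized per feature; r and dists come from hypersphere_assign."""
    raw = r * np.exp(-dists ** 2 / (2 * d ** 2))      # assumed form of formula 9
    sums = raw.sum(axis=1, keepdims=True)
    return np.divide(raw, sums, out=np.zeros_like(raw), where=sums > 0)  # normalization (formula 10)
```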
Aiming at the deficiencies of visual word generation and mapping in the bag-of-words model, the invention proposes an improvement for each stage. In the word clustering generation stage, hierarchical clustering analysis is introduced so that visual word phrases are clustered layer by layer in a tree structure, matching the local features of the picture from coarse to fine. In the word mapping and assignment stage, visual word phrases in the visual feature dictionary that are useless for expressing the picture are discarded, the picture features are associated with the spatial positions of the visual word phrase centers, and the adaptive hyper-sphere soft assignment method is proposed. Experimental results show that hierarchical clustering classifies well those pictures whose content is simple and composed of parts forming a whole, and that adaptive hyper-sphere soft assignment achieves a good classification effect when the optimization coefficient is properly set.
Fourthly, bag-of-words model for aggregating spatial semantic information
The improvements to the three steps of bag-of-words feature extraction, word clustering and feature mapping assignment show that the prior-art bag-of-words representation relies only on the surface statistics of local features, neglects the relationships between visual word phrases, and does not effectively integrate the spatial and semantic information of picture features into the classification process. Based on this consideration, the invention extracts the necessary spatial semantic information among co-occurring features to form visual phrases, and finally aggregates words and phrases as the basis for picture classification.
Because sample differences make the positional spatial information of a spatial pyramid unstable, the invention seeks a representation that aggregates positional spatial information into the picture more appropriately. The invention proposes a phrase-bag model in which visual words and visual phrases aggregating spatial information complement each other, by mining the common rules of the picture set rather than imposing artificial spatial constraints on the visual words and phrases. The phrase-bag model adds visual phrase distribution histogram vectors to the bag-of-words model to express the relative positions between visual words, thereby lifting visual words to visual phrases with higher-level semantic information and enhancing the spatial discrimination of picture features.
(I) Concept of visual phrases
The bag-of-words model from text classification can be introduced into the picture visual classification algorithm: the bag-of-words model discards word order, grammar and the like in text understanding and regards a text only as a set of the words that appear in it, assuming that the probability of a word occurring in the text is independent and unaffected by other sentences or words. In the bag-of-words model for picture classification, a picture is likewise regarded as a collection of mutually independent visual word and phrase combinations. However, this assumption often does not hold, because certain sub-blocks of a picture tend to appear together, such as the facial features of a person or the essential parts of a machine. These sub-pictures cannot be represented by a single local feature; they must be identified as the corresponding objects through a series of feature points in a certain spatial arrangement, so certain visual word phrases co-occur with other visual word phrases, which makes the picture identification capability more stable and clear. By mining the rules of spatial position association among visual words, the association between words is strengthened and word subsets with high co-occurrence association are obtained from the pictures; these are called visual phrases.
Therefore, when constructing visual phrases, it is necessary to find subsets of words in the picture with strong co-occurrence association. Based on the content of many picture samples and general logical cognition, feature phrases composed of visual features that are close together carry more meaning than those composed of visual features that are far apart; the invention therefore considers associated feature word combinations in pictures to be close in position. To find word combinations at close positions, the invention searches, within a neighborhood centered on each word, for binary visual phrases that can form a stable spatial relation; if the number of occurrences of a closely positioned, co-occurring high-frequency visual word phrase pair exceeds a preset threshold, the binary visual phrase is considered a meaningful visual phrase. In this way both the global spatial position information and the semantic relation among the picture contents are taken into account.
(II) creation of visual phrases
The extraction range of the spatial information determines the scale of the visual phrases: if the extraction range is too large, the precision of the mined spatial associations is lost, and if it is too small, important spatially co-occurring phrases are missed. If the number of visual word phrases is L and words are combined pairwise over the whole picture, the scale of the visual phrase set is L × (L − 1) / 2 (for example, L = 1000 already yields 499,500 candidate phrases); a visual phrase feature dictionary of such a scale brings a huge amount of computation to the subsequent work, and not all combined visual phrases in a picture carry practical meaning for the picture's semantic information.
Based on these two considerations, and drawing on the idea of shape features, the invention sets the neighborhood as concentric circles centered on the visual word phrase, as shown in Fig. 5, divided into an inner circle region and an outer circle region: the inner region is close to the center and its co-occurrence threshold is set low, while the outer region is relatively far and its co-occurrence threshold is set high. This conforms to the principle that associated word combinations are close in position; it both preserves word pairs appearing over a larger range and increases the weight of closely attached word pairs, and the circular structure also overcomes the influence of rotation and translation of the picture content.
The specific steps for constructing the visual phrase are as follows:
step a, letting the visual phrase sets of the inner circle region and the outer circle region be M_L and M_W respectively, the visual words be S_1, S_2, ..., S_L, a visual phrase be denoted E(S_i, S_j), the inner-circle co-occurrence threshold be H_L, and the outer-circle co-occurrence threshold be H_W;
step b, taking any local feature X_r of the picture as the circle center, constructing concentric circles with radii r_1 and r_2 so that the neighborhood is divided into the two sub-regions shown in Fig. 5, finding the F neighboring local features {X_r^(1), X_r^(2), ..., X_r^(F)} falling within the circles, forming the local feature pairs (X_r, X_r^(1)), (X_r, X_r^(2)), ..., (X_r, X_r^(F)), and obtaining the corresponding visual phrases E(S_i, S_j) according to the mapping method;
step c, searching the visual phrase set of each region: if E(S_i, S_j) is not present, adding it to the set and setting its occurrence count N[E(S_i, S_j)] = 1; if E(S_i, S_j) is already present, updating N[E(S_i, S_j)] = N[E(S_i, S_j)] + 1;
step d, filtering out the visual phrases of the inner circle region whose co-occurrence count is lower than H_L and removing the visual phrases of the outer circle region whose co-occurrence count is lower than H_W, the phrases retained in M_L and M_W forming the final phrase set M;
and step e, moving the circle center to the next visual feature and repeating the pairing, counting and filtering steps until all features have been traversed.
Finally, the visual phrases are combined with the previous visual word feature dictionary to form a dictionary containing both independent visual words and the richer visual phrases, completing the construction of the full visual feature dictionary.
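A compact sketch of steps a to e under these definitions (feature coordinates, word assignments, radii and thresholds are inputs; each unordered pair is counted once from each of its two centers, mirroring the per-feature traversal; names are hypothetical):

```python
import numpy as np
from collections import Counter

def build_visual_phrases(points, word_ids, r1, r2, H_L, H_W):
    """Illustrative sketch: count co-occurring word pairs in the inner circle
    (radius r1) and outer circle region (r1..r2), keep pairs above the thresholds."""
    points = np.asarray(points, dtype=float)
    inner, outer = Counter(), Counter()
    for p, w in zip(points, word_ids):
        dists = np.linalg.norm(points - p, axis=1)
        for j in np.where((dists > 0) & (dists <= r2))[0]:    # neighbors of the current circle center
            pair = tuple(sorted((int(w), int(word_ids[j]))))  # unordered binary visual phrase E(Si, Sj)
            if dists[j] <= r1:
                inner[pair] += 1
            else:
                outer[pair] += 1
    phrases = {pair for pair, n in inner.items() if n >= H_L}
    phrases |= {pair for pair, n in outer.items() if n >= H_W}
    return phrases
```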
In classification, the weights of visual words and visual phrases are regulated, because pictures with different contents carry different amounts of spatial information. A spatial information weight E is set to balance the contributions of visual words and visual phrases to the representation of the picture content. In the classification process, a proper weight ratio is chosen according to the compositional structure of the picture content to achieve a good classification effect.
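One possible way to combine the two histograms with such a weight, shown only as a hedged sketch (whether E scales the word part or the phrase part, and the normalization, are assumptions):

```python
import numpy as np

def word_phrase_representation(word_hist, phrase_hist, E=0.5):
    """Illustrative sketch: normalize the visual word and visual phrase
    histograms and concatenate them under the spatial information weight E."""
    w = np.asarray(word_hist, dtype=float)
    p = np.asarray(phrase_hist, dtype=float)
    w = w / w.sum() if w.sum() > 0 else w
    p = p / p.sum() if p.sum() > 0 else p
    return np.concatenate([(1.0 - E) * w, E * p])
```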
The bag-of-words model that aggregates spatial semantic information mines the co-occurrence between visual words and phrases through the combination association rules between words in the text feature dictionary, and introduces the positional distribution of each word relative to the surrounding visual words and phrases to form visual phrases. Visual phrases carry a higher semantic level; visual phrase histogram vectors are added to the bag-of-words model to help express the picture, and the semantics contained in the visual feature dictionary aggregating words and phrases are hierarchically matched, providing more reliable information for picture classification. Experimental data analysis shows that the visual phrase representation is particularly suitable for strongly structured picture content and achieves higher classification accuracy than the prior-art bag-of-words model.
Aiming at the obvious deficiencies of the prior-art bag-of-words model, which represents pictures with SIFT vectors alone, draws picture features from a single source, and lacks weight information to express different content regions, the invention proposes, in the feature extraction and expression stage, an aggregated representation that combines background and foreground features. The picture is divided into foreground and background regions by the visual saliency analysis method, features are extracted from the two regions to build their respective visual feature dictionaries, and the word histograms of the two parts are aggregated with certain weights to represent the picture, enriching the semantics of the feature expressors. On the basis of analyzing the clustering algorithm and assignment method of the prior-art bag-of-words model, the invention proposes an improvement for each deficiency. By introducing hierarchical clustering analysis, the clustering of visual words and phrases is turned into a tree structure, which better avoids the problems caused by poor selection of the initial centers in the mean clustering algorithm and better fits the whole-to-part analysis of a picture. In the assignment stage, to solve the redundancy caused by polysemous mapping of visual word phrases and by visual word phrases carrying useless information in the feature dictionary, a certain number of visual null words are removed and the spatial relation between picture features and visual word phrase centers is considered comprehensively, yielding the adaptive hyper-sphere soft assignment method. Compared with the prior-art mean clustering algorithm and hard assignment method, these two parts obtain more satisfactory classification results and clearly accelerate picture classification. The invention further introduces the spatial arrangement of visual words and phrases in the picture through a series of association rules and proposes that the bag-of-words features be expressed simultaneously by words and phrases; the words forming a phrase provide each other with picture context, which removes the ambiguity of a single word. The improved bag-of-words model expresses pictures with visual words and visual phrases together, and the expression weights of the two parts are selected according to the spatial structure of the picture. Experimental results show that the improvement greatly increases the classification expression capability of the bag-of-words model, with greater classification depth, higher accuracy, faster classification speed and better robustness.

Claims (10)

1. The visual word and phrase co-driven bag-of-words model picture classification method is characterized in that a picture is regarded as an element set, elements in the element set are discrete visual word and phrase combinations, the probability of different visual words and phrases appearing in the set is respectively counted to obtain corresponding frequency histogram vectors, the frequency histogram vectors are equivalent representations of the picture in a bag-of-words model angle, and finally the frequency histogram vectors are introduced into a classifier for training and classification; the method comprises the following specific steps:
firstly, extracting picture features with foreground and background aggregation; the foreground and background aggregated picture feature extraction and expression method is based on the human visual attention mechanism and divides the picture into a visually salient region and a non-salient region, wherein the visually salient region is the foreground and the non-salient region is the background, the foreground contains the prominent representative content of the picture, and the background contains the environmental factors of the picture;
secondly, performing polymerization expression on the visual feature word bag; clustering the multidimensional space vectors through a clustering algorithm, wherein each clustering center is an independent word phrase, and a visual feature dictionary is formed after combination for subsequent feature mapping and searching;
thirdly, generating mapping of visual word phrases; the picture features are assigned to word phrases corresponding to the visual feature dictionary, visual words and phrases with the closest distance to the feature vectors in the pictures are searched in a vector space and then assigned to corresponding words, each picture is represented as a word phrase vector with K dimension, and K is the number of the previously set clustering centers;
fourthly, training and classifying by a classifier; and taking the obtained K-dimensional vector as the input of a classifier, and training and classifying the classifier for picture classification.
2. The visual word and phrase co-driven bag-of-words model picture classification method according to claim 1, characterized in that, in the first step, the foreground and background aggregated picture feature extraction adopts a picture feature extraction method based on the visual attention mechanism, and the visually salient region is extracted as follows: firstly, establishing a 9-layer Gaussian pyramid of the picture from the three dimensions of direction, color and brightness, secondly, extracting direction, color and brightness features from each layer of the Gaussian pyramid to form a feature pyramid, thirdly, carrying out scale-by-scale subtraction in the multi-scale space to obtain a feature distribution map centered on the salient target, and fourthly, establishing a Markov chain of the two-dimensional picture with a Markov random field to obtain the final visually salient region discrimination map of the picture.
3. The visual word and phrase co-driven bag-of-word model picture classification method according to claim 1, characterized in that, in the second step, the visual feature bag aggregation expression is used for aggregating foreground features and background features to express picture contents, the visual feature dictionary is divided into a foreground feature dictionary generated by foreground SIFT features and a background feature dictionary generated by background dense SIFT features, and finally, the histogram obtained by mapping the two feature dictionaries is weighted and aggregated to perform picture classification judgment; the method specifically comprises the following steps: dense SIFT expression sub-sampling, foreground feature dictionary generation, background feature dictionary generation and aggregated feature generation.
4. The visual word and phrase co-driven bag-of-words model picture classification method according to claim 3, characterized in that the dense SIFT expressors adopt a uniform sampling mode, pixel interval size is set to control sampling density, and feature extraction is performed on pictures window by window;
after feature points are extracted at intervals, the same scale C is set for all feature points, the picture is adjusted to horizontal 0 degrees, a circle is drawn with the feature point as the center and the set scale C as the radius, the pixel points falling within the circle are uniformly divided into 4×4 non-overlapping sub-regions, the angular coordinates within each sub-region are divided at intervals of 45 degrees, the angle histogram of each sub-region in each direction is then counted, and the generated feature expressor is represented by a 128-dimensional vector;
the intensive SIFT adopts a mode of uniformly extracting feature points, adopts multi-scale extraction to recover scale invariance, expresses the overall general picture of the picture in a large scale, and captures partial details of the picture in a small scale.
5. The visual word and phrase co-driven bag-of-words model picture classification method as claimed in claim 3, wherein the foreground feature dictionary generation comprises the following specific steps:
step 1, extracting SIFT features from the foreground region in the picture, obtaining the visual feature dictionary corresponding to the foreground according to the clustering method, and marking it as A_q;
step 2, extracting SIFT expressors from the foreground content of the picture to be classified, and marking the set of all generated SIFT features as B_q;
step 3, mapping all feature points in B_q to the words closest to them in A_q according to the hard assignment method, and obtaining the visual word and phrase set corresponding to each picture after the mapping is completed;
step 4, recording the number of visual words and phrases appearing in each picture to obtain the corresponding frequency histogram, marked as D_q, without normalization.
6. The visual word and phrase co-driven bag-of-words model picture classification method of claim 3, wherein the dense SIFT adopts a mesh division method to divide the content into I×J checkerboard sub-blocks, each sub-block is expressed by SIFT, and the generation of the background feature dictionary specifically comprises the following steps:
step one, dividing the background region of each picture into an I×J grid to obtain background sub-regions;
step two, extracting a dense SIFT expression subset from the background sub-regions, clustering with the K-means clustering algorithm, setting L_2 clustering centers recorded as S_1, S_2, S_3, ..., S_L2, and gathering all centers into the visual feature dictionary of the background content corresponding to dense SIFT, marked as A_h;
step three, processing the background content blocks of the picture to be classified in grid-block form, extracting SIFT expressors from the sub-blocks, and marking them as B_h;
step four, for the feature quantization of B_h, mapping each feature to the corresponding word in A_h according to the mapping method;
step five, recording the number of visual words and phrases appearing in each picture to obtain the corresponding frequency histogram, marked as D_h, without normalization.
7. The visual word and phrase co-driven bag-of-words model picture classification method according to claim 3, wherein, when the features are aggregated, the foreground region accounts for the larger share of the content of the whole picture, so when aggregating the features of the foreground and background regions the weight assignment highlights the foreground region and relatively weakens the weight of the background, the feature aggregation weight assignment function being as shown in formula 1:
E_q = C × F^a   Formula 1
wherein C is a scaling parameter and a is an adjustment parameter, which jointly adjust the weight assignment, F is the foreground feature ratio coefficient, C is set to 1 and a to 0.4, and F is as shown in formula 2:
F = J_q / (J_q + J_h)   Formula 2
J_q is the number of SIFT feature points detected in the foreground region, J_h is the number of dense SIFT feature points in the background region, 0 < F < 1, and E_q, the weight of the foreground-region SIFT features, is the weight they occupy in the aggregated features; according to this assignment function, the weight share of the foreground SIFT features is raised under the adjustment of the power function.
The foreground weight E_q is determined by the number of foreground features: when the number of foreground feature points is close to zero, the number of background feature points is relatively close to infinity and E_q tends toward zero; when the number of foreground feature points increases and the number of background feature points decreases, F gradually increases and approaches 1. Let E_h be the background weight; then E_h = 1 − E_q, with E_h in the range [0, 1], and the expression for the final aggregated features is shown in formula 3:
D_qh = (E_q × D_q) >> (E_h × D_h)   Formula 3
wherein >> is the vector concatenation symbol that aggregates the foreground word representation vector and the background word representation vector; because the two vectors differ in length, normalization is required, and the result after processing is:
[Formula 4: the normalized aggregated representation, rendered as an image in the original publication]
and the feature bag-of-words aggregation aggregates the foreground and background parts with a flexible weight assignment method so that visual words and phrases jointly represent the picture.
8. The visual word and phrase co-driven bag-of-words model picture classification method according to claim 1, characterized in that, in the third step, in the visual word and phrase generation mapping, the visual word and phrase hierarchical clustering specifically comprises:
step 1, setting the features extracted from i picture samples for visual word and phrase clustering, wherein each feature has dimension j, and the i j-dimensional vectors form a matrix with j rows and i columns;
step 2, calculating the Euclidean distance between the feature points, finding out the corresponding feature point when the distance is minimum, and combining the feature points to form a cluster;
step 3, calculating the average value of the two characteristic points in the step 2 to form a new cluster center;
step 4, repeating step 2 and step 3 until the preset hierarchical-analysis proportion E of the whole algorithm is satisfied, ending the hierarchical clustering, and taking the obtained cluster centers as the initial visual word phrases;
step 5, for the characteristic points which are not calculated in the step 2, calculating the distance between the characteristic points and each visual word phrase, finding out the minimum distance, and assigning the characteristic points to the cluster closest to the characteristic points;
step 6, updating the center of each cluster and the number of characteristic points in the cluster, and updating the clustering characteristic value of the cluster;
step 7, repeating the step 5 and the step 6 until the data convergence in the cluster is unchanged;
and step 8, finishing the hierarchical analysis to obtain the initial clustering information, including the number of clusters and the center positions of the visual word phrases, and refining on the basis of the hierarchical clustering with the K-means clustering algorithm.
9. The visual word and phrase co-driven bag-of-words model picture classification method according to claim 8, characterized in that the invention proposes a fast adaptive soft-assignment coding method based on a hypersphere model, which identifies the correlation between each visual word and phrase and the classification result by introducing a chi-square model, removes the visual words and phrases with smaller correlation, adaptively assigns feature points to a plurality of neighboring clustering centers to avoid adding new redundant information, and maps the features onto the visual words and phrases by proper semantic level association;
the soft assignment method is to flexibly configure the mapping between each local feature and the words in the feature dictionary, and to express the correlation between the features and the words by adopting different coding coefficients;
the soft assignment method maps each SIFT feature point to J visual words and phrases that are closest to it, as shown in equation 5:
[Formula 5: soft assignment of each SIFT feature point to its J nearest visual words and phrases, rendered as an image in the original publication]
the above formula indicates that feature point m_b is assigned to the visual word phrase s_a nearest to it, and L is the total number of feature points,
the visual word and phrase frequency histogram using the soft-assignment method is transformed into:
[Formula 6: the visual word and phrase frequency histogram under soft assignment, rendered as an image in the original publication]
where the weighting factor sim(m_b, s) represents the similarity between feature point m_b and the visual word phrase s, estimated with a Gaussian distribution as shown in formula 7:
sim(m_b, s) = exp(−‖m_b − s‖² / (2d))   Formula 7
where d is the variance of the gaussian distribution function.
10. The visual word and phrase co-driven bag-of-words model picture classification method as claimed in claim 9, characterized in that the invention introduces a method of filtering useless virtual words in text classification into picture classification, and some noisy visual word phrases are treated as visual virtual words, which are useless if features are assigned to these words; the invention introduces a chi-square model to carry out mining analysis on a visual characteristic dictionary, finds out a visual imaginary word with the minimum correlation with classification categories, eliminates and optimizes the visual imaginary word to obtain a word set with higher identification rate, and improves allocation efficiency;
in data modeling the chi-square model analyzes the independence of two random variables; in picture classification each visual word phrase and the classification category are taken as the two random variables and imported into the chi-square model to evaluate their degree of correlation, and the larger the chi-square value, the greater the contribution of the corresponding word to the classification result; a smaller chi-square value means a smaller contribution of the corresponding word to the classification result; visual null words are removed from the high-dimensional visual feature dictionary, which also reduces its dimensionality;
the chi-square estimation formula for feature dictionary optimization in picture classification is shown in formula 8:
[Formula 8: the chi-square estimation formula for feature dictionary optimization, rendered as an image in the original publication]
wherein R is the total number of pictures in the training library, j_+f is the total number of pictures in class S_f, j_1f is the number of pictures in class S_f that contain the word s_b, j_2f is the number of pictures in class S_f that do not contain the word s_b, j_1+ is the number of pictures in the training library that contain the word s_b, j_2+ is the number of pictures in the training library that do not contain the word s_b, and L is the total number of feature points;
calculating the chi-square value corresponding to each visual word and phrase according to the formula 8, sorting, setting the threshold number H for removing visual null words, removing the useless words corresponding to the H minimum chi-square values to obtain a new visual feature dictionary, and allocating the feature mapping according to the new feature dictionary.
CN202010478642.8A 2020-05-29 2020-05-29 Visual word and phrase co-driven bag-of-words model picture classification method Withdrawn CN111652309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478642.8A CN111652309A (en) 2020-05-29 2020-05-29 Visual word and phrase co-driven bag-of-words model picture classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478642.8A CN111652309A (en) 2020-05-29 2020-05-29 Visual word and phrase co-driven bag-of-words model picture classification method

Publications (1)

Publication Number Publication Date
CN111652309A true CN111652309A (en) 2020-09-11

Family

ID=72347036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478642.8A Withdrawn CN111652309A (en) 2020-05-29 2020-05-29 Visual word and phrase co-driven bag-of-words model picture classification method

Country Status (1)

Country Link
CN (1) CN111652309A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663015A (en) * 2012-03-21 2012-09-12 上海大学 Video semantic labeling method based on characteristics bag models and supervised learning
CN103310208A (en) * 2013-07-10 2013-09-18 西安电子科技大学 Identifiability face pose recognition method based on local geometrical visual phrase description
CN103838864A (en) * 2014-03-20 2014-06-04 北京工业大学 Visual saliency and visual phrase combined image retrieval method
US20150269191A1 (en) * 2014-03-20 2015-09-24 Beijing University Of Technology Method for retrieving similar image based on visual saliencies and visual phrases
CN105303195A (en) * 2015-10-20 2016-02-03 河北工业大学 Bag-of-word image classification method
CN105718935A (en) * 2016-01-25 2016-06-29 南京信息工程大学 Word frequency histogram calculation method suitable for visual big data
CN107944454A (en) * 2017-11-08 2018-04-20 国网电力科学研究院武汉南瑞有限责任公司 A kind of semanteme marking method based on machine learning for substation
CN108764302A (en) * 2018-05-08 2018-11-06 中山大学 A kind of bill images sorting technique based on color characteristic and bag of words feature


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200911