WO2022148372A1 - Visual phrase construction method and apparatus based on image feature space and spatial-domain space

Info

Publication number: WO2022148372A1
Application number: PCT/CN2022/070305
Authority: WO
Inventors: 王亚楠
Original assignee: 瞬联软件科技(南京)有限公司
Priority date: 2021-01-05
Filing date: 2022-01-05
Publication date: 2022-07-14
Also published as: CN112668590A

Abstract

Provided are a visual phrase construction method and apparatus based on an image feature space and a spatial-domain space. The method comprises the following steps: extracting, from a target image, visual words meeting a preset condition, so as to form a visual word set (S101); selecting, from the visual word set, each key feature word in a target area in the target image (S102); for each key feature word, extracting, from the visual word set, a neighborhood feature word having a geometrical relationship with the key feature word, so that same and the key feature word form a corresponding visual phrase (S103); and on the basis of formed visual phrases, establishing a visual phrase set for describing the features of the target image (S104). A local feature space and a spatial-domain space of an image are combined to jointly construct a visual phrase, so that the ambiguity of the current visual phrase during an image matching process can be greatly reduced, and a visual phrase with higher discrimination is obtained.

Description

Visual phrase construction method and apparatus based on image feature space and spatial-domain space

Technical field

The invention relates to a visual phrase construction method based on an image feature space and an image spatial-domain space, and to a corresponding visual phrase construction apparatus. It belongs to the technical field of image recognition.

Background art

The extraction and expression of image visual features is the most basic and core part of image retrieval, segmentation and recognition algorithms. Discriminative features are of great significance to image retrieval, image segmentation and recognition.

According to how they are expressed, image features are usually divided into low-level, middle-level, and high-level features. Low-level features take elements such as edges, colors, and textures as their basic units, and higher-level features are generated by analyzing these underlying features. From the perspective of human cognition, understanding an image involves first of all highly abstract, high-level semantic features, and also simple low-level features. Human visual understanding of an image is therefore a process of acquiring semantic information at different levels and granularities.

Traditional image feature models often use low-level features such as feature points, edges, colors, and textures as basic units and build complex semantics and conceptual abstractions upward from them. However, low-level features and the local features constructed from them often suffer from synonymy and ambiguity: similar local features may be quantized to different local features, and dissimilar local features may be quantized to the same local feature. Semantic features extracted with deep learning also still lack high discrimination. In addition, image sources and types are diverse and complex, and images are affected by factors such as scale, illumination, viewing angle, and complex backgrounds, which results in a semantic gap between low-level features and high-level semantics. How to define highly discriminative image features and how to overcome their synonymy and ambiguity therefore remains an urgent problem.

Summary of the invention

The primary technical problem to be solved by the present invention is to provide a visual phrase construction method based on an image feature space and an image spatial-domain space.

Another technical problem to be solved by the present invention is to provide a visual phrase construction apparatus based on an image feature space and an image spatial-domain space.

In order to achieve the above object, the present invention adopts the following technical scheme:

According to a first aspect of the embodiments of the present invention, a method for constructing a visual phrase based on an image feature space and an image spatial-domain space is provided, including the following steps:

Extract the visual words that meet the preset conditions in the target image to form a visual word set;

Select each key feature word in the target area in the target image from the visual word set;

For each key feature word, extract a neighborhood feature word that has a geometric relationship with the key feature word from the visual word set, and form a corresponding visual phrase with the key feature word;

Based on the constituted visual phrases, a set of visual phrases describing the features of the target image is established.

Preferably, the extraction of visual words satisfying preset conditions in the target image to form a visual word set specifically includes the following steps:

Quantize the local features of the target image into visual words;

According to the categories of visual words, the frequency of occurrence of various visual words is counted, and visual words with a frequency higher than a preset frequency are selected to form a visual word set.

Preferably, the extraction of neighborhood feature words that have a geometric relationship with the key feature word in the visual word set specifically includes the following steps:

Draw a circle with the position of the current key feature word in the target image as the center and a predetermined distance as the radius;

Find the neighborhood feature word corresponding to the current key feature word in the visual word set; the position of the neighborhood feature word in the target image must be within the drawn circle.

Preferably, forming a corresponding visual phrase with the key feature word specifically includes the following steps:

Take the position of the current key feature word and the positions of any two corresponding neighborhood feature words as vertices to form a triangle;

After determining that the shortest side length of the triangle is greater than the preset side length and that the minimum angle of the triangle is greater than the preset angle, select the current key feature word corresponding to the triangle and the two corresponding neighborhood feature words as one visual phrase of the target image.

Preferably, based on each visual phrase formed, establishing a visual phrase set describing the feature of the target image specifically includes the following steps:

Classify each formed visual phrase;

Encode visual phrases of the same category;

According to the coding of each category of visual phrases, a set of visual phrases describing the characteristics of the target image is established.

Preferably, the classification of the formed visual phrases specifically includes the following steps:

Determine whether the positions of key feature words and neighborhood feature words in any two visual phrases in the target image can be aligned one-to-one;

If they can be aligned, the two visual phrases belong to the same type;

If they cannot be aligned, the two visual phrases are of different types.

Preferably, the encoding of the visual phrases of the same category specifically includes the following steps:

According to the position of the key feature word and the neighborhood feature word of the current category visual phrase in the target image, obtain the position information of the key feature word and the neighborhood feature word;

According to the position information, the corresponding key feature words or neighborhood feature words are encoded;

According to the encoding of the key feature words of the visual phrase of the current category and the encoding of the neighboring feature words, the encoding of the visual phrase of the current category is composed.

Preferably, according to the coding of each category of visual phrases, establishing a visual phrase set describing the characteristics of the target image specifically includes the following steps:

Count the frequency of occurrence of various visual phrases;

Compose the codes of visual phrases whose frequencies are higher than a predetermined frequency into a code set;

Let the set of codes be the set of visual phrases that describe the features of the target image.

Preferably, determining whether the positions of the key feature words and the neighborhood feature words in any two visual phrases in the target image can be aligned one-to-one specifically includes the following steps:

Obtain the encoding of the visual word to which the position of each key feature word and neighborhood feature word in the target image belongs;

According to the coding of the visual word to which the position belongs, calculate the minimum position distance of two corresponding key feature words or neighborhood feature words belonging to different visual phrases in the target image;

If the calculated minimum position distance is equal to zero, it is determined that the positions of the two corresponding key feature words or neighborhood feature words in the target image are aligned.

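According to a second aspect of the embodiments of the present invention, an apparatus for constructing visual phrases based on an image feature space and an image spatial-domain space is provided, including a processor and a memory, wherein the processor reads a computer program in the memory to perform the following operations:

Extract the visual words that meet the preset conditions in the target image to form a visual word set;

Select each key feature word in the target area in the target image from the visual word set;

For each key feature word, extract a neighborhood feature word that has a geometric relationship with the key feature word from the visual word set, and form a corresponding visual phrase with the key feature word;

Based on the formed visual phrases, a set of visual phrases describing the features of the target image is established.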

The visual phrase construction method and apparatus provided by the present invention make full use of the context constraints of the local feature space of the image and the effective information of the image spatial-domain space, and can greatly improve the accuracy of image visual feature extraction. Applying the method and apparatus to image recognition can greatly improve the accuracy of image retrieval, classification, and recognition.

Description of drawings

FIG. 1 is a schematic flowchart of a method for constructing a visual phrase provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the matching of two visual phrases in an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus for constructing a visual phrase provided by an embodiment of the present invention.

Detailed description

The technical content of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, the method for constructing a visual phrase based on an image feature space and an image spatial-domain space provided by an embodiment of the present invention includes the following steps:

101. Extract visual words that meet preset conditions in the target image to form a visual word set;

As shown in Figure 2, each small circle in the figure represents a visual word.

Specifically, it includes the following steps:

1011. Quantize the local features of the target image into visual words;

1012. According to the categories of visual words, count the frequency of occurrence of each category of visual word, and select the visual words whose frequency is higher than a preset frequency to form a visual word set $W(w_1, w_2, \ldots, w_n)$.

In one embodiment of the present invention, the target image is an image in a database. Visual words with a high frequency of occurrence in the target image are selected in order to avoid an insufficient number of visual phrases in some images.
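Steps 1011 and 1012 can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how local features are quantized, so nearest-centroid assignment against a pre-built codebook is assumed here, and the names `quantize_to_visual_words`, `frequent_visual_words`, `codebook`, and `min_count` are hypothetical.

```python
from collections import Counter

import numpy as np


def quantize_to_visual_words(descriptors, codebook):
    """Assign each local descriptor to the index of its nearest codebook centroid.

    Assumption: visual words are the centroids of a pre-built codebook.
    """
    # Pairwise squared Euclidean distances, shape (num_descriptors, num_words).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(dists, axis=1)


def frequent_visual_words(word_ids, min_count):
    """Return the visual word categories that occur more than min_count times."""
    counts = Counter(int(w) for w in word_ids)
    return {w for w, c in counts.items() if c > min_count}


# Toy data: 2-D descriptors and a 3-word codebook (both made up for illustration).
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
descriptors = np.array(
    [[0.1, 0.2], [9.5, 10.1], [0.3, -0.1], [19.0, 1.0], [0.0, 0.4]]
)

word_ids = quantize_to_visual_words(descriptors, codebook)
word_set = frequent_visual_words(word_ids, min_count=1)
```

In this toy run only the visual word that occurs more than once survives the frequency filter, which mirrors the selection of the set W in step 1012.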

102. Select each key feature word in the target area in the target image from the visual word set;
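The source does not define the shape of the target area or how it is chosen. Purely as an illustration, the sketch below assumes a rectangular target area and keeps the visual words whose positions fall inside it. The function name, the `(word_id, x, y)` tuple layout, and the `region` parameter are hypothetical.

```python
def select_key_feature_words(words, region):
    """Keep the words whose position lies inside a rectangular target area.

    words:  list of (word_id, x, y) tuples drawn from the visual word set.
    region: (x_min, y_min, x_max, y_max), an assumed rectangular target area.
    """
    x_min, y_min, x_max, y_max = region
    return [
        (w, x, y)
        for (w, x, y) in words
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


words = [(3, 5.0, 5.0), (7, 50.0, 8.0), (3, 12.0, 9.0)]
key_words = select_key_feature_words(words, region=(0.0, 0.0, 20.0, 20.0))
```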

103. For each key feature word, extract the neighborhood feature word that has a geometric relationship with this key feature word in the visual word set, and form a corresponding visual phrase with this key feature word;

Specifically, it includes the following steps:

1031. Take the position of the current key feature word in the target image as the center of the circle, and draw a circle with a predetermined distance as a radius;

In an embodiment of the present invention, the visual phrase is formed, by way of example, from a stable triangular structure of local co-occurrence in the image spatial domain. Each key feature word is represented in the target image as a micro-region $o_1$ (one of the small circles in FIG. 2). A circle is drawn with the current micro-region $o_1$ as the center and the predetermined distance as the radius.

1032. Find the neighborhood feature word corresponding to the current key feature word in the visual word set; the position of the neighborhood feature word in the target image must be within the drawn circle.

Within the drawn circle, find two other small circles to serve as the neighborhood feature words of the current micro-region $o_1$.
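The circle test in steps 1031 and 1032 can be sketched as a simple radius search. This is an illustrative sketch only: treating the circle boundary as inside, and skipping candidates that coincide with the key position, are assumptions, and the names used are hypothetical.

```python
import math


def neighbors_within_radius(key_pos, candidates, radius):
    """Return candidates whose position lies inside the circle around key_pos.

    candidates: list of (word_id, x, y) tuples from the visual word set.
    Candidates at distance zero are skipped so that the key feature word
    is not returned as its own neighbor (an assumption of this sketch).
    """
    kx, ky = key_pos
    result = []
    for word_id, x, y in candidates:
        d = math.hypot(x - kx, y - ky)
        if 0.0 < d <= radius:
            result.append((word_id, x, y))
    return result


candidates = [(1, 0.0, 0.0), (2, 3.0, 4.0), (5, 6.0, 0.0), (9, 30.0, 30.0)]
neighbors = neighbors_within_radius((0.0, 0.0), candidates, radius=10.0)
```

For images with many visual words, a spatial index would avoid the linear scan, but the linear version keeps the geometric condition easy to read.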

1033. Use the position of the current key feature word and the positions of any two corresponding neighborhood feature words as vertices to form a triangle;

Taking the three small circles found in the above steps (the current micro-region $o_1$ and the two corresponding neighborhood feature words) as vertices, a triangle is formed, and the triangle represents a visual phrase.

1034. After determining that the shortest side length of the triangle is greater than the preset side length and that the minimum angle of the triangle is greater than the preset angle, select the current key feature word corresponding to the triangle and the two corresponding neighborhood feature words as a visual phrase of the target image.

The preset side length and the preset angle are calculated in advance. Among the triangles formed, those whose minimum angle is too small or whose shortest side is too short are removed, so that the visual phrases are as regular as possible in the image spatial domain.
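The triangle construction and the two geometric filters can be sketched as below. This is a hedged illustration under stated assumptions: every pair of neighbors is tried, angles are computed with the law of cosines, and the thresholds `min_side` and `min_angle_deg` are hypothetical stand-ins for the preset side length and preset angle, whose values the source does not give.

```python
import math
from itertools import combinations


def side_lengths(p0, p1, p2):
    """Return the three side lengths; side i is opposite vertex i."""
    pts = (p0, p1, p2)
    return [math.dist(pts[(i + 1) % 3], pts[(i + 2) % 3]) for i in range(3)]


def interior_angles(sides):
    """Return interior angles in degrees; angle i is opposite side i."""
    angles = []
    for i in range(3):
        a, b, c = sides[i], sides[(i + 1) % 3], sides[(i + 2) % 3]
        # Law of cosines; clamp to guard against rounding outside [-1, 1].
        cos_a = (b * b + c * c - a * a) / (2.0 * b * c)
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_a)))))
    return angles


def build_visual_phrases(key_pos, neighbor_positions, min_side, min_angle_deg):
    """Form triangles from the key position and each pair of neighbors, then filter."""
    phrases = []
    for n1, n2 in combinations(neighbor_positions, 2):
        sides = side_lengths(key_pos, n1, n2)
        # Reject short sides first; this also rules out zero-length sides.
        if min(sides) <= max(min_side, 0.0):
            continue
        if min(interior_angles(sides)) <= min_angle_deg:
            continue
        phrases.append((key_pos, n1, n2))
    return phrases


phrases = build_visual_phrases(
    (0.0, 0.0),
    [(4.0, 0.0), (0.0, 3.0), (8.0, 0.1)],
    min_side=1.0,
    min_angle_deg=25.0,
)
```

In this toy input only the 3-4-5 right triangle passes both filters; the nearly collinear triangle and the one with a sharp angle are discarded.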

104. Based on each formed visual phrase, establish a visual phrase set describing the feature of the target image;

Specifically, it includes the following steps:

1041. Classify each formed visual phrase, which specifically includes the following steps:

10411. Determine whether the positions of key feature words and neighborhood feature words in any two visual phrases in the target image can be aligned one-to-one;

Specifically, it includes the following steps:

104111. Obtain the coding of the visual word to which the position of each key feature word and neighborhood feature word in the target image belongs;

In an embodiment of the present invention, it is assumed that the three small circles of the current visual phrase (triangle) are denoted a, b, and c; the codes of the visual words to which a, b, and c belong are then $vw_a$, $vw_b$, and $vw_c$, respectively.

104112. According to the coding of the visual word to which the position belongs, calculate the minimum position distance of two corresponding key feature words or neighborhood feature words belonging to different visual phrases in the target image;

As shown in FIG. 2, two small circles belonging to two visual phrases (the positions at the two ends of a horizontal line in the figure) are matched. The calculation formula is:

$$D_{vp} = \min \sum_{i \in A,\ j \in B} \left| vw_i - vw_j \right|, \quad i, j = 1, 2, 3 \tag{1}$$

In formula (1), A and B represent two different visual phrases; vw is the code of the visual word to which the vertex of the visual phrase belongs.

104113. If the calculated minimum position distance is equal to zero, then determine that the positions of the two corresponding key feature words or neighborhood feature words are aligned in the target image.

In an embodiment of the present invention, if $D_{vp} = 0$, the three vertices (small circles) of the two visual phrases are aligned one-to-one.

10412. If they can be aligned, the two visual phrases belong to the same type;

10413. If they cannot be aligned, the two visual phrases are of different types.

In the embodiment of the present invention, if the three vertices of two visual phrases can be aligned one-to-one, the two visual phrases belong to the same type. This is because, when two visual phrases are matched, once the three vertices are aligned, the corresponding angles and sides are aligned as well.
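The alignment test can be sketched as follows. The formula in the source leaves the exact meaning of the minimum open; this sketch assumes it is taken over all one-to-one correspondences between the three vertices of phrase A and the three vertices of phrase B. The function names are hypothetical, and the difference of word codes is only meaningful here as a test for zero.

```python
from itertools import permutations


def phrase_distance(codes_a, codes_b):
    """Smallest summed |vw_i - vw_j| over all one-to-one vertex correspondences.

    Assumption: the minimum ranges over the six ways of pairing the three
    vertices of one phrase with the three vertices of the other.
    """
    return min(
        sum(abs(a - b) for a, b in zip(codes_a, perm))
        for perm in permutations(codes_b)
    )


def same_category(codes_a, codes_b):
    """Treat two phrases as the same type when the distance is zero."""
    return phrase_distance(codes_a, codes_b) == 0


d_same = phrase_distance((12, 7, 31), (31, 12, 7))  # same words, different order
d_diff = phrase_distance((12, 7, 31), (12, 7, 30))  # one vertex differs
```

Under this reading, two phrases fall into the same type exactly when their vertices carry the same multiset of visual word codes.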

1042. Encode visual phrases of the same class;

Specifically, it includes the following steps:

10421. According to the position of the key feature word and the neighborhood feature word of the current category visual phrase in the target image, obtain the position information of the key feature word and the neighborhood feature word;

In one embodiment of the present invention, each vertex (small circle) belongs to a key feature word or a neighborhood feature word, so the position information of the key feature word and the neighborhood feature words refers to the position information of the vertices in the target image. This position information includes the visual word to which the vertex belongs, the angle at the vertex, and the side opposite the vertex.

10422. According to the position information, the corresponding key feature words or neighborhood feature words are encoded;

Taking vertex a as an example, obtain the visual word of vertex a, the angle at vertex a, and the length of the opposite side, which gives the code of vertex a: $v_a = \{vw_a, ang_a, eg_a\}$;

Here, $vw_a$ is the code of the visual word to which vertex a belongs, $ang_a$ is the normalized angle code of the angle at vertex a, and $eg_a$ is the normalized side-length code of the side opposite vertex a.

10423. According to the encoding of the key feature words and the neighborhood feature words of the current category visual phrase, the encoding of the current category visual phrase is composed.

Based on the codes of the vertices a, b, and c, the code of the visual phrase to which they belong is determined: $vp = \{v_a, v_b, v_c\}$.
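A possible encoding along these lines is sketched below. The source does not specify how the angle and side length are normalized, so this sketch assumes the angle is divided by 180 degrees, the side length by the triangle perimeter, and both are quantized into a fixed number of bins. The bin count, the sorting of vertex codes, and all function names are hypothetical choices made for illustration.

```python
import math


def encode_vertex(word_id, angle_deg, opposite_side, perimeter, bins=8):
    """Encode one vertex as (vw, ang, eg) with binned, normalized angle and side."""
    ang_code = min(int(angle_deg / 180.0 * bins), bins - 1)
    eg_code = min(int(opposite_side / perimeter * bins), bins - 1)
    return (word_id, ang_code, eg_code)


def encode_phrase(word_ids, points, bins=8):
    """Encode a triangle phrase as a sorted tuple of its three vertex codes."""
    # Side i is the side opposite vertex i.
    sides = [math.dist(points[(i + 1) % 3], points[(i + 2) % 3]) for i in range(3)]
    perimeter = sum(sides)
    codes = []
    for i in range(3):
        a, b, c = sides[i], sides[(i + 1) % 3], sides[(i + 2) % 3]
        # Law of cosines gives the interior angle at vertex i.
        cos_a = (b * b + c * c - a * a) / (2.0 * b * c)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        codes.append(encode_vertex(word_ids[i], angle, a, perimeter, bins))
    # Sorting makes the code independent of the order in which vertices are listed.
    return tuple(sorted(codes))


vp = encode_phrase((12, 7, 31), ((0.0, 0.0), (5.0, 0.0), (1.0, 3.0)))
```

Because the angle and side codes are normalized, two triangles with the same visual words and the same shape receive the same code regardless of scale, which is one plausible reason for normalizing.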

1043. According to the coding of each category of visual phrases, establish a set of visual phrases describing the characteristics of the target image;

Specifically, it includes the following steps:

10431. Count the frequency of occurrence of various visual phrases;

10432. Form codes of visual phrases with frequencies higher than a predetermined frequency into code sets;

10433. Let the set of codes be a set of visual phrases that characterize the target image.

In an embodiment of the present invention, the frequency of occurrence of every category of visual phrase in the target image is counted, and the visual phrases with a higher frequency of occurrence are selected as the features of the image, forming a set of visual phrases $VP(vp_1, vp_2, \ldots, vp_n)$.
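Steps 10431 to 10433 amount to counting phrase codes and keeping the frequent ones. A minimal sketch follows, assuming each phrase code is a hashable value such as a tuple; `build_phrase_set` and `min_count` are hypothetical names, and the threshold value is not given in the source.

```python
from collections import Counter


def build_phrase_set(phrase_codes, min_count):
    """Keep the codes of phrase categories that occur more than min_count times."""
    counts = Counter(phrase_codes)
    return {code for code, c in counts.items() if c > min_count}


# Made-up phrase codes; each code is a tuple of three vertex codes (vw, ang, eg).
code_a = ((7, 1, 1), (12, 3, 3), (31, 3, 3))
code_b = ((2, 2, 2), (5, 2, 2), (9, 2, 2))
code_c = ((1, 0, 0), (4, 3, 3), (8, 3, 3))

vp_set = build_phrase_set([code_a, code_b, code_a, code_c, code_a, code_b], min_count=1)
```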

The set VP (vp ₁ , _vp ₂ , . precision.

As shown in FIG. 3, to implement the visual phrase construction method provided by the present invention, the present invention also provides a visual phrase construction apparatus based on an image feature space and an image spatial-domain space. The apparatus includes a processor 21 and a memory 22 and, depending on actual needs, may further include communication components, sensor components, power supply components, multimedia components, and input/output interfaces. The memory, communication components, sensor components, power supply components, multimedia components, and input/output interfaces are all connected to the processor 21. The memory 22 may be static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or the like. The processor may be a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a digital signal processing (DSP) chip, or the like. The other communication, sensor, power supply, and multimedia components can all be implemented with common components found in existing smart terminals and are not described in detail here.

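In addition, in the above visual phrase construction apparatus based on an image feature space and an image spatial-domain space, the processor 21 reads the computer program in the memory 22 to perform the following operations:

Extract the visual words that meet the preset conditions in the target image to form a visual word set;

Select each key feature word in the target area in the target image from the visual word set;

For each key feature word, extract a neighborhood feature word that has a geometric relationship with the key feature word from the visual word set, and form a corresponding visual phrase with the key feature word;

Based on the formed visual phrases, a set of visual phrases describing the features of the target image is established.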

The visual phrase construction method and apparatus provided by the present invention combine the local feature space of the image with the image spatial-domain space to jointly construct visual phrases, which can greatly reduce the ambiguity of visual phrases during image matching and yields visual phrases with higher discrimination. At the same time, the present invention classifies and encodes each visual phrase based on the feature space attributes of its vertices and the relationships between the vertices. This code represents image features more accurately, which can greatly improve the accuracy of image retrieval, segmentation, and recognition.

The method and apparatus for constructing visual phrases based on an image feature space and a spatial-domain space provided by the present invention have been described in detail above. For those of ordinary skill in the art, any obvious change made to the present invention without departing from its essential content will constitute an infringement of the patent right of the present invention and will incur corresponding legal liability.

Claims

A method for constructing visual phrases based on an image feature space and a spatial-domain space, characterized in that it comprises the following steps:

Extract the visual words that meet the preset conditions in the target image to form a visual word set;

Select each key feature word in the target area in the target image from the visual word set;

For each key feature word, extract a neighborhood feature word that has a geometric relationship with the key feature word from the visual word set, and form a corresponding visual phrase with the key feature word;

Based on the constituted visual phrases, a set of visual phrases describing the features of the target image is established.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 1, wherein the extraction of visual words meeting preset conditions in the target image to form a visual word set specifically includes the following steps:

Quantize the local features of the target image into visual words;

According to the categories of visual words, the frequency of occurrence of various visual words is counted, and visual words with a frequency higher than a preset frequency are selected to form a visual word set.
The method for constructing visual phrases based on an image feature space and a spatial-domain space according to claim 1, wherein the extraction of the neighborhood feature words having a geometric relationship with the key feature words in the visual word set specifically includes the following steps:

Draw a circle with the position of the current key feature word in the target image as the center and a predetermined distance as the radius;

Find the neighborhood feature word corresponding to the current key feature word in the visual word set; the position of the neighborhood feature word in the target image must be within the drawn circle.
The method for constructing a visual phrase based on an image feature space and a spatial-domain space as claimed in claim 3, characterized in that forming a corresponding visual phrase with the key feature word specifically comprises the following steps:

Take the position of the current key feature word and the positions of any two corresponding neighborhood feature words as vertices to form a triangle;

After determining that the shortest side length of the triangle is greater than the preset side length and that the minimum angle of the triangle is greater than the preset angle, select the current key feature word corresponding to the triangle and the two corresponding neighborhood feature words as one visual phrase of the target image.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 1, characterized in that establishing, based on each formed visual phrase, a visual phrase set describing the features of the target image specifically comprises the following steps:

Classify each formed visual phrase;

Encode visual phrases of the same category;

According to the coding of each category of visual phrases, a set of visual phrases describing the characteristics of the target image is established.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 5, wherein classifying each of the formed visual phrases specifically includes the following steps:

Determine whether the positions of key feature words and neighborhood feature words in any two visual phrases in the target image can be aligned one-to-one;

If they can be aligned, the two visual phrases belong to the same type;

If they cannot be aligned, the two visual phrases are of different types.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 5, wherein the encoding of visual phrases of the same category specifically includes the following steps:

According to the position of the key feature word and the neighborhood feature word of the current category visual phrase in the target image, obtain the position information of the key feature word and the neighborhood feature word;

According to the position information, the corresponding key feature words or neighborhood feature words are encoded;

According to the encoding of the key feature words of the visual phrase of the current category and the encoding of the neighboring feature words, the encoding of the visual phrase of the current category is composed.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 5, characterized in that establishing, according to the coding of each category of visual phrases, a set of visual phrases describing the features of the target image specifically comprises the following steps:

Count the frequency of occurrence of various visual phrases;

Compose the codes of visual phrases whose frequency is higher than a predetermined frequency into a code set;

Let the set of codes be the set of visual phrases that describe the features of the target image.
The method for constructing visual phrases based on an image feature space and a spatial-domain space as claimed in claim 6, wherein determining whether the positions of key feature words and neighborhood feature words in any two visual phrases in the target image can be aligned one-to-one specifically includes the following steps:

Obtain the encoding of the visual word to which the position of each key feature word and neighborhood feature word in the target image belongs;

According to the coding of the visual word to which the position belongs, calculate the minimum position distance of two corresponding key feature words or neighborhood feature words belonging to different visual phrases in the target image;

If the calculated minimum position distance is equal to zero, it is determined that the positions of the two corresponding key feature words or neighborhood feature words in the target image are aligned.
An apparatus for constructing visual phrases based on an image feature space and a spatial-domain space, characterized by comprising a processor and a memory, wherein the processor reads a computer program in the memory to perform the following operations:

Extract the visual words that meet the preset conditions in the target image to form a visual word set;

Select each key feature word in the target area in the target image from the visual word set;

For each key feature word, extract a neighborhood feature word that has a geometric relationship with the key feature word from the visual word set, and form a corresponding visual phrase with the key feature word;

Based on the formed visual phrases, a set of visual phrases describing the features of the target image is established.