CN107886491A - An image synthesis method based on pixel nearest neighbors - Google Patents

An image synthesis method based on pixel nearest neighbors Download PDF

Info

Publication number
CN107886491A
CN107886491A (application CN201711206725.6A)
Authority
CN
China
Prior art keywords
pixel
image
output
training
nearest neighbors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711206725.6A
Other languages
Chinese (zh)
Inventor
夏春秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd filed Critical Shenzhen Vision Technology Co Ltd
Priority to CN201711206725.6A priority Critical patent/CN107886491A/en
Publication of CN107886491A publication Critical patent/CN107886491A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging


Abstract

This invention proposes an image synthesis method based on pixel nearest neighbors. Its main components are: synthesis with convolutional neural networks (CNNs), pixel correspondence, and pixel nearest neighbors (one-to-many mapping). The process is as follows: first, an initial regressor, a convolutional neural network (CNN), is trained to map an incomplete input to a single output image; a nearest-neighbor search is then performed on the pixels of this regressed output, matching pixels using multiscale descriptors that capture the appropriate level of context and efficiently indexing matched training examples; finally, the synthesized output is composed from the training set, generating dense pixel-level correspondences. The invention naturally generates multiple outputs while being able to interpret and obey user constraints, so that image search and matching are faster and the synthesized image is closer to the original image.

Description

An image synthesis method based on pixel nearest neighbors
Technical field
The present invention relates to the field of image synthesis, and in particular to an image synthesis method based on pixel nearest neighbors.
Background technology
With the spread of digital products in everyday life, digital images have become an increasingly important information carrier. Some images captured from natural scenes cannot satisfy people's subjective aesthetic requirements, or, for entertainment and other reasons, users wish to freely alter parts of a picture and artificially synthesize new, realistic images. Image synthesis techniques can be applied to virtual cartoon scene production, photo editing on mobile devices, and research and teaching on human micro-motion, micro-expression, and animation, among other fields. Combined with photo-editing technology, they can let users edit desired clothing on their own while shopping online, making it easier to find satisfactory items; image synthesis can also be used to predict environmental conditions in advance, providing convenience for maritime operations such as marine traffic control, fishing, and sailing regattas. However, existing methods cannot produce a large number of distinct outputs due to mode problems, and the synthesized output is also difficult to control; in practice there is additionally a lack of training data and of a clear distance metric, making it difficult to scale search to large training sets.
The present invention proposes an image synthesis method based on pixel nearest neighbors. It first trains an initial regressor, a convolutional neural network (CNN), that maps an incomplete input to a single output image; it then performs a nearest-neighbor search on the pixels of this regressed output, matching pixels with multiscale descriptors that capture the appropriate level of context and efficiently indexing matched training examples; finally, it composes the synthesized output from the training set, generating dense pixel-level correspondences. The invention naturally generates multiple outputs while being able to interpret and obey user constraints, so that image search and matching are faster and the synthesized image is closer to the original image.
Summary of the invention
In view of problems such as hard-to-control synthesis output, the object of the present invention is to provide an image synthesis method based on pixel nearest neighbors. An initial regressor, a convolutional neural network (CNN), is first trained to map an incomplete input to a single output image; a nearest-neighbor search is then performed on the pixels of this regressed output; pixels are matched using multiscale descriptors that capture the appropriate level of context, and training examples are efficiently indexed and matched; finally, the synthesized output is composed from the training set, generating dense pixel-level correspondences.
To solve the above problems, the present invention provides an image synthesis method based on pixel nearest neighbors, whose main components include:
(1) synthesis with convolutional neural networks (CNNs);
(2) pixel correspondence;
(3) pixel nearest neighbors: one-to-many mapping.
In the described image synthesis method, an initial regressor, a convolutional neural network (CNN), is first trained to map an incomplete input to a single output image; this output image is constrained to be a single output. A nearest-neighbor search is then performed on the pixels of this regressed output; pixels are matched (against the regression outputs of the training data) using multiscale descriptors that capture the appropriate level of context; training examples are efficiently indexed and matched; finally, the synthesized output is composed from the training set, generating dense pixel-level correspondences.
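The regress-then-match flow described above can be sketched as a minimal, illustrative routine. All names (`regress`, `dist`) are hypothetical stand-ins: `regress` plays the role of the trained CNN regressor, `dist` the similarity used for the nearest-neighbor search, and only the single best match is composed here.

```python
import numpy as np

def nearest_neighbor_synthesis(x, train_inputs, train_outputs, regress, dist):
    """Regress-then-match sketch: map the input to a single (blurry) output
    with the regressor, find the training example whose regressed output is
    closest, and add back that example's high-frequency residual."""
    f_x = regress(x)                                  # initial CNN-style regression
    f_train = [regress(xn) for xn in train_inputs]    # regressed training outputs
    k = int(np.argmin([dist(f_x, fn) for fn in f_train]))
    return f_x + (train_outputs[k] - f_train[k])      # copy high-frequency detail
```

With a toy linear "regressor" and squared-error distance, the routine copies the residual of the best-matching training pair onto the regressed output.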
In the described synthesis with convolutional neural networks (CNNs), CNNs are applied to segmentation, depth and surface normal estimation, semantic boundary detection, and so on. These networks are usually trained with standard losses (e.g., softmax or l2 regression) on image-label data pairs. However, such networks generally cannot handle well the inverse problem of synthesizing images from (incomplete) labels. One major innovation has been the introduction of adversarially trained generative networks (GANs). This formulation has had a great influence in computer vision and has been applied to various image generation tasks, processing low-resolution images, segmentation masks, surface normal maps, and other inputs.
In the described pixel correspondence, an important by-product of pixel-wise nearest neighbors is the generation of pixel correspondences between the synthesized output and the training examples. Establishing semantic correspondences between the pixels of a query and of training images allows high-frequency information to be extracted from training samples, synthesizing a new image from a given input.
In the described pixel nearest neighbors: one-to-many mapping, the conditional image synthesis problem is defined as follows: given a conditioned input x (e.g., an edge map, a normal/depth map, or a low-resolution image), synthesize a high-quality output image. Assume input/output training pairs, denoted (x_n, y_n). The simplest approach treats this task as a (nonlinear) regression problem:
min_ω ‖ω‖² + Σ_n ‖y_n − f(x_n; ω)‖_{l2}    (1)
where f(x_n; ω) denotes the output of an arbitrary (possibly nonlinear) regressor parameterized by ω. A fully convolutional neural network, in particular a pixel network, is used as the nonlinear regressor in this formulation. Pixel nearest neighbors comprise frequency analysis, example matching, compositional matching, pixel representation, and efficient search.
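The regression objective of equation (1) can be written out directly. The snippet below is only a numerical illustration, with `f` a hypothetical placeholder for any parameterized regressor:

```python
import numpy as np

def regression_objective(w, pairs, f):
    """Eq. (1): ||w||^2 + sum_n ||y_n - f(x_n; w)||_{l2}, the regularized
    l2 regression loss minimized when training the initial regressor."""
    reg = float(np.sum(np.asarray(w) ** 2))          # ||w||^2 weight penalty
    data = sum(float(np.linalg.norm(np.asarray(yn) - f(xn, w)))
               for xn, yn in pairs)                  # sum of l2 residuals
    return reg + data
```

For a one-parameter linear regressor this reduces to ridge-style regression over the training pairs.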
Further, in the described frequency analysis, the predicted output f(x) is analyzed directly in the super-resolution setting, where the conditioned input x is a low-resolution image. Given a low-resolution image of a face, there may exist multiple textures (e.g., wrinkles) or subtle shape cues (e.g., local features of the nose) that can serve as generated outputs. In practice, this set of outputs is often "blurred" by regression into a single output; this can be seen clearly in a frequency analysis of the input, the output, and the original target image. The assumption is that a single output suffices for the medium frequencies, but multiple outputs are needed to capture the space of possible high-frequency textures.
Further, in the described example matching, in order to capture multiple possible outputs, a classical nonparametric method from computer vision is used. A simple K-nearest-neighbor (KNN) algorithm can return K outputs. However, rather than returning the entire image with a KNN model, one can use it to predict the (multiple possible) high-frequency images that f(x) has lost:
Global(x) = f(x) + (y_k − f(x_k))    (2)
where k = argmin_k Dist(f(x), f(x_k)), and Dist is a distance function measuring the similarity between two (medium-frequency) reconstructions. To generate multiple outputs, the K best matches from the training set can be reported rather than the single overall best match.
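Equation (2) with the K best matches can be sketched as follows. Here `f_train` and `y_train` are assumed to hold the regressed and ground-truth outputs of the training examples as arrays, and squared error stands in for the Dist function:

```python
import numpy as np

def global_matches(f_x, f_train, y_train, K=3):
    """Eq. (2): Global(x) = f(x) + (y_k - f(x_k)), with k chosen by a
    distance over (medium-frequency) reconstructions; the K best matches
    are reported to obtain multiple candidate outputs."""
    d = np.array([np.sum((f_x - fn) ** 2) for fn in f_train])  # Dist(f(x), f(x_k))
    ks = np.argsort(d)[:K]                                     # K nearest examples
    return [f_x + (y_train[k] - f_train[k]) for k in ks]
```

Each candidate adds a different training example's high-frequency residual to the same regressed output.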
Further, in the described compositional matching, multiple outputs are synthesized by copying and pasting (high-frequency) patches from the training set. To allow such compositional matching, i.e., simply matching individual pixels rather than the global image, write f_i(x) for the i-th pixel of the reconstructed image; the final synthesized output can then be written as:
Comp_i(x) = f_i(x) + (y_{jk} − f_j(x_k))    (3)
where (j, k) = argmin_{j,k} Dist(f_i(x), f_j(x_k)) and y_{jk} denotes output pixel j in training example k.
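A minimal 1-D sketch of the compositional matching of equation (3): each output pixel independently borrows the high-frequency residual of its best-matching training pixel. Absolute pixel difference stands in for the descriptor distance, so this only illustrates the bookkeeping, not the multiscale matching itself:

```python
import numpy as np

def compositional_output(f_x, f_train, y_train):
    """Eq. (3): build the output pixel by pixel. For each pixel i of the
    regressed output f(x), find the closest pixel j in some training
    example k and copy over that pixel's high-frequency residual.
    Images are flattened to 1-D pixel arrays for simplicity."""
    out = np.empty_like(f_x)
    for i in range(f_x.shape[0]):
        best_d, best_j, best_k = np.inf, 0, 0
        for k, fk in enumerate(f_train):
            j = int(np.argmin(np.abs(fk - f_x[i])))   # Dist(f_i(x), f_j(x_k))
            d = float(abs(fk[j] - f_x[i]))
            if d < best_d:
                best_d, best_j, best_k = d, j, k
        out[i] = f_x[i] + (y_train[best_k][best_j] - f_train[best_k][best_j])
    return out
```

Because each pixel is matched independently, different regions of the output can borrow detail from different training examples.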
Further, in the described pixel representation, if the distance function considers only global information, compositional matching reduces to global (example) matching. Conversely, different layers of a deep network tend to capture different amounts of spatial context (due to their different receptive fields). The descriptor aggregates this information across many layers into a single high-accuracy multiscale pixel representation. A pixel descriptor is constructed using features from conv-{1_2, 2_2, 3_3, 4_3, 5_3} of a pixel network model trained for semantic segmentation. To evaluate pixel similarity, the cosine distance between two descriptors is computed.
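The multiscale descriptor and its cosine distance can be sketched as below, assuming (hypothetically) that the per-layer feature maps have already been upsampled to a common H × W resolution; the layer list follows the conv-{1_2, ..., 5_3} convention above:

```python
import numpy as np

def pixel_descriptor(feature_maps, i, j):
    """Concatenate, per pixel, the channel activations from several conv
    layers (e.g. conv-{1_2, 2_2, 3_3, 4_3, 5_3}), assumed upsampled to a
    common spatial resolution, into one multiscale descriptor."""
    return np.concatenate([fm[i, j] for fm in feature_maps])

def cosine_distance(a, b):
    """Distance used to compare two pixel descriptors: 1 - cos(a, b)."""
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Identical descriptors have distance 0; orthogonal descriptors have distance 1.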
Further, in the described efficient search, given the reconstructed image f(x), the global K nearest neighbors are first found using conv-5 features, and pixel-level matches are then searched only within a T × T pixel window around pixel i in this set of K images. In practice, K is varied over {1, 2, ..., 10} and T over {1, 3, 5, 10, 96}, generating 72 candidate outputs for a given input. Since the synthesized image is 96 × 96, the search parameters include full compositional output (K=10, T=96) and global example matching (K=1, T=1), both reported as candidates.
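The (K, T) parameter sweep can be enumerated as below. Note that the grid as written yields 10 × 5 = 50 pairs, while the text reports 72 candidate outputs, so the exact candidate construction presumably differs; the snippet only illustrates the sweep, with both extremes (global matching and full composition) included:

```python
from itertools import product

def candidate_search_params(Ks=range(1, 11), Ts=(1, 3, 5, 10, 96)):
    """Enumerate (K, T) settings: K global nearest-neighbor images, then a
    T x T pixel window around each pixel for pixel-level matching. The
    extremes are pure global matching (K=1, T=1) and full compositional
    output (K=10, T=96)."""
    return list(product(Ks, Ts))
```

Each (K, T) pair yields a differently composed candidate output for the same input.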
Brief description of the drawings
Fig. 1 is the system framework diagram of the image synthesis method based on pixel nearest neighbors of the present invention.
Fig. 2 shows the frequency analysis of the image synthesis method based on pixel nearest neighbors of the present invention.
Fig. 3 shows the pixel representation of the image synthesis method based on pixel nearest neighbors of the present invention.
Fig. 4 shows the efficient search of the image synthesis method based on pixel nearest neighbors of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is the system framework diagram of the image synthesis method based on pixel nearest neighbors of the present invention. It mainly comprises synthesis with convolutional neural networks (CNNs), pixel correspondence, and pixel nearest neighbors: one-to-many mapping.
In the image synthesis method, an initial regressor, a convolutional neural network (CNN), is first trained to map an incomplete input to a single output image; this output image is constrained to be a single output. A nearest-neighbor search is then performed on the pixels of the regressed output; pixels are matched (against the regression outputs of the training data) using multiscale descriptors that capture the appropriate level of context; training examples are efficiently indexed and matched; finally, the synthesized output is composed from the training set, generating dense pixel-level correspondences.
Synthesis with convolutional neural networks (CNNs): CNNs are applied to segmentation, depth and surface normal estimation, semantic boundary detection, and so on. These networks are usually trained with standard losses (e.g., softmax or l2 regression) on image-label data pairs. However, such networks generally cannot handle well the inverse problem of synthesizing images from (incomplete) labels. One major innovation has been the introduction of adversarially trained generative networks (GANs). This formulation has had a great influence in computer vision and has been applied to various image generation tasks, processing low-resolution images, segmentation masks, surface normal maps, and other inputs.
Pixel correspondence: an important by-product of pixel-wise nearest neighbors is the generation of pixel correspondences between the synthesized output and the training examples. Establishing semantic correspondences between the pixels of a query and of training images allows high-frequency information to be extracted from training samples, synthesizing a new image from a given input.
Pixel nearest neighbors: one-to-many mapping. The conditional image synthesis problem is defined as follows: given a conditioned input x (e.g., an edge map, a normal/depth map, or a low-resolution image), synthesize a high-quality output image. Assume input/output training pairs, denoted (x_n, y_n). The simplest approach treats this task as a (nonlinear) regression problem:
min_ω ‖ω‖² + Σ_n ‖y_n − f(x_n; ω)‖_{l2}    (1)
where f(x_n; ω) denotes the output of an arbitrary (possibly nonlinear) regressor parameterized by ω. A fully convolutional neural network, in particular a pixel network, is used as the nonlinear regressor in this formulation. Pixel nearest neighbors comprise frequency analysis, example matching, compositional matching, pixel representation, and efficient search.
Example matching: in order to capture multiple possible outputs, a classical nonparametric method from computer vision is used. A simple K-nearest-neighbor (KNN) algorithm can return K outputs. However, rather than returning the entire image with a KNN model, one can use it to predict the (multiple possible) high-frequency images that f(x) has lost:
Global(x) = f(x) + (y_k − f(x_k))    (2)
where k = argmin_k Dist(f(x), f(x_k)), and Dist is a distance function measuring the similarity between two (medium-frequency) reconstructions. To generate multiple outputs, the K best matches from the training set can be reported rather than the single overall best match.
Compositional matching: multiple outputs are synthesized by copying and pasting (high-frequency) patches from the training set. To allow such compositional matching, i.e., simply matching individual pixels rather than the global image, write f_i(x) for the i-th pixel of the reconstructed image; the final synthesized output can then be written as:
Comp_i(x) = f_i(x) + (y_{jk} − f_j(x_k))    (3)
where (j, k) = argmin_{j,k} Dist(f_i(x), f_j(x_k)) and y_{jk} denotes output pixel j in training example k.
Fig. 2 shows the frequency analysis of the image synthesis method based on pixel nearest neighbors of the present invention. The predicted output f(x) is analyzed directly in the super-resolution setting, where the conditioned input x is a low-resolution image. Given a low-resolution image of a face, there may exist multiple textures (e.g., wrinkles) or subtle shape cues (e.g., local features of the nose) that can serve as generated outputs. In practice, this set of outputs is often "blurred" by regression into a single output; this can be seen clearly in a frequency analysis of the input, the output, and the original target image. The assumption is that a single output suffices for the medium frequencies, but multiple outputs are needed to capture the space of possible high-frequency textures.
Fig. 3 shows the pixel representation of the image synthesis method based on pixel nearest neighbors of the present invention, illustrating the outputs for various input modalities. If the distance function considers only global information, compositional matching reduces to global (example) matching. Conversely, different layers of a deep network tend to capture different amounts of spatial context (due to their different receptive fields). The descriptor aggregates this information across many layers into a single high-accuracy multiscale pixel representation. A pixel descriptor is constructed using features from conv-{1_2, 2_2, 3_3, 4_3, 5_3} of a pixel network model trained for semantic segmentation. To evaluate pixel similarity, the cosine distance between two descriptors is computed.
Fig. 4 shows the efficient search of the image synthesis method based on pixel nearest neighbors of the present invention, with examples of multiple outputs generated by simply varying the search parameters. Given the reconstructed image f(x), the global K nearest neighbors are first found using conv-5 features, and pixel-level matches are then searched only within a T × T pixel window around pixel i in this set of K images. In practice, K is varied over {1, 2, ..., 10} and T over {1, 3, 5, 10, 96}, generating 72 candidate outputs for a given input. Since the synthesized image is 96 × 96, the search parameters include full compositional output (K=10, T=96) and global example matching (K=1, T=1), both reported as candidates.
For those skilled in the art, the present invention is not limited to the details of the above exemplary embodiments, and the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. In addition, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Claims (10)

1. An image synthesis method based on pixel nearest neighbors, characterized in that it mainly comprises synthesis with convolutional neural networks (CNNs) (one); pixel correspondence (two); and pixel nearest neighbors: one-to-many mapping (three).
2. The image synthesis method according to claim 1, characterized in that an initial regressor, a convolutional neural network (CNN), is first trained to map an incomplete input to a single output image; this output image is constrained to be a single output; a nearest-neighbor search is then performed on the pixels of the regressed output; pixels are matched (against the regression outputs of the training data) using multiscale descriptors that capture the appropriate level of context; training examples are efficiently indexed and matched; and finally the synthesized output is composed from the training set, generating dense pixel-level correspondences.
3. The synthesis with convolutional neural networks (CNNs) (one) according to claim 1, characterized in that CNNs are applied to segmentation, depth and surface normal estimation, semantic boundary detection, and so on; these networks are usually trained with standard losses (e.g., softmax or l2 regression) on image-label data pairs; however, such networks generally cannot handle well the inverse problem of synthesizing images from (incomplete) labels; one major innovation has been the introduction of adversarially trained generative networks (GANs); this formulation has had a great influence in computer vision and has been applied to various image generation tasks, processing low-resolution images, segmentation masks, surface normal maps, and other inputs.
4. The pixel correspondence (two) according to claim 1, characterized in that an important by-product of pixel-wise nearest neighbors is the generation of pixel correspondences between the synthesized output and the training examples; establishing semantic correspondences between the pixels of a query and of training images allows high-frequency information to be extracted from training samples, synthesizing a new image from a given input.
5. The pixel nearest neighbors: one-to-many mapping (three) according to claim 1, characterized in that the conditional image synthesis problem is defined as follows: given a conditioned input x (e.g., an edge map, a normal/depth map, or a low-resolution image), synthesize a high-quality output image; assume input/output training pairs, denoted (x_n, y_n); the simplest approach treats this task as a (nonlinear) regression problem:
min_ω ‖ω‖² + Σ_n ‖y_n − f(x_n; ω)‖_{l2}    (1)
where f(x_n; ω) denotes the output of an arbitrary (possibly nonlinear) regressor parameterized by ω; a fully convolutional neural network, in particular a pixel network, is used as the nonlinear regressor in this formulation; pixel nearest neighbors comprise frequency analysis, example matching, compositional matching, pixel representation, and efficient search.
6. The frequency analysis according to claim 5, characterized in that the predicted output f(x) is analyzed directly in the super-resolution setting, where the conditioned input x is a low-resolution image; given a low-resolution image of a face, there may exist multiple textures (e.g., wrinkles) or subtle shape cues (e.g., local features of the nose) that can serve as generated outputs; in practice, this set of outputs is often "blurred" by regression into a single output; this can be seen clearly in a frequency analysis of the input, the output, and the original target image; the assumption is that a single output suffices for the medium frequencies, but multiple outputs are needed to capture the space of possible high-frequency textures.
7. The example matching according to claim 5, characterized in that, in order to capture multiple possible outputs, a classical nonparametric method from computer vision is used; a simple K-nearest-neighbor (KNN) algorithm can return K outputs; however, rather than returning the entire image with a KNN model, one can use it to predict the (multiple possible) high-frequency images that f(x) has lost:
Global(x) = f(x) + (y_k − f(x_k))    (2)
where k = argmin_k Dist(f(x), f(x_k)), and Dist is a distance function measuring the similarity between two (medium-frequency) reconstructions; to generate multiple outputs, the K best matches from the training set can be reported rather than the single overall best match.
8. The compositional matching according to claim 5, characterized in that multiple outputs are synthesized by copying and pasting (high-frequency) patches from the training set; to allow such compositional matching, i.e., simply matching individual pixels rather than the global image, write f_i(x) for the i-th pixel of the reconstructed image; the final synthesized output can then be written as:
Comp_i(x) = f_i(x) + (y_{jk} − f_j(x_k))    (3)
where (j, k) = argmin_{j,k} Dist(f_i(x), f_j(x_k)) and y_{jk} denotes output pixel j in training example k.
9. The pixel representation according to claim 5, characterized in that, if the distance function considers only global information, compositional matching reduces to global (example) matching; conversely, different layers of a deep network tend to capture different amounts of spatial context (due to their different receptive fields); the descriptor aggregates this information across many layers into a single high-accuracy multiscale pixel representation; a pixel descriptor is constructed using features from conv-{1_2, 2_2, 3_3, 4_3, 5_3} of a pixel network model trained for semantic segmentation; to evaluate pixel similarity, the cosine distance between two descriptors is computed.
10. The efficient search according to claim 5, characterized in that, given the reconstructed image f(x), the global K nearest neighbors are first found using conv-5 features, and pixel-level matches are then searched only within a T × T pixel window around pixel i in this set of K images; in practice, K is varied over {1, 2, ..., 10} and T over {1, 3, 5, 10, 96}, generating 72 candidate outputs for a given input; since the synthesized image is 96 × 96, the search parameters include full compositional output (K=10, T=96) and global example matching (K=1, T=1), both reported as candidates.
CN201711206725.6A 2017-11-27 2017-11-27 An image synthesis method based on pixel nearest neighbors Withdrawn CN107886491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711206725.6A CN107886491A (en) 2017-11-27 2017-11-27 An image synthesis method based on pixel nearest neighbors


Publications (1)

Publication Number Publication Date
CN107886491A true CN107886491A (en) 2018-04-06

Family

ID=61775376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711206725.6A Withdrawn CN107886491A (en) An image synthesis method based on pixel nearest neighbors

Country Status (1)

Country Link
CN (1) CN107886491A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231281A (en) * 2011-07-18 2011-11-02 渤海大学 Voice visualization method based on integration characteristic and neural network
WO2017021322A1 (en) * 2015-07-31 2017-02-09 Eberhard Karls Universität Tübingen Method and device for image synthesis
CN106778928A (en) * 2016-12-21 2017-05-31 广州华多网络科技有限公司 Image processing method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Aayush Bansal et al.: "PixelNN: Example-based Image Synthesis", arXiv:1708.05349v1 [cs.CV] *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190684A (en) * 2018-08-15 2019-01-11 Xidian University SAR image sample generating method based on sketch and structural generation confrontation network
CN109190684B (en) * 2018-08-15 2022-03-04 Xidian University SAR image sample generation method based on sketch and structure generation countermeasure network
CN109361934A (en) * 2018-11-30 2019-02-19 Tencent Technology (Shenzhen) Company Limited Image processing method, device, equipment and storage medium
CN109361934B (en) * 2018-11-30 2021-10-08 Tencent Technology (Shenzhen) Company Limited Image processing method, device, equipment and storage medium
US11798145B2 (en) 2018-11-30 2023-10-24 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus, device, and storage medium
CN111798935A (en) * 2019-04-09 2020-10-20 PharmaBlock Sciences (Nanjing), Inc. Universal compound structure-property correlation prediction method based on neural network
CN112365533A (en) * 2020-10-15 2021-02-12 Zhejiang Dahua Technology Co., Ltd. Coal flow monitoring method and device based on image segmentation and electronic device
CN113627341A (en) * 2021-08-11 2021-11-09 Renmin Zhongke (Jinan) Intelligent Technology Co., Ltd. Method, system, equipment and storage medium for comparing video samples
CN113627341B (en) * 2021-08-11 2024-04-12 Renmin Zhongke (Jinan) Intelligent Technology Co., Ltd. Video sample comparison method, system, equipment and storage medium
CN115359261A (en) * 2022-10-21 2022-11-18 Alibaba (China) Co., Ltd. Image recognition method, computer-readable storage medium, and electronic device

Similar Documents

Publication Publication Date Title
Li et al. Deep attention-based classification network for robust depth prediction
CN107886491A (en) A kind of image combining method based on pixel nearest neighbors
Wang et al. Towards unified depth and semantic prediction from a single image
Eslami et al. Attend, infer, repeat: Fast scene understanding with generative models
Pfister et al. Deep convolutional neural networks for efficient pose estimation in gesture videos
CN109919209B (en) Domain self-adaptive deep learning method and readable storage medium
Kondapally et al. Towards a Transitional Weather Scene Recognition Approach for Autonomous Vehicles
Zhang et al. Spike transformer: Monocular depth estimation for spiking camera
Bešić et al. Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
Chen et al. Unpaired deep image dehazing using contrastive disentanglement learning
Adate et al. A survey on deep learning methodologies of recent applications
Hua et al. Depth estimation with convolutional conditional random field network
Cheng et al. S3Net: Semantic-aware self-supervised depth estimation with monocular videos and synthetic data
CN111742345A (en) Visual tracking by coloring
Logacheva et al. Deeplandscape: Adversarial modeling of landscape videos
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
Choo et al. Learning background subtraction by video synthesis and multi-scale recurrent networks
Ben‐Zvi et al. Line‐drawing video stylization
Zhu et al. To see in the dark: N2DGAN for background modeling in nighttime scene
Wang et al. A coarse-to-fine approach for dynamic-to-static image translation
Zhong et al. Background subtraction driven seeds selection for moving objects segmentation and matting
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
Uddin et al. Depth guided attention for person re-identification
Saif et al. Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition
Metri et al. Image generation using generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 2018-04-06