CN107564063B - Virtual object display method and device based on convolutional neural network

Info

Publication number
CN107564063B
Authority
CN
China
Prior art keywords
picture, rectangular, frame, image, homography
Legal status
Active
Application number
CN201710765514.XA
Other languages
Chinese (zh)
Other versions
CN107564063A
Inventor
庄晓滨
周俊明
戴长军
Current Assignee
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN201710765514.XA
Publication of CN107564063A
Application granted
Publication of CN107564063B
Status: Active


Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a virtual object display method and device based on a convolutional neural network. The method comprises: acquiring each frame picture shot by a camera at the current moment; for any frame picture among the frame pictures, inputting the frame picture and a target picture into a pre-established convolutional neural network model and outputting the deviation between the four vertex coordinates of the frame picture and the target picture; and displaying a virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture are less than a threshold value. By this method, it can be effectively determined whether the image of the camera at the current moment has reached the designated position.

Description

Virtual object display method and device based on convolutional neural network
Technical Field
The application relates to the technical field of computers, in particular to a virtual object display method and device based on a convolutional neural network.
Background
In the field of computer vision, any two images containing the same object are related by a homography, and determining the homography matrix of the two images has wide application in people's daily life, for example in image rectification, image alignment, and camera anti-shake.
At present, under different camera poses, the content of an image generated by the same object is different, but there still exist locally corresponding pixels, and the locally corresponding pixels can be used to determine a homography matrix corresponding to any two images containing the same object.
Specifically, in the prior art, the 128 × 128 image data required for the experiments is generated mainly from pictures in the MS-COCO dataset; part of the parameters of a VGG-style convolutional neural network are trained using the relative offsets of the four pairs of corresponding vertices of two images (eight horizontal and vertical coordinates) as labels; the trained VGG-style network can then determine the homography matrix corresponding to two images containing the same object.
However, in the prior art, the changes occurring inside the image when the data is generated, including brightness changes and common internal disturbances, are not fully considered, so the accuracy with which the VGG-style network determines the homography matrix corresponding to two images containing the same object is low.
Disclosure of Invention
The embodiment of the application provides a virtual object display method and device based on a convolutional neural network, which can effectively determine whether the image of a camera at the current moment has reached a specified position.
The virtual object display method based on the convolutional neural network comprises the following steps:
acquiring each frame of picture shot by a camera at the current moment;
inputting, for any frame picture among the frame pictures, the frame picture and a target picture into a pre-established convolutional neural network model, and outputting the deviation between the four vertex coordinates of the frame picture and the target picture;
and when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is less than a threshold value, displaying the virtual object.
Preferably, before inputting the frame picture and the target picture into the pre-established convolutional neural network model, the method further comprises:
making a training image set, wherein the training image set comprises at least one pair of rectangular images with homography correspondence, initializing each weight parameter in a convolutional neural network model to be trained, inputting the at least one pair of rectangular images with homography correspondence into the convolutional neural network model to be trained, and training each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of the at least one pair of rectangular images with homography correspondence output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images with homography correspondence to obtain the convolutional neural network model.
Preferably, the at least one pair of rectangular images with the homography correspondence are both grayscale images, and/or the at least one pair of rectangular images with the homography correspondence include the center points of the images and have the same size.
Preferably, the method comprises: perturbing at least one of the brightness, blur, noise and sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence.
Preferably, the kernel size of the last pooling layer in the convolutional neural network model is 4x4, and the number of channels of the convolutional kernel of the convolutional layer is 64.
Preferably, the rectangular images in the training image set with the homography correspondence are input into the convolutional neural network model to be trained according to a random gradient descent method, and a loss function is constructed according to the deviation of the vertex coordinates of the rectangular images in the training image set with the homography correspondence output by the convolutional neural network model to be trained and the difference between the vertex coordinates of the rectangular images in the training image set with the homography correspondence until the loss function conforms to a preset model precision value.
Preferably, the manner of disturbing the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, generating a random number r, and determining a new gray value of each pixel point in the rectangular image from the generated random number r through the formula P′ = P × (1.0 + r), where P′ represents the new gray value, P represents the original gray value, and r represents the random number; the manner of disturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, generating a random number a, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius; the manner of disturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, generating a density random number and an intensity random number, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number; the manner of disturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, randomly selecting two sub-images with different positions and the same size in the image, and exchanging all pixels in the two sub-images.
Preferably, the descent strategy used in the random gradient descent method is:
lr = base_lr × (1 − iter / max_iter)^power
wherein lr is the current learning rate, iter is the current iteration number, max _ iter is the maximum iteration number, power is a parameter for controlling the descending speed of the learning rate, and base _ lr is the basic learning rate; and/or, the model accuracy is calculated according to the following formula:
accuracy = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, where M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of a pair of rectangular images i.
Preferably, the virtual object is displayed when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula

(1/m) Σ_{i=1}^{m} 1(‖V_i‖ < T) ≥ S,

where m refers to the number of frame pictures acquired at the current moment, 1(·) is an indicator function that is 1 when ‖V_i‖ < T and 0 otherwise, T is a preset first threshold, S is a preset second threshold, and V_i is the deviation between the four vertex coordinates of the i-th frame picture and the target picture.
Preferably, the virtual object is displayed when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula ‖V‖ < S, where ‖·‖ is a 0-norm, 1-norm, or 2-norm distance, S is a preset third threshold, and V is the deviation matrix formed by the deviations between the four vertex coordinates of each frame picture and the target picture.
The application further provides a virtual object display device based on a convolutional neural network, comprising:
the acquisition module is used for acquiring each frame of picture shot by the camera at the current moment;
the output module is used for inputting the frame picture and the target picture into a pre-established convolutional neural network model aiming at any frame picture in each frame picture and outputting the deviation between the coordinates of four vertexes of the frame picture and the target picture;
and the display module is used for displaying the virtual object when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is less than a threshold value.
Preferably, the apparatus further comprises:
the model training module is used for making a training image set before the frame picture and the target picture are input into a pre-established convolutional neural network model by the input module, wherein the training image set comprises at least one pair of rectangular images with a homography corresponding relation, initializing each weight parameter in the convolutional neural network model to be trained, inputting the at least one pair of rectangular images with the homography corresponding relation into the convolutional neural network model to be trained, and training each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation to obtain the convolutional neural network model.
Preferably, the at least one pair of rectangular images with the homography correspondence are both grayscale images, and/or the at least one pair of rectangular images with the homography correspondence include the center points of the images and have the same size.
Preferably, the apparatus further comprises:
and the perturbation module is used for perturbing at least one of the brightness, the fuzziness, the noise and the sub-image position of one rectangular image in the at least one pair of rectangular images with the homography correspondence.
Preferably, the kernel size of the last pooling layer in the convolutional neural network model is 4x4, and the number of channels of the convolutional kernel of the convolutional layer is 64.
Preferably, the model training module is further configured to input the rectangular images in the training image set, which have the homography correspondence, into the convolutional neural network model to be trained according to a random gradient descent method, and construct the loss function according to a deviation of vertex coordinates of the rectangular images in the training image set, which have the homography correspondence, output by the convolutional neural network model to be trained, and a difference between the vertex coordinates of the rectangular images in the training image set, which have the homography correspondence, until the loss function conforms to a preset model precision value.
Preferably, the perturbation module is specifically configured to perturb the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence as follows: for a rectangular image to be disturbed, generating a random number r, and determining a new gray value of each pixel point in the rectangular image from the generated random number r through the formula P′ = P × (1.0 + r), where P′ represents the new gray value, P represents the original gray value, and r represents the random number; the manner of disturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, generating a random number a, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius; the manner of disturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, generating a density random number and an intensity random number, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number; the manner of disturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is as follows: for a rectangular image to be disturbed, randomly selecting two sub-images with different positions and the same size in the image, and exchanging all pixels in the two sub-images.
Preferably, the descent strategy used in the random gradient descent method is:
lr = base_lr × (1 − iter / max_iter)^power
wherein lr is the current learning rate, iter is the current iteration number, max _ iter is the maximum iteration number, power is a parameter for controlling the descending speed of the learning rate, and base _ lr is the basic learning rate; and/or, the model accuracy is calculated according to the following formula:
accuracy = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, where M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of a pair of rectangular images i.
Preferably, the display module is specifically configured to display the virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula

(1/m) Σ_{i=1}^{m} 1(‖V_i‖ < T) ≥ S,

where m refers to the number of frame pictures acquired at the current moment, 1(·) is an indicator function that is 1 when ‖V_i‖ < T and 0 otherwise, T is a preset first threshold, S is a preset second threshold, and V_i is the deviation between the four vertex coordinates of the i-th frame picture and the target picture.
Preferably, the display module is specifically configured to display the virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula ‖V‖ < S, where ‖·‖ is a 0-norm, 1-norm, or 2-norm distance, S is a preset third threshold, and V is the deviation matrix formed by the deviations between the four vertex coordinates of each frame picture and the target picture.
The embodiment of the application provides a virtual object display method and device based on a convolutional neural network. The method comprises: acquiring each frame picture shot by a camera at the current moment; for any frame picture among the frame pictures, inputting the frame picture and a target picture into a pre-established convolutional neural network model and outputting the deviation between the four vertex coordinates of the frame picture and the target picture; and displaying a virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture are less than a threshold value. By this method, it can be effectively determined whether the image of the camera at the current moment has reached the designated position.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of a process of determining a homography matrix based on a convolutional neural network according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating an embodiment of building a convolutional neural network model according to the present disclosure;
fig. 3A is a schematic diagram of a model structure of a convolutional neural network model to be trained according to an embodiment of the present application;
FIG. 3B is a schematic structural diagram of an Inception module according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an embodiment of a method for creating a training image set according to the present disclosure;
FIG. 5 is a schematic diagram of a rectangular image before and after its sub-image positions are perturbed according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a track of each frame of corrected pictures shot by a camera according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a video anti-shaking process based on a convolutional neural network according to an embodiment of the present application;
fig. 8 is a schematic view of a picture before and after correction taken by a camera according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a virtual object display device based on a convolutional neural network according to an embodiment of the present application;
fig. 10 is a block diagram of a composition structure of a virtual object display system based on a convolutional neural network according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a process for determining a homography matrix based on a convolutional neural network according to an embodiment of the present application, which specifically includes the following steps:
s101: and inputting a pair of rectangular images with homography correspondence into a pre-established convolutional neural network model.
In practical applications, determining the homography matrix of any two images containing the same object can be widely applied to people's actual life, such as image correction, image alignment, camera anti-shake, and the like.
In the process of determining the homography matrix of any two images containing the same object, firstly, a convolutional neural network model needs to be established, and then, the homography matrix of any two images containing the same object can be determined through the established convolutional neural network model.
Further, the present application provides a specific implementation of building a convolutional neural network model, which is specifically shown in fig. 2:
s201: and making a training image set, wherein the training image set comprises at least one pair of rectangular images with homography correspondence.
S202: and initializing each weight parameter in the convolutional neural network model to be trained.
S203: and inputting at least one pair of rectangular images with homography correspondence into a convolutional neural network model to be trained.
S204: and training each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of at least one pair of rectangular images with the homography corresponding relation output by the convolutional neural network model to be trained and the vertex coordinates of at least one pair of rectangular images with the homography corresponding relation to obtain the convolutional neural network model.
Here, a pair of rectangular images having a homography correspondence indicates that two images of the pair of rectangular images contain the same object. In addition, before the convolutional neural network model to be trained is trained, the number of convolutional layers of the convolutional neural network model to be trained, the number of convolutional kernels in the convolutional layers and the number of channels of the convolutional kernels are generally set and are not changed again in the training process, and the number of convolutional kernels in the convolutional layers of the convolutional neural network model to be trained and the number of channels of the convolutional kernels also determine the size and the shape of a pair of images input into the model, that is, the size and the shape of a pair of images input into the model are required to meet the model input requirement, so that the size and the shape of a pair of images with a homography correspondence included in the training image set are fixed.
In addition, it should be noted that the present application also provides a model structure for the convolutional neural network model to be trained. Specifically, as shown in fig. 3A, the model structure is composed of an input layer, convolutional layers, activation functions, pooling layers, and a fully connected layer, and may also include other custom layers for accelerating network training. The convolutional layers extract abstract image features; the more layers there are, the more abstract the features become, and higher-level semantic features can be learned. An activation function is a means of increasing network nonlinearity, and a ReLU activation function follows each convolutional layer by default. A pooling layer is a data down-sampling method that can improve the nonlinearity of the model and prevent overfitting; the present application uses two modes, Max Pooling and Avg Pooling, where Max Pooling takes the maximum value in the receptive field as the output of the pooling layer and Avg Pooling takes the average value over the receptive field as the output. In the present application, a pooling layer is denoted in the form W×H+S, where W represents the kernel width, H the kernel height, and S the stride. The fully connected layer plays the role of a classifier in the whole convolutional neural network and produces an 8-dimensional vector.
In addition, because a Local Response Normalization (LRN) layer smooths the current feature map in depth and has been shown to effectively improve accuracy in classification tasks, an LRN layer is also used in the established convolutional neural network model. Secondly, Inception modules are also used in the established convolutional neural network model; an Inception module can effectively increase the width of the network and improve its adaptability to scale, and the number of Inception modules in the model can be determined according to the actual situation, for example, 9. Fig. 3B shows a schematic structural diagram of the Inception module.
It should be noted that, in the convolutional neural network model to be trained, the data between layers is referred to as a feature map, which can be regarded as a three-dimensional matrix with width, height and depth. The size of a convolution kernel determines the size of the receptive field on the current feature map, the number of convolution kernels determines the depth of the next-layer feature map, and the stride determines the width and height of the next-layer feature map.
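To make the layer roles above concrete, the following is a minimal sketch in PyTorch of a homography-regression network that takes a stacked pair of 128 × 128 grayscale patches and outputs the 8-dimensional vertex-offset vector; it assumes a plain convolutional stack rather than the patent's LRN- and Inception-based structure, and all layer sizes other than the 64-channel convolutions and the 4 × 4 final pooling kernel mentioned above are illustrative:

```python
import torch
import torch.nn as nn

class HomographyNetSketch(nn.Module):
    """Simplified sketch: conv -> ReLU -> pool stages ending in an
    8-dimensional output (4 vertex offsets x 2 axes). The patent's
    actual model additionally uses LRN and Inception modules."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=3, padding=1),   # input: 2 stacked gray patches
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # 128 -> 64
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # 64 -> 32
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=4, stride=4),        # last pooling layer: 4x4 kernel
        )
        self.fc = nn.Linear(64 * 8 * 8, 8)  # fully connected layer -> 8-dim offsets

    def forward(self, pair):                # pair: (N, 2, 128, 128)
        x = self.features(pair)
        return self.fc(x.flatten(1))
```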
Further, since the training image set is required when training the convolutional neural network model to be trained, in the present application the training image set needs to be created before training. In practical applications, a pair of images that have a homography correspondence and whose size and shape meet the input requirements of the model is usually difficult to obtain directly, so in the present application the training image set can be created in the manner shown in fig. 4:
s401: an original image set is acquired.
S402: and zooming any original image in the original image set to a preset size.
S403: determining a first rectangular image on the original image according to the preset length and the preset width, and respectively recording first positions of four vertexes of the first rectangular image in the original image.
S404: and randomly disturbing the four vertexes of the first rectangular image, and recording second positions of the four vertexes after random disturbance.
S405: and solving a homography matrix of the first position and the second position according to the first position of the four vertexes and the second position of the four vertexes.
S406: and converting the original image through the homography matrix.
S407: and finding four vertex pixels of a quadrangle surrounded by the second position in the original image on the converted image, zooming the quadrangle surrounded by the four vertex pixels according to the preset length and the preset width, and taking the zoomed quadrangle as a second rectangular image.
It should be noted that the first rectangular image and the second rectangular image are a pair of rectangular images having a homography correspondence relationship.
For example, an original image set is obtained. For an original image X in the original image set (in this example only the original image X is used for illustration; the production process for the other original images is the same as for X): the original image X is scaled to 320 × 240 (i.e., the preset size); a first rectangular image A is determined on the original image X according to the preset length 128 and the preset width 128, and the first positions of the four vertices of the first rectangular image A in the original image X are recorded; eight random numbers n are generated, the four vertices of the first rectangular image are randomly disturbed, and the second positions of the four vertices after the random disturbance are recorded; the homography matrix H between the first positions and the second positions is solved from the first positions and the second positions of the four vertices; the original image X is converted through the homography matrix H to obtain an image Y; on the converted image (i.e., the image Y), the four vertex pixels of the quadrangle enclosed by the second positions are found, the quadrangle enclosed by these four vertex pixels is scaled according to the preset length 128 and the preset width 128, and the scaled quadrangle is taken as the second rectangular image B. The first rectangular image A and the second rectangular image B are a pair of rectangular images with a homography correspondence.
it should be noted that, according to the preset length 128 and the preset width 128, the first rectangular image a may be determined on the original image X by taking the center of the original image as the center point of the first rectangular image a and determining the four sides of the first rectangular image a according to the preset length 128 and the preset width 128, or by taking other points of the original image as the center point of the first rectangular image a and determining the four sides of the first rectangular image a according to the preset length 128 and the preset width 128, specifically by taking which point in the original image is the center point of the first rectangular image a, and determining the first rectangular image a may be determined according to actual situations. In addition, each vertex of the four vertices of the first rectangular image includes an abscissa and an ordinate, so eight random numbers n need to be generated, where the eight random numbers n may be the same, may have a part of different, and may also be different from each other, and subsequently, according to the eight random numbers n generated, the process of randomly perturbing the four vertices of the first rectangular image, and recording the second positions of the four vertices after being randomly perturbed specifically includes: assuming that the first position of the vertex 1 is (x, y), the random number corresponding to x in the first position of the vertex 1 is n1, and the random number corresponding to y is n2, the vertex 1 is perturbed according to the random numbers n1 and n2, and the second position (x + n1, y + n2) of the vertex 1 after random perturbation is recorded.
Further, in order to reduce the scale of the convolutional neural network model, in the present application, in the process of making the training image set, before scaling the original image to a preset size, the original image may be grayed, that is, the original image is converted into a grayed image, or after determining the second rectangular image, both the first rectangular image and the second rectangular image may be grayed.
Further, in order to improve algorithm robustness and adaptive capability, in the present application, after determining the first rectangular image and the second rectangular image, that is, after determining a pair of rectangular images having a homography correspondence, at least one of brightness, blur, noise and sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is disturbed.
Further, the present application provides a manner of perturbing the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence, specifically as follows:
and generating a random number r aiming at a rectangular image to be disturbed, and determining a new gray value of each pixel point in the rectangular image according to the generated random number r through a formula P '═ P x (1.0+ r), wherein P' represents the new gray value, P represents an original gray value, and r represents the random number.
It should be noted that, in practical applications, the random number r may lie in the interval [−0.1, 0.1].
Further, the present application provides a method for perturbing the blur degree of one rectangular image of the at least one pair of rectangular images having a homography correspondence, specifically as follows:
and generating a random number a aiming at a rectangular image to be disturbed, and carrying out Gaussian blurring on the rectangular image by taking the random number a as a blurring radius.
It should be noted that, in practical applications, the random number a may lie in the interval [1, 5].
Further, the present application provides a method for disturbing noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence, specifically as follows:
and generating a density random number and an intensity random number aiming at a rectangular image to be disturbed, and generating salt and pepper noise in the rectangular image according to the density random number and the intensity random number.
Further, the present application provides a manner of perturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence, specifically as follows:

For a rectangular image to be disturbed, two sub-images with different positions and the same size are randomly selected in the image, and all pixels in the two sub-images are exchanged. Specifically, as shown in fig. 5, the leftmost diagram is the first rectangular image, the middle diagram is the second rectangular image, and the rightmost diagram is the image obtained by perturbing the sub-image positions of the second rectangular image.
It should be noted that, when two or more types of disturbances are performed on one rectangular image of the at least one pair of rectangular images having a homography correspondence, the disturbance sequence may be determined according to actual situations, for example, luminance disturbance may be performed on one rectangular image of the at least one pair of rectangular images having a homography correspondence first, and then blur disturbance is performed, or blur disturbance may be performed on one rectangular image of the at least one pair of rectangular images having a homography correspondence first, and then luminance disturbance is performed.
In addition, the rectangular image after disturbance is used as the rectangular image in the final training image set.
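The four disturbances can be sketched as follows (a minimal illustration: the brightness and blur ranges follow the intervals stated above, while the noise parameters and the quarter-size sub-images are assumptions):

```python
import cv2
import numpy as np

def perturb(img):
    """Apply the four disturbances: brightness, Gaussian blur,
    salt-and-pepper noise, and a sub-image swap. img: grayscale uint8."""
    r = np.random.uniform(-0.1, 0.1)                  # brightness: P' = P * (1.0 + r)
    out = np.clip(img.astype(np.float32) * (1.0 + r), 0, 255).astype(np.uint8)
    a = np.random.randint(1, 6)                       # blur radius in [1, 5]
    k = 2 * a + 1                                     # odd Gaussian kernel size
    out = cv2.GaussianBlur(out, (k, k), 0)
    density = np.random.uniform(0.0, 0.02)            # salt-and-pepper noise (simplified)
    mask = np.random.rand(*out.shape) < density
    out[mask] = np.where(np.random.rand(mask.sum()) < 0.5, 0, 255)
    h, w = out.shape                                  # swap two same-size sub-images
    sh, sw = h // 4, w // 4
    y1, x1 = np.random.randint(0, h - sh), np.random.randint(0, w - sw)
    y2, x2 = np.random.randint(0, h - sh), np.random.randint(0, w - sw)
    tmp = out[y1:y1 + sh, x1:x1 + sw].copy()
    out[y1:y1 + sh, x1:x1 + sw] = out[y2:y2 + sh, x2:x2 + sw]
    out[y2:y2 + sh, x2:x2 + sw] = tmp
    return out
```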
Further, in order to reduce the size of the convolutional neural network model, the kernel size of the last pooling layer within the convolutional neural network model is set to 4x4, and the number of channels of the convolutional kernel of the convolutional layer is set to 64.
Further, in the present application, in the process of training each weight parameter in the convolutional neural network model to be trained, the rectangular images in the training image set having a homography correspondence may be input into the convolutional neural network model to be trained according to a stochastic gradient descent method, and a loss function is constructed from the difference between the vertex coordinate deviations output by the convolutional neural network model to be trained and the true vertex coordinate deviations of the rectangular images having a homography correspondence, until the loss function conforms to a preset model precision value.
It should be noted here that, in order to improve the accuracy of the model, the Euclidean distance may be used as the loss function in the present application; of course, in practical applications, other types of loss functions may also be used.
Further, in the embodiment of the present application, the descent strategy used in the random gradient descent method may be:
lr = base_lr × (1 − iter / max_iter)^power
wherein lr is the current learning rate, iter is the current iteration number, max _ iter is the maximum iteration number, power is a parameter for controlling the decreasing speed of the learning rate, and base _ lr is the basic learning rate.
It should be noted that, in practical application, the number of training samples participating in each gradient update is set to 64, the maximum number of iterations max_iter is set to 400,000, the parameter power controlling the speed of the learning-rate decay is set to 0.5, and the base learning rate base_lr is set to 0.001.
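With these settings, the descent strategy above reduces to a one-line schedule (a sketch; the default values mirror those stated in this paragraph):

```python
def poly_lr(iteration, base_lr=0.001, max_iter=400000, power=0.5):
    """Current learning rate under the poly decay strategy:
    lr = base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# e.g. poly_lr(0) == 0.001; poly_lr(300000) == 0.001 * 0.25 ** 0.5 == 0.0005
```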
In addition, the application also provides a model precision calculation mode, which is specifically calculated by the following formula:
accuracy = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, where M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of a pair of rectangular images i.
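A sketch of this accuracy computation, under our reading that each deviation is the 8-dimensional vertex-offset vector and the per-pair error s_i is measured by its Euclidean norm:

```python
import numpy as np

def model_accuracy(pred, true):
    """pred, true: (M, 8) arrays of predicted / ground-truth vertex
    coordinate deviations for M test pairs. Returns the mean error."""
    s = pred - true                          # s_i = p_i - r_i
    return np.linalg.norm(s, axis=1).mean()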
After the weight parameters in the convolutional neural network model are trained through the method, the convolutional neural network model is obtained.
Subsequently, when the homography matrix between two images needs to be determined, the two images can be cut to the size and shape that meet the input requirements of the convolutional neural network model. For example, if the input of the convolutional neural network model requires 128 × 128 rectangles, the two images need to be cut into 128 × 128 rectangles; the two cut rectangular images must correspond to each other, that is, they must contain the same object. The pair of rectangular images having a homography correspondence is then input into the pre-established convolutional neural network model.
S102: and determining four vertex coordinates of the other rectangular image in the pair of rectangular images according to the deviation between the four vertex coordinates of the pair of rectangular images output by the convolutional neural network model and the known four vertex coordinates of one rectangular image in the pair of rectangular images.
In the embodiment of the application, after a pair of rectangular images with a homography correspondence relationship are input into a pre-established convolution neural network model, the deviation between the coordinates of four vertexes of the pair of rectangular images is finally output through the convolution neural network model.
The homography matrix calculation formula is specifically as follows:
(u′, v′, 1)ᵀ ∼ H · (u, v, 1)ᵀ, where H is a 3 × 3 matrix
where H is the homography matrix of the two images, and (u′, v′) and (u, v) are the coordinates of the same pixel in the two images. According to the homography matrix calculation formula, to determine the homography matrix of the two images, four pairs of corresponding coordinates must be known. The four vertex coordinates of one rectangular image of the pair can be determined definitely: taking the center point of that rectangular image as the origin, its four vertex coordinates are fixed. The corresponding vertex coordinates in the other rectangular image are then obtained by adding, to the known four vertex coordinates of the first rectangular image, the deviations between the vertex coordinates of the pair of rectangular images output by the model, so that the vertex coordinates corresponding to the known four vertex coordinates can be determined in the other rectangular image.
S103: and determining the homography matrix corresponding to the pair of rectangular images according to the known four vertex coordinates of the one rectangular image and the four vertex coordinates of the other rectangular image.
In the embodiment of the application, after the four pairs of coordinates are determined, the homography matrix of the two images can be determined according to the homography matrix calculation formula.
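Once the four vertex pairs are known, the homography can be recovered directly; a minimal sketch with OpenCV (the function name and the 128 × 128 vertex layout are illustrative):

```python
import cv2
import numpy as np

def homography_from_offsets(offsets, size=128):
    """offsets: 8-vector output by the model, the (dx, dy) of each of the
    four known vertices. Returns the 3x3 homography mapping image A to B."""
    src = np.float32([[0, 0], [size, 0], [size, size], [0, size]])  # known vertices of A
    dst = src + np.float32(offsets).reshape(4, 2)                   # vertices in B
    return cv2.getPerspectiveTransform(src, dst)                    # solves the 8-DOF H
```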
By this method, because the training image set used to train the convolutional neural network model is perturbed in brightness, blur, noise and sub-image position, and the influence of image quality on the precision of training and using the model is fully considered, the robustness and adaptive capability of the model can be improved, and the precision is higher than when a VGG-style network is used to determine the homography matrix corresponding to two images containing the same object.
It should be noted that, according to actual experimental tests, the size of the convolutional neural network model used in the present application is 12.52M with an average accuracy error of 5.3, whereas the size of the VGG-style network model used in the prior art is 260.91M with an average accuracy error of 9.2.
The above is a way of establishing a convolutional neural network model and a way of determining a homography matrix of any two images containing the same object according to the convolutional neural network model, and in practical application, the method can be widely applied to the actual life of people by establishing the convolutional neural network model and determining the homography matrix of any two images containing the same object according to the convolutional neural network model.
The first application is as follows:
In practical applications, a camera may shake during shooting, so that the captured picture undergoes a sudden, violent change. To prevent the captured picture from changing abruptly when the camera shakes, and to achieve a smooth change between frames, in the present application the coordinate deviations of the four vertices of two adjacent frames may be determined based on the trained convolutional neural network model, and the pictures may be corrected according to these coordinate deviations, so that the captured picture does not change abruptly and each frame changes smoothly.
The method comprises the following specific steps:
Starting with the frame before the captured picture jitters, the previous frame picture and the adjacent next frame picture (i.e., the frame sequence in fig. 7) are input in sequence into the convolutional neural network model established above, and the deviations between the vertex coordinates of the four vertices of the previous frame picture and the adjacent next frame picture are output (i.e., the frame sequence offsets in fig. 7). According to the deviations output by the convolutional neural network model for each pair of adjacent frames, the deviation between the four vertex coordinates of each frame picture and the first frame picture is determined (i.e., the camera motion trajectory in fig. 7). Based on the determined deviations, the corrected deviation between the four vertex coordinates of each frame picture and the first frame picture is determined (i.e., the camera motion trajectory in fig. 7 is smoothed). The four vertex correction coordinates of each frame picture are then determined from the corrected deviation and the known four vertex coordinates of the first frame picture, the homography matrix between two specified frame pictures is determined from the four vertex correction coordinates of each frame picture (i.e., the homography matrix transformation in fig. 7), and each frame picture is corrected.
It should be noted that determining the deviation between the four vertex coordinates of each frame picture and the first frame picture from the deviations output by the convolutional neural network model for each pair of adjacent frames is specifically: for any frame picture, the deviations between the four vertex coordinates of each adjacent previous/next frame pair located before that frame are determined, and the sum of these deviations is taken as the deviation between the four vertex coordinates of that frame picture and the first frame picture, that is,

p_t = Σ_{i=1}^{t} Δ_i,

where p_t is the deviation between the four vertex coordinates of frame t and the first frame picture and Δ_i is the deviation between the four vertex coordinates of the i-th frame picture and the (i−1)-th frame picture; the deviation between the four vertex coordinates of each frame picture and the first frame picture is thereby obtained, as shown in fig. 6.
In addition, it should be noted that, in the present application, determining the corrected deviation between the four vertex coordinates of each frame picture and the first frame picture based on the determined deviations (i.e., smoothing the camera motion trajectory in fig. 7) is specifically: starting with the frame before the captured picture jitters, the corrected deviation between the four vertex coordinates of the next frame picture and the first frame picture is determined in sequence from the corrected deviation between the four vertex coordinates of the previous frame picture and the first frame picture and the deviation between the four vertex coordinates of the next frame picture and the first frame picture, through the correction formula

p′_t = argmin_p (α‖p − p′_{t−1}‖ + (1 − α)‖p − p_t‖),

where p′_t is the corrected deviation between the four vertex coordinates of the next frame picture and the first frame picture, p_t is the deviation between the four vertex coordinates of the next frame picture and the first frame picture, p′_{t−1} is the corrected deviation between the four vertex coordinates of the previous frame picture and the first frame picture, and α is a weight coefficient used to balance stabilizing the picture against preserving the original picture, until the corrected deviation between the four vertex coordinates of each frame picture and the first frame picture is determined, as shown in fig. 6.
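If the norms in the correction formula are read as squared Euclidean norms (an assumption; the patent does not spell this out), the argmin has the closed form p′_t = α p′_{t−1} + (1 − α) p_t, i.e. exponential smoothing of the accumulated trajectory:

```python
import numpy as np

def smooth_trajectory(deviations, alpha=0.8):
    """deviations: list of 8-vectors p_t (accumulated vertex deviations of
    each frame w.r.t. the first frame). Returns the corrected p'_t series,
    assuming squared norms so the argmin reduces to a convex combination."""
    corrected = [np.asarray(deviations[0], dtype=np.float64)]
    for p_t in deviations[1:]:
        corrected.append(alpha * corrected[-1] + (1.0 - alpha) * np.asarray(p_t))
    return corrected
```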
Further, in the present application, determining the homography matrix between two specified frame pictures from the determined four vertex correction coordinates of each frame picture (i.e., the homography matrix transformation in fig. 7) and correcting each frame picture is specifically as follows. When the two specified frames are a previous frame picture and the adjacent next frame picture: the corrected orientation of the next frame picture is determined (where the corrected orientation comprises the corrected vertex coordinates of the four vertices of the next frame picture); for any frame picture, the homography matrix between the previous frame picture and that frame picture is determined through the homography matrix formula from the determined four vertex correction coordinates of that frame picture and the four vertex correction coordinates of the adjacent previous frame picture, and the next frame picture is corrected into the previous frame picture through the determined homography matrix; this process is repeated until the frame picture is corrected to be consistent with the first frame picture, and finally all frame pictures are corrected to be consistent with the first frame picture. When the two specified frames are the first frame picture and another frame picture: for any frame picture, the homography matrix between the first frame picture and that frame picture is determined through the homography matrix formula from the determined four vertex correction coordinates of that frame picture and the four vertex correction coordinates of the first frame picture, and the frame picture is corrected into the first frame picture through the determined homography matrix; this process is repeated until all frame pictures are corrected to be consistent with the first frame picture.
Further, after all the frame images are corrected to be consistent with the first frame image, the common content in all the frame images is cut (i.e., the image cutting output in fig. 7), so that a smoother and stable video can be obtained, that is, the smooth change of each frame of the shot image can be realized, and the whole process is specifically shown in fig. 7.
It should be noted that, for each corrected frame picture, the maximum inscribed rectangle of the non-black-edge portion is taken, the aspect ratio of this rectangle should match the display ratio, and in general a retention rate of at least 80% after cropping should be ensured.
For example, as shown in fig. 8, pictures (a) and (b) in fig. 8 are two adjacent frames of a live-action video captured by the same camera. For simplicity, only the two frames (a) and (b) are taken as an example; when jitter occurs the picture spans multiple frames, but the principle is the same as for these two frames. Picture (a) is the frame before the captured picture jitters (i.e., the first frame picture) and picture (b) is the adjacent next frame picture. Picture (a) and picture (b) are input into the convolutional neural network model established above, and the deviations between the vertex coordinates of the four vertices of picture (a) and picture (b) are output. For picture (b), the sum of the deviations between the four vertex coordinates of each adjacent pair of pictures located before picture (b), as output by the convolutional neural network model, is taken as the deviation between the four vertex coordinates of picture (b) and picture (a). Starting from the frame before the image jitters, the corrected deviation between the four vertex coordinates of picture (a) and picture (b) is determined in sequence through the correction formula p′_t = argmin_p (α‖p − p′_{t−1}‖ + (1 − α)‖p − p_t‖). The vertex correction coordinates of the four vertices of picture (b) are determined from the vertex correction coordinates of the four vertices of picture (a) and the corrected deviation between the four vertex coordinates of picture (a) and picture (b). Finally, the homography matrix between picture (a) and picture (b) is determined from the four vertex correction coordinates of picture (b) and the four vertex correction coordinates of picture (a) through the homography matrix formula, picture (b) is corrected into picture (a) through the determined homography matrix, the common content in all frame pictures is cropped, picture (c) in fig. 8 is obtained by cropping picture (b), and picture (c) replaces the original picture (b).
By this method, the captured picture does not undergo sudden violent shaking, and each frame of the captured picture changes smoothly.
The second application is as follows:
In practical applications, watching live video has gradually become an important form of entertainment in people's daily life. During a live video, in order to enhance the interaction between the virtual and the real, a preset virtual object can be displayed on the screen when the image of the camera at the current moment reaches a specified position, and the preset virtual object is not displayed on the screen when the image of the camera at the current moment does not reach the specified position.
The specific process is as follows:
Extract m frames of pictures at the current moment from the live video; for each picture in the m frames, sequentially input the picture and the target picture together into the convolutional neural network model established above, and output the deviation between the vertex coordinates of the four vertices of the picture and the target picture, namely the position deviations of four pairs of vertices, until the deviation V_i between the vertex coordinates of the four vertices of each of the m frames and the target picture is determined; then, according to the determined deviations V_i, determine through the formula

(1/m) Σ_{i=1}^{m} 1(‖V_i‖ < T) ≥ S

whether the image of the camera at the current moment has reached the specified position.

It should be noted that the target picture refers to the picture corresponding to the designated position, which is known and determined in advance. In addition, m in the formula refers to the number of frames of pictures, 1(·) is an indicator function that is 1 when ‖V_i‖ < T and 0 otherwise, and T and S are preset thresholds, which may be the same or different.
In addition, it should be noted that, after the deviation V_i between the vertex coordinates of the four vertices of each of the m frame pictures and the target picture is determined, whether the image of the camera at the current moment has reached the specified position can also be determined through the formula ‖V‖ < S, where ‖·‖ can be a 0-norm, 1-norm, or 2-norm distance, S is a preset threshold, and V is the deviation matrix [V_ij]_{m×8} of the current m frames of pictures. Of course, whether the image of the camera at the current moment has reached the specified position may also be determined according to another formula, as long as the formula can minimize the deviation between the image at the current moment and the target picture.
Further, when the deviations between the vertex coordinates of the four vertices of each of the m frame pictures and the target picture are determined to satisfy the formula, it is determined that the image of the camera at the current moment has reached the specified position; when they do not satisfy the formula, it is determined that the image of the camera at the current moment has not reached the specified position.
Further, when it is determined that the image of the camera at the current moment has reached the designated position, the preset virtual object is displayed on the screen; when it is determined that the image has not reached the designated position, the preset virtual object is not displayed on the screen, and the camera needs to keep moving until the image of the camera at the current moment is successfully matched with the target picture through the formula, which indicates that the image of the camera at the current moment has reached the designated position.
In this way, whether the image of the camera at the current moment has reached the designated position can be effectively determined.
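For illustration, the position check described above could be coded as in the following Python sketch; the deviation array layout and the threshold values T and S are assumptions.

import numpy as np

def image_at_target(deviations, T=4.0, S=0.8):
    # deviations: (m, 8) array; row i is V_i, the deviation of the four
    # vertex coordinate pairs of frame i from the target picture.
    per_frame_ok = np.linalg.norm(deviations, axis=1) < T  # f(V_i)
    return per_frame_ok.mean() >= S  # (1/m) * sum of f(V_i) >= S

def image_at_target_norm(deviations, S=10.0, ord=2):
    # Alternative check: ||V|| < S on the m-by-8 deviation matrix, flattened
    # so that 0-, 1- and 2-norms are all available.
    return np.linalg.norm(deviations.ravel(), ord=ord) < S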
In addition, in the process of using a camera to shoot a panoramic picture, it is inevitable that the camera cannot be kept steady on the same horizontal line, so that the stitching of adjacent previous and next frames is unstable. Therefore, in the present application, the homography matrix of two adjacent pictures can be determined based on the trained convolutional neural network model, the next frame can be adjusted to the angle of the previous frame, and the adjacent previous and next frames can be stitched stably.
The method comprises the following specific steps:
the previous frame picture and the adjacent next frame picture are input into the established convolutional neural network model, and the deviation between the vertex coordinates of the four vertices of the previous frame picture and the adjacent next frame picture is output. The vertex coordinates of the four vertices of the next frame picture are determined according to the vertex coordinates of the four vertices of the previous frame picture and this deviation. A homography matrix between the previous frame picture and the adjacent next frame picture is then determined through a homography matrix calculation formula according to the four pairs of vertex coordinates of the previous frame picture and the adjacent next frame picture. Finally, each pixel in the adjacent next frame picture is converted through the determined homography matrix into a rectified picture, and the rectified picture is stitched with the previous frame picture.
In this way, the adjacent previous and next frames can be stitched stably.
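A hypothetical Python sketch of this stitching step follows; predict_deviation stands in for the trained convolutional neural network model (an assumed interface), and the double-width canvas is illustrative.

import cv2
import numpy as np

def stitch_adjacent(prev_frame, next_frame, predict_deviation):
    # predict_deviation(prev, next) -> (8,) vertex-coordinate deviations.
    h, w = prev_frame.shape[:2]
    prev_pts = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    next_pts = prev_pts + predict_deviation(prev_frame, next_frame).reshape(4, 2)
    # Homography mapping the next frame back into the previous frame's view.
    H = cv2.getPerspectiveTransform(next_pts.astype(np.float32), prev_pts)
    canvas = cv2.warpPerspective(next_frame, H, (2 * w, h))  # rectified picture
    canvas[:, :w] = prev_frame  # overlay the previous frame on the shared canvas
    return canvas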
Based on the same idea as the virtual object display method based on a convolutional neural network provided by the embodiment of the present application, an embodiment of the present application further provides a virtual object display device based on a convolutional neural network.
As shown in fig. 9, a virtual object display device based on a convolutional neural network according to an embodiment of the present application includes:
an obtaining module 901, configured to obtain each frame of picture shot by the camera at the current time;
an output module 902, configured to, for any frame picture of the frame pictures, input the frame picture and a target picture into a pre-established convolutional neural network model, and output the deviation between the coordinates of the four vertices of the frame picture and the target picture;
a display module 903, configured to display the virtual object when a deviation between coordinates of four vertices of each frame of the picture and the target picture is smaller than a threshold.
The device further comprises:
a model training module 904, configured to make a training image set before the output module 902 inputs the frame picture and the target picture into the pre-established convolutional neural network model, where the training image set includes at least one pair of rectangular images with a homography correspondence, initialize each weight parameter in the convolutional neural network model to be trained, input the at least one pair of rectangular images with a homography correspondence into the convolutional neural network model to be trained, and train each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of the at least one pair of rectangular images with a homography correspondence output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images with a homography correspondence, so as to obtain the convolutional neural network model.
The at least one pair of rectangular images with the homography correspondence are both gray level images, and/or the at least one pair of rectangular images with the homography correspondence comprise the center points of the images and have the same size.
The device also includes:
a perturbation module 905, configured to perturb at least one of brightness, ambiguity, noise, and sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence.
The kernel size of the last pooling layer within the convolutional neural network model is 4x4, and the number of channels of the convolutional layer's convolutional kernel is 64.
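To make these two hyperparameters concrete, an illustrative PyTorch sketch is given below; only the 64-channel convolution kernels and the final 4x4 pooling kernel come from the text, while the layer count, the 128x128 input resolution and the regression head are assumptions.

import torch
import torch.nn as nn

class HomographyNet(nn.Module):
    # Two stacked grayscale pictures in, 8 vertex-coordinate deviations out.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),  # last pooling layer: 4x4 kernel
        )
        self.head = nn.Linear(64 * 8 * 8, 8)  # assumes 128x128 inputs

    def forward(self, pair):  # pair: (N, 2, 128, 128)
        x = self.features(pair)
        return self.head(x.flatten(1))  # (N, 8) vertex deviations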
The model training module 904 is further configured to input the rectangular images in the training image set having the homography correspondence to the convolutional neural network model to be trained according to a random gradient descent method, and construct a loss function according to a deviation of vertex coordinates of the rectangular images in the training image set having the homography correspondence output by the convolutional neural network model to be trained and a difference between the vertex coordinates of the rectangular images in the training image set having the homography correspondence until the loss function conforms to a preset model precision value.
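A minimal training-loop sketch under the above description follows; the use of mean-squared error as the loss built from the difference between predicted and true vertex-coordinate deviations is an assumption, and the loader interface is hypothetical.

import torch

def train(model, loader, epochs=10, base_lr=0.01):
    # loader yields (pair, true_dev): a stacked image pair and its (8,)
    # ground-truth vertex-coordinate deviation.
    opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for pair, true_dev in loader:
            opt.zero_grad()
            loss = loss_fn(model(pair), true_dev)  # predicted vs. true deviation
            loss.backward()
            opt.step()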
The perturbation module 905 is specifically configured to perturb the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence in the following manner: generating a random number r for a rectangular image to be perturbed, and determining a new gray value of each pixel point in the rectangular image according to the generated random number r through the formula P' = P × (1.0 + r), wherein P' represents the new gray value, P represents the original gray value, and r represents the random number; the manner of perturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a random number a for a rectangular image to be perturbed, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius; the manner of perturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a density random number and an intensity random number for a rectangular image to be perturbed, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number; the manner of perturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: for a rectangular image to be perturbed, randomly selecting two sub-images of the same size at different positions in the image, and exchanging all pixels in the two sub-images.
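The four perturbations could look like the following Python sketch; all random ranges and the patch size are illustrative assumptions.

import cv2
import numpy as np

def perturb(img, rng=None):
    # img: single-channel grayscale image (uint8).
    rng = rng if rng is not None else np.random.default_rng()
    # Brightness: P' = P * (1.0 + r)
    r = rng.uniform(-0.3, 0.3)
    img = np.clip(img.astype(np.float32) * (1.0 + r), 0, 255).astype(np.uint8)
    # Blur: Gaussian blur with random radius a
    a = int(rng.integers(1, 4))
    img = cv2.GaussianBlur(img, (2 * a + 1, 2 * a + 1), 0)
    # Noise: salt-and-pepper with random density, pixels set to 0 or 255
    density = rng.uniform(0.0, 0.02)
    mask = rng.random(img.shape) < density
    img[mask] = rng.choice([0, 255], size=int(mask.sum()))
    # Sub-image swap: exchange two equally sized patches at random positions
    h, w = img.shape[:2]
    ph, pw = h // 4, w // 4
    y1, x1 = rng.integers(0, h - ph), rng.integers(0, w - pw)
    y2, x2 = rng.integers(0, h - ph), rng.integers(0, w - pw)
    patch1 = img[y1:y1 + ph, x1:x1 + pw].copy()
    img[y1:y1 + ph, x1:x1 + pw] = img[y2:y2 + ph, x2:x2 + pw]
    img[y2:y2 + ph, x2:x2 + pw] = patch1
    return img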
The descent strategy used in the stochastic gradient descent method is: lr = base_lr × (1 − iter/max_iter)^power, wherein lr is the current learning rate, iter is the current iteration number, max_iter is the maximum iteration number, power is a parameter controlling how fast the learning rate decreases, and base_lr is the base learning rate; and/or, the model accuracy is calculated according to the following formula:
precision = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, wherein M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of the pair of rectangular images i.
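A small Python sketch of both formulas is given below; the learning-rate expression matches the standard 'poly' decay policy implied by the listed parameters, and the averaged deviation norm for the precision is an assumption consistent with the definitions of M, p_i and r_i.

import numpy as np

def poly_lr(iter_, max_iter, base_lr=0.01, power=0.9):
    # lr decays from base_lr toward 0 as iter_ approaches max_iter.
    return base_lr * (1.0 - iter_ / max_iter) ** power

def model_precision(pred, true):
    # pred, true: (M, 8) predicted / true vertex deviations; s_i = p_i - r_i.
    return np.linalg.norm(pred - true, axis=1).mean()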
The display module 903 is specifically configured to display the virtual object when the deviations between the coordinates of the four vertices of each frame picture and the target picture satisfy the error formula (1/m) Σ_{i=1}^{m} f(V_i) ≥ S, wherein m refers to the number of frame pictures acquired at the current moment, f(V_i) = 1 when ‖V_i‖ < T and f(V_i) = 0 otherwise, T is a preset first threshold, S is a preset second threshold, and V_i is the deviation between the coordinates of the four vertices of the i-th frame picture and the target picture.
The display module 903 is specifically configured to display the virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula ‖V‖ < S, wherein ‖·‖ is a 0-norm, 1-norm or 2-norm distance formula, S is a preset third threshold, and V is a deviation matrix formed by the deviations between the four vertex coordinates of each frame picture and the target picture.
In addition, the embodiment of the present application further provides a virtual object display system based on a convolutional neural network, and the system includes:
a processor, a computer readable memory, and a computer readable storage medium;
and a program for acquiring each frame of picture shot by the camera at the current moment, inputting, for any frame picture of the frame pictures, the frame picture and the target picture into a pre-established convolutional neural network model, outputting the deviation between the coordinates of the four vertices of the frame picture and the target picture, and displaying the virtual object when the deviation between the coordinates of the four vertices of each frame picture and the target picture is less than a threshold value.
The program is stored on the computer readable storage medium for execution by the processor via the computer readable memory.
The processor, the computer readable memory, and the computer readable storage medium may be implemented by the processor, the internal memory, and the external memory of fig. 10.
Fig. 10 is a block diagram of a virtual object display system based on a convolutional neural network, in which the main components of the system are shown. In fig. 10, the processor 1010, the internal memory 1005, the bus bridge 1020, and the network interface 1015 are coupled to the system bus 1040; the bus bridge 1020 couples the system bus 1040 and the I/O bus 1045; the I/O interface 1030 is coupled to the I/O bus 1045; and the USB interface and the external memory are coupled to the I/O interface. In fig. 10, the processor 1010 may be one or more processors, each of which may have one or more processor cores. The internal memory 1005 is a volatile memory such as a register, a buffer, or various types of random access memory; while the virtual object display system based on a convolutional neural network is running, the data in the internal memory 1005 includes an operating system and application programs. The network interface 1015 may be an Ethernet interface, a fiber interface, or the like. The system bus 1040 may be used to communicate data information, address information, and control information. The bus bridge 1020 may be used to perform protocol conversion, converting the system bus protocol to the I/O protocol or vice versa for data transfer. The I/O bus 1045 is used for data information and control information, and bus termination resistors or circuits may be used to reduce signal reflection interference. The I/O interface 1030 mainly connects various external devices, such as a keyboard, a mouse, and sensors; a flash memory can be connected to the I/O bus through the USB interface; and the external memory is a nonvolatile memory such as a hard disk or an optical disk. After the virtual object display system based on a convolutional neural network starts running, the processor can read the data stored in the external memory into the internal memory and process the system instructions stored in the internal memory, thereby completing the functions of the operating system and application programs. The example virtual object display system based on a convolutional neural network may be located on a desktop, a laptop, a tablet, a smartphone, or the like.
Preferably, the program is further configured to, before the frame picture and the target picture are input into a pre-established convolutional neural network model, make a training image set, where the training image set includes at least one pair of rectangular images having a homography correspondence, initialize each weight parameter in the convolutional neural network model to be trained, input the at least one pair of rectangular images having a homography correspondence into the convolutional neural network model to be trained, train each weight parameter in the convolutional neural network model to be trained according to a deviation of vertex coordinates of the at least one pair of rectangular images having a homography correspondence output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images having a homography correspondence, and obtain the convolutional neural network model.
Preferably, the program is further configured to determine that the at least one pair of rectangular images having a homography correspondence are both grayscale images, and/or that the at least one pair of rectangular images having a homography correspondence include a center point of the images and have the same size.
Preferably, the program is further configured to perturb at least one of brightness, blur, noise, and sub-image position of one of the at least one pair of rectangular images having a homographic correspondence.
Preferably, the program is further configured to determine that the kernel size of the last pooling layer in the convolutional neural network model is 4x4, and the number of channels of the convolutional kernel of the convolutional layer is 64.
Preferably, the program is further configured to input the rectangular images in the training image set having the homography correspondence to the convolutional neural network model to be trained according to a random gradient descent method, and construct the loss function according to a deviation of vertex coordinates of the rectangular images in the training image set having the homography correspondence output by the convolutional neural network model to be trained and a difference between the vertex coordinates of the rectangular images in the training image set having the homography correspondence, until the loss function conforms to a preset model precision value.
Preferably, the program is further configured to perturb the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence in the following manner: generating a random number r for a rectangular image to be perturbed, and determining a new gray value of each pixel point in the rectangular image according to the generated random number r through the formula P' = P × (1.0 + r), wherein P' represents the new gray value, P represents the original gray value, and r represents the random number; the manner of perturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a random number a for a rectangular image to be perturbed, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius; the manner of perturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a density random number and an intensity random number for a rectangular image to be perturbed, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number; the manner of perturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: for a rectangular image to be perturbed, randomly selecting two sub-images of the same size at different positions in the image, and exchanging all pixels in the two sub-images.
Preferably, the program is further configured such that the descent strategy used in the stochastic gradient descent method is: lr = base_lr × (1 − iter/max_iter)^power, wherein lr is the current learning rate, iter is the current iteration number, max_iter is the maximum iteration number, power is a parameter controlling how fast the learning rate decreases, and base_lr is the base learning rate; and/or, the model accuracy is calculated according to the following formula:
precision = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, wherein M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of the pair of rectangular images i.
Preferably, the program is further configured to display the virtual object when the deviations between the coordinates of the four vertices of each frame picture and the target picture satisfy the error formula (1/m) Σ_{i=1}^{m} f(V_i) ≥ S, wherein m refers to the number of frame pictures acquired at the current moment, f(V_i) = 1 when ‖V_i‖ < T and f(V_i) = 0 otherwise, T is a preset first threshold, S is a preset second threshold, and V_i is the deviation between the coordinates of the four vertices of the i-th frame picture and the target picture.
Preferably, the program is further configured to display the virtual object when the deviations between the four vertex coordinates of each frame picture and the target picture satisfy the error formula ‖V‖ < S, wherein ‖·‖ is a 0-norm, 1-norm or 2-norm distance formula, S is a preset third threshold, and V is a deviation matrix formed by the deviations between the four vertex coordinates of each frame picture and the target picture.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (18)

1. A virtual object display method based on a convolutional neural network is characterized by comprising the following steps:
acquiring each frame of picture shot by a camera at the current moment;
inputting a pre-established convolutional neural network model into any frame of picture in each frame of picture and a target picture, and outputting the deviation between coordinates of four vertexes of the frame of picture and the target picture, wherein the target picture refers to a picture corresponding to a specified position; the convolutional neural network model is obtained by training at least one pair of rectangular images with homography correspondence;
when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is smaller than a threshold value, determining that the image of the camera at the current moment reaches a specified position, and displaying a virtual object on a screen;
when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is not less than the threshold, determining that the image of the camera at the current moment does not reach the specified position, and continuing to move the camera until the deviation between the coordinates of the four vertexes of the image of the camera at the current moment and the target picture is less than the threshold;
when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is less than a threshold, displaying a virtual object, specifically comprising:
when the deviations between the coordinates of the four vertexes of each frame picture and the target picture satisfy the error formula (1/m) Σ_{i=1}^{m} f(V_i) ≥ S, displaying the virtual object, wherein m refers to the number of frame pictures acquired at the current moment, f(V_i) = 1 when ‖V_i‖ < T and f(V_i) = 0 otherwise, T is a preset first threshold value, S is a preset second threshold value, and V_i is the deviation between the coordinates of the four vertices of the i-th frame picture and the target picture.
2. The method of claim 1, wherein prior to inputting the frame picture and the target picture into the pre-established convolutional neural network model, the method further comprises:
making a training image set, wherein the training image set comprises at least one pair of rectangular images with homography correspondence;
initializing each weight parameter in a convolutional neural network model to be trained;
inputting the at least one pair of rectangular images with the homography correspondence into a convolutional neural network model to be trained;
and training each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation to obtain the convolutional neural network model.
3. The method according to claim 2, wherein the at least one pair of rectangular images with homography correspondence are both gray scale images, and/or the at least one pair of rectangular images with homography correspondence comprise the center points of the images and have the same size.
4. The method of claim 2, wherein the method comprises: perturbing at least one of brightness, ambiguity, noise and sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence.
5. The method of any of claims 1-4, wherein a kernel size of a last pooling layer within the convolutional neural network model is 4x4, and a number of channels of convolutional layer convolutional kernel is 64.
6. The method of claim 2, wherein the training of the weight parameters within the convolutional neural network model to be trained comprises:
inputting the rectangular images with the homography corresponding relation in the training image set into the convolutional neural network model to be trained according to a random gradient descent method;
and constructing a loss function according to the deviation of the vertex coordinates of the rectangular images with the homography corresponding relation in the training image set output by the convolutional neural network model to be trained and the difference value between the vertex coordinates of the rectangular images with the homography corresponding relation in the training image set until the loss function accords with a preset model precision value.
7. The method of claim 4,
the manner of perturbing the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a random number r for a rectangular image to be perturbed, and determining a new gray value of each pixel point in the rectangular image according to the generated random number r through the formula P' = P × (1.0 + r), wherein P' represents the new gray value, P represents the original gray value, and r represents the random number;

the manner of perturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a random number a for a rectangular image to be perturbed, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius;

the manner of perturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a density random number and an intensity random number for a rectangular image to be perturbed, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number;

the manner of perturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: for a rectangular image to be perturbed, randomly selecting two sub-images of the same size at different positions in the image, and exchanging all pixels in the two sub-images.
8. The method of claim 6, wherein the descent strategy used in the stochastic gradient descent method is: lr = base_lr × (1 − iter/max_iter)^power, wherein lr is the current learning rate, iter is the current iteration number, max_iter is the maximum iteration number, power is a parameter controlling how fast the learning rate decreases, and base_lr is the base learning rate; and/or

the model accuracy is calculated according to the following formula: precision = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, wherein M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of the pair of rectangular images i.
9. The method according to claim 1, wherein when the deviation between the coordinates of the four vertices of each frame and the target frame is smaller than a threshold, displaying the virtual object specifically comprises:

when the deviations between the coordinates of the four vertexes of each frame picture and the target picture satisfy the error formula ‖V‖ < S, displaying the virtual object, wherein ‖·‖ is a 0-norm, 1-norm or 2-norm distance formula, S is a preset third threshold, and V is a deviation matrix formed by the deviations between the coordinates of the four vertexes of each frame picture and the target picture.
10. A virtual object display device based on a convolutional neural network, comprising:
the acquisition module is used for acquiring each frame of picture shot by the camera at the current moment;
the output module is used for inputting the frame picture and a target picture into a pre-established convolutional neural network model aiming at any frame picture in each frame picture, and outputting the deviation between the coordinates of four vertexes of the frame picture and the target picture, wherein the target picture refers to a picture corresponding to a specified position; the convolutional neural network model is obtained by training at least one pair of rectangular images with homography correspondence;
the display module is used for displaying the virtual object when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is less than a threshold value; when the deviation between the coordinates of the four vertexes of each frame of picture and the target picture is not less than the threshold, determining that the image of the camera at the current moment does not reach the specified position, and continuing to move the camera until the deviation between the coordinates of the four vertexes of the image of the camera at the current moment and the target picture is less than the threshold;
the display module is specifically configured to display the virtual object when the deviations between the coordinates of the four vertexes of each frame picture and the target picture satisfy the error formula (1/m) Σ_{i=1}^{m} f(V_i) ≥ S, wherein m refers to the number of frames acquired at the current moment, f(V_i) = 1 when ‖V_i‖ < T and f(V_i) = 0 otherwise, T is a preset first threshold value, S is a preset second threshold value, and V_i is the deviation between the coordinates of the four vertices of the i-th frame picture and the target picture.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the model training module is used for making a training image set before the frame picture and the target picture are input into a pre-established convolutional neural network model by the output module, wherein the training image set comprises at least one pair of rectangular images with a homography corresponding relation, initializing each weight parameter in the convolutional neural network model to be trained, inputting the at least one pair of rectangular images with the homography corresponding relation into the convolutional neural network model to be trained, and training each weight parameter in the convolutional neural network model to be trained according to the deviation of the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation output by the convolutional neural network model to be trained and the vertex coordinates of the at least one pair of rectangular images with the homography corresponding relation to obtain the convolutional neural network model.
12. The apparatus according to claim 11, wherein the at least one pair of rectangular images having homography correspondence are both gray scale images, and/or the at least one pair of rectangular images having homography correspondence include a center point of the images and are the same size.
13. The apparatus of claim 11, further comprising:
and the perturbation module is used for perturbing at least one of the brightness, the fuzziness, the noise and the sub-image position of one rectangular image in the at least one pair of rectangular images with the homography correspondence.
14. The apparatus of any one of claims 10-13, wherein a kernel size of a last pooling layer within the convolutional neural network model is 4x4, and a number of channels of convolutional layer convolutional kernel is 64.
15. The apparatus according to claim 11, wherein the model training module is further configured to input the rectangular images in the training image set with the homography correspondence into the convolutional neural network model to be trained according to a stochastic gradient descent method, and construct the loss function according to a deviation of vertex coordinates of the rectangular images in the training image set with the homography correspondence output by the convolutional neural network model to be trained and a difference between the vertex coordinates of the rectangular images in the training image set with the homography correspondence until the loss function conforms to a preset model precision value.
16. The apparatus according to claim 13, wherein the perturbation module is specifically configured to perturb the brightness of one rectangular image of the at least one pair of rectangular images having a homography correspondence in the following manner: generating a random number r for a rectangular image to be perturbed, and determining a new gray value of each pixel point in the rectangular image according to the generated random number r through the formula P' = P × (1.0 + r), wherein P' represents the new gray value, P represents the original gray value, and r represents the random number; the manner of perturbing the blur of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a random number a for a rectangular image to be perturbed, and performing Gaussian blurring on the rectangular image with the random number a as the blur radius; the manner of perturbing the noise of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: generating a density random number and an intensity random number for a rectangular image to be perturbed, and generating salt-and-pepper noise in the rectangular image according to the density random number and the intensity random number; the manner of perturbing the sub-image position of one rectangular image of the at least one pair of rectangular images having a homography correspondence is: for a rectangular image to be perturbed, randomly selecting two sub-images of the same size at different positions in the image, and exchanging all pixels in the two sub-images.
17. The apparatus of claim 15, wherein the descent strategy used in the stochastic gradient descent method is: lr = base_lr × (1 − iter/max_iter)^power, wherein lr is the current learning rate, iter is the current iteration number, max_iter is the maximum iteration number, power is a parameter controlling how fast the learning rate decreases, and base_lr is the base learning rate; and/or, the model accuracy is calculated according to the following formula: precision = (1/M) Σ_{i=1}^{M} ‖s_i‖, with s_i = p_i − r_i, wherein M is the number of samples in the test set, p_i is the predicted deviation of the vertex coordinates of a pair of rectangular images i, and r_i is the true deviation of the vertex coordinates of the pair of rectangular images i.
18. The apparatus of claim 10, wherein the display module is specifically configured to display the virtual object when the deviations between the coordinates of the four vertexes of each frame picture and the target picture satisfy the error formula ‖V‖ < S, wherein ‖·‖ is a 0-norm, 1-norm or 2-norm distance formula, S is a preset third threshold, and V is a deviation matrix formed by the deviations between the coordinates of the four vertexes of each frame picture and the target picture.
CN201710765514.XA 2017-08-30 2017-08-30 Virtual object display method and device based on convolutional neural network Active CN107564063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710765514.XA CN107564063B (en) 2017-08-30 2017-08-30 Virtual object display method and device based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710765514.XA CN107564063B (en) 2017-08-30 2017-08-30 Virtual object display method and device based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN107564063A CN107564063A (en) 2018-01-09
CN107564063B true CN107564063B (en) 2021-08-13

Family

ID=60978479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710765514.XA Active CN107564063B (en) 2017-08-30 2017-08-30 Virtual object display method and device based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN107564063B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10475113B2 (en) 2014-12-23 2019-11-12 Ebay Inc. Method system and medium for generating virtual contexts from three dimensional models
WO2019136761A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Three-dimensional convolution device for recognizing human action
JP6658795B2 (en) * 2018-05-11 2020-03-04 セイコーエプソン株式会社 Machine learning device, photographing time estimation device, machine learning program, and method for producing photograph data
US11803664B2 (en) 2018-10-09 2023-10-31 Ebay Inc. Distributed application architectures using blockchain and distributed file systems
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN110602393B (en) * 2019-09-04 2020-06-05 南京博润智能科技有限公司 Video anti-shake method based on image content understanding
CN114998583B (en) * 2022-05-11 2024-07-16 平安科技(深圳)有限公司 Image processing method, image processing apparatus, device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4874280B2 (en) * 2008-03-19 2012-02-15 三洋電機株式会社 Image processing apparatus and method, driving support system, and vehicle
CN106485192B (en) * 2015-09-02 2019-12-06 富士通株式会社 Training method and device of neural network for image recognition
CN106548127B (en) * 2015-09-18 2022-11-04 松下电器(美国)知识产权公司 Image recognition method
CN105610843A (en) * 2015-12-31 2016-05-25 广东顺德中山大学卡内基梅隆大学国际联合研究院 Remote camera image real-time sharing method and system
CN107240117B (en) * 2017-05-16 2020-05-15 上海体育学院 Method and device for tracking moving object in video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Javanese vowels sound classification with convolutional neural network;Chandra Kusuma Dewa;《2016 International Seminar on Intelligent Technology and Its Applications (ISITIA)》;20170123;第123-128页 *
Research on Video Retrieval Technology Based on 3D Convolutional Neural Networks; Lyu Yaoyao; China Masters' Theses Full-text Database, Information Science and Technology; 20170615; I140-66 *
Point Cloud Registration Method Based on Convolutional Neural Networks; Shu Chengxun et al.; Laser & Optoelectronics Progress; 20170331; Vol. 54, No. 3; pp. 031001-1 to 031001-9 *

Also Published As

Publication number Publication date
CN107564063A (en) 2018-01-09

Similar Documents

Publication Publication Date Title
CN107689035B (en) Homography matrix determination method and device based on convolutional neural network
CN107566688B (en) Convolutional neural network-based video anti-shake method and device and image alignment device
CN107564063B (en) Virtual object display method and device based on convolutional neural network
CN112348815B (en) Image processing method, image processing apparatus, and non-transitory storage medium
US8102428B2 (en) Content-aware video stabilization
US20210004962A1 (en) Generating effects on images using disparity guided salient object detection
CN109753971B (en) Correction method and device for distorted text lines, character recognition method and device
CN113286194A (en) Video processing method and device, electronic equipment and readable storage medium
US11132800B2 (en) Real time perspective correction on faces
US10970821B2 (en) Image blurring methods and apparatuses, storage media, and electronic devices
JP2022515517A (en) Image depth estimation methods and devices, electronic devices, and storage media
WO2024001360A1 (en) Green screen matting method and apparatus, and electronic device
CN113643414A (en) Three-dimensional image generation method and device, electronic equipment and storage medium
CN114429191B (en) Electronic anti-shake method, system and storage medium based on deep learning
US11770551B2 (en) Object pose estimation and tracking using machine learning
CN112529006B (en) Panoramic picture detection method, device, terminal and storage medium
CN113506305A (en) Image enhancement method, semantic segmentation method and device for three-dimensional point cloud data
CN108734712B (en) Background segmentation method and device and computer storage medium
Ahn et al. Implement of an automated unmanned recording system for tracking objects on mobile phones by image processing method
US20230290061A1 (en) Efficient texture mapping of a 3-d mesh
US20230216999A1 (en) Systems and methods for image reprojection
CN110689609A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115564639A (en) Background blurring method and device, computer equipment and storage medium
WO2023023960A1 (en) Methods and apparatus for image processing and neural network training
US20220114740A1 (en) Camera motion information based three-dimensional (3d) reconstruction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210114

Address after: 511442 3108, 79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 511442 24 floors, B-1 Building, Wanda Commercial Square North District, Wanbo Business District, 79 Wanbo Second Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant