RU2665273C2 - Trained visual markers and the method of their production - Google Patents


Info

Publication number: RU2665273C2
Application number: RU2016122082A
Authority: RU (Russia)
Prior art keywords: neural network, images, visual markers
Other languages: Russian (ru)
Other versions: RU2016122082A (en), RU2016122082A3 (en)
Inventor: Viktor Sergeevich Lempitsky
Original Assignee: Autonomous non-profit educational organization of higher education "Skolkovo Institute of Science and Technology"
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2016-06-03
Filing date: 2016-06-03
Application filed by: Autonomous non-profit educational organization of higher education "Skolkovo Institute of Science and Technology"
Publication of application RU2016122082A: 2017-12-07; publication of RU2016122082A3: 2018-07-13
Application granted; publication of patent RU2665273C2: 2018-08-28

Classifications

    • G06N3/0454: Computer systems based on biological models using neural network models; architectures, e.g. interconnection topology, using a combination of multiple neural nets
    • G05B13/027: Adaptive control systems, i.e. systems automatically adjusting themselves to have optimum performance according to a preassigned criterion, the criterion being a learning criterion, using neural networks only
    • G06K9/2063: Image acquisition; selective acquisition/locating/processing of specific regions, based on a marking or identifier characterising the document or the area
    • G06K9/6256: Design or setup of recognition systems and techniques; obtaining sets of training patterns, e.g. bagging, boosting
    • G06K9/74: Arrangements for recognition using optical reference masks
    • G06N3/006: Artificial life, i.e. computers simulating life, based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation
    • G06T1/0014: General purpose image data processing; image feed-back for automatic industrial control, e.g. robot with camera
    • G06T1/0021: Image watermarking
    • G06T11/001: 2D [two-dimensional] image generation; texturing; colouring; generation of texture or colour

Abstract

FIELD: computer engineering.

SUBSTANCE: the group of inventions relates to the field of computing technology, in particular to visual markers and methods for their production, which can be used in robotics and in virtual and augmented reality. The method comprises the steps of: forming a synthesizing neural network that translates a sequence of bits into images of visual markers; forming a rendering neural network that converts input images of visual markers into images containing visual markers; forming a recognizing neural network that translates images containing visual markers into a sequence of bits; jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correctly recognizing random bit sequences; synthesizing visual markers by passing bit sequences through the trained synthesizing neural network; receiving a set of images of visual markers from a video data source; and extracting the encoded bit sequences from the resulting set of visual marker images with the recognizing neural network.

EFFECT: increased accuracy of recognizing and localizing visual markers.

21 claims, 11 drawings

Description

FIELD OF TECHNOLOGY
This technical solution relates generally to the field of computing technology, and in particular to visual markers and methods for their production, which can be used in robotics and in virtual and augmented reality.
BACKGROUND
Currently, visual markers (also known as visual fiducials or visual codes) are used to make environments easier for humans and robots to deal with and to assist computer vision algorithms in scenarios that are resource-constrained and/or mission-critical. Visual markers known from the prior art include simple (linear) barcodes and their two-dimensional (matrix) counterparts, such as QR codes or Aztec codes, which are used to embed pieces of visual information in objects and scenes. AprilTags visual markers (Fig. 6) and similar designs, which are a popular way to simplify the identification of locations, objects and agents for robots, are widely used in robotics. In augmented reality, ARCodes visual markers and similar designs are used to provide camera position estimates with high accuracy and low latency on budget devices. In general, such markers can embed visual information into the environment more compactly and in a language-independent way, and they can be recognized and used by autonomous as well as human-operated devices.
However, all visual markers known in the prior art are designed heuristically, based on considerations of ease of recognition by computer (machine) vision algorithms. For each newly created family of markers, recognizer algorithms are then designed and tuned, whose purpose is to ensure reliable localization and interpretation of the visual markers. The creation of visual markers and of their recognizers is thus split into two stages, and this separation is not optimal (the specific appearance of the markers is not optimal from the point of view of the recognizer in the mathematical sense). In addition, the aesthetic aspect is neglected when creating visual markers, which leads to "annoying" visual markers that in many cases do not match the style of the environment in which they are placed, or of the goods to which they are applied, and make the appearance of this environment or these products "computer friendly" (easy to recognize) but not "human friendly".
SUMMARY OF THE INVENTION
This technical solution is aimed at eliminating the disadvantages inherent in solutions known from the prior art.
The technical problem posed in this technical solution is the creation of families of visual markers that are free from the above problems of the prior art.
The technical result is an increase in the accuracy of recognition of visual markers, achieved by taking into account perspective distortion, blending with the background, low resolution, image blur, etc. during the training of the neural network. All such effects are modeled during training as piecewise differentiable transformations.
An additional technical result obtained in solving the above technical problem is an increase in the similarity of the visual marker to the visual style of a room interior or a product design.
The specified technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which a synthesizing neural network is formed that translates a sequence of bits into images of visual markers; a rendering neural network is formed that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations; a recognizing neural network is formed that translates images containing visual markers into a sequence of bits; the synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correct recognition of random bit sequences; a set of images of visual markers is received from a video source; and the encoded bit sequences are extracted from the obtained set of images of visual markers by means of the recognizing neural network.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed on top of background images.
In some embodiments of the technical solution, the synthesizing neural network consists of one linear layer, followed by an element-wise sigmoid function.
In some embodiments of the technical solution, the synthesizing and / or recognizing neural network has a convolutional form (being a convolutional neural network).
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments of the technical solution, the synthesizing, rendering and recognizing neural networks are feedforward networks.
Also, the specified technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which variables corresponding to the pixel values of the created visual markers are created; a rendering neural network is formed that converts the pixel values of visual markers into images containing visual markers by means of geometric and photometric transformations; a recognizing neural network is formed that translates images containing visual markers into a sequence of bits; the synthesizing, rendering and recognizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correct recognition of random bit sequences; a set of images of visual markers is received from a video source; and marker class numbers are retrieved from the resulting set of visual marker images.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed in the center of the background image.
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments, the rendering and recognizing neural networks are feedforward networks.
Also, the indicated technical result is achieved by implementing a method of producing a family of visual markers encoding information, in which variables corresponding to the pixel values of the created visual marker are created; a rendering neural network is formed that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations; a localizing neural network is formed that translates images containing the marker into marker position parameters; the synthesizing, rendering and localizing neural networks are trained jointly by minimizing a loss function that reflects the probability of correctly finding the marker position in the image; a set of images of visual markers is received from a video source; and the encoded bit sequences are extracted from the obtained set of images of visual markers by means of a recognizing neural network.
In some embodiments of the technical solution, the rendering neural network converts input images of visual markers into images containing the visual markers placed in the center of the background image.
In some embodiments of the technical solution, during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
In some embodiments of the technical solution, during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
In some embodiments of the technical solution, minimization of the loss function is performed using a stochastic gradient descent algorithm.
In some embodiments of the technical solution, the bit sequences during training are sampled uniformly from the set of vertices of the Boolean cube.
In some embodiments of the technical solution, the localizing, rendering and recognizing neural networks are feedforward networks.
BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of this technical solution will become apparent from the following detailed description and the accompanying drawings, in which:
FIG. 1 shows an example implementation of a method for creating and recognizing a visual marker;
FIG. 2 shows the rendering neural network. The input marker M (left) is passed through several stages (all of them piecewise differentiable); the outputs T(M; φ) for several random nuisance parameters φ are shown on the right. Using piecewise differentiable transformations in T makes it possible to train with error backpropagation;
FIG. 3 shows visual markers obtained with this technical solution. The captions in the figure show the marker size, the capacity of the resulting encoding (in bits), and the accuracy achieved during training. In each case, six markers are shown: (1) a marker corresponding to the all-zeros bit sequence; (2) a marker corresponding to the all-ones bit sequence; (3) and (4) markers corresponding to two random bit sequences differing in one bit; (5) and (6) two markers corresponding to two more random bit sequences. Under many settings, a characteristic grid-like pattern emerges;
FIG. 4 shows examples of textured 64-bit marker families. The texture prototype is shown in the first column, while the remaining columns show markers for the following sequences: all zeros, all ones, 32 consecutive zeros, and finally two random bit sequences that differ in one bit;
FIG. 5 shows screenshots of markers reconstructed from a real-time video stream together with the correctly recognized bit sequence;
FIG. 6 shows AprilTags visual markers;
FIG. 7 shows the architecture of the rendering neural network: the network receives a batch of patterns (b × k × k × 3) and background images (b × s × s × 3). The network consists of rendering, affine transformation, color conversion and blur layers. The output has shape s × s × 3;
FIG. 8 shows the localizing neural network, in which the input image passes through three layers and the network predicts 4 point maps corresponding to the positions of the corners of the visual marker;
FIG. 9 shows a family of visual markers created with rendering, localizing and classifying (recognizing) neural networks. To a person these markers look the same, yet the recognizing neural network reaches 99% recognition accuracy;
FIG. 10 shows the architecture of a system for producing a family of visual markers encoding information;
FIG. 11 shows an example of determining the position of markers (from the family shown in FIG. 9) using the trained localizing neural network. The position of each marker is given by the coordinates of its four corners. The predictions of the localizing neural network for the corners are shown as white dots.
DETAILED DESCRIPTION OF THE INVENTION
The concepts and definitions necessary for a detailed disclosure of the present technical solution are described below.
The technical solution can be implemented as a distributed computer system.
In this solution, a system means a computer system, a computer (electronic computing machine), a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems, and any other device capable of performing a given, well-defined sequence of operations (actions, instructions).
By a command processing device is meant an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).
The command processing device reads and executes machine instructions (programs) from one or more data storage devices. Storage devices may include, but are not limited to, hard disks (HDD), flash memory, ROM (read only memory), solid state drives (SSD), optical media (CD, DVD, etc.).
A program is a sequence of instructions intended for execution by a control device of a computer or a device for processing commands.
An artificial neural network (ANN) is a mathematical model, as well as its software or hardware implementation, built as a complex function that transforms input information by applying a sequence of simple operations (called layers) that depend on the trainable parameters of the neural network. The ANNs discussed below can be of any standard type (for example, a multilayer perceptron, a convolutional neural network, a recurrent neural network).
Training an artificial neural network is the process of adjusting the parameters of the layers of an artificial neural network so that the predictions of the neural network on the training data improve. The quality of the ANN predictions on the training data is measured by the so-called loss function, so the learning process corresponds to the mathematical minimization of the loss function.
The backpropagation method is an efficient method for computing the gradient of the loss function with respect to the parameters of the neural network layers, using recurrence relations and well-known analytical formulas for the partial derivatives of the individual layers. By the error backpropagation method we also mean the algorithm for training a neural network that uses this way of computing the gradient.
The learning-rate parameter of gradient-based methods for training neural networks is a parameter that controls the magnitude of the weight corrections at each iteration.
A visual marker is a physical object: a printed image placed on one of the surfaces of a physical scene and designed for efficient processing of digital photographs by machine vision algorithms. The result of processing a photograph of a marker can be either the extraction of an informational message (a bit sequence) encoded in the marker, or the determination of the position of the camera relative to the marker at the moment the digital photograph was taken. Examples of markers of the first type are QR codes; examples of markers of the second type are ArUco markers and AprilTags.
A recognizing neural network is a neural network that receives an image containing a visual marker as input and produces the informational message encoded in the marker as its result.
A localizing neural network is a neural network that receives an input image and outputs numerical information about the position of the visual marker in the image (for example, the positions of the marker's corners). As a rule, such information is sufficient to determine the position of the camera relative to the marker (provided calibration information is available).
A synthesizing neural network is a neural network that receives some numerical information, such as a bit sequence, and converts it into a color or black-and-white image.
A rendering neural network is a neural network that receives an input image and converts it into another image, such that the output image resembles a digital photograph of the printed input image.
A convolutional ANN is a type of artificial neural network that is widely used in pattern recognition, including computer vision. A characteristic feature of convolutional neural networks is the representation of data as sets of images (maps) and the use of local convolution operations that modify and combine the map data.
Let us consider in detail the method of creating a trained visual marker shown in FIG. 1. The main goal is to create a synthesizing neural network S(b; θS) with trainable parameters θS that can encode a bit sequence b = {b1, ..., bn} containing n bits. We define a visual marker (pattern) Mk(b) as an image of size (k, k, 3) corresponding to the bit sequence b. To simplify the notation in what follows, we assume that bi ∈ {−1, +1}.
For the recognition of visual markers created by the synthesizing neural network, a recognizing neural network R(I; θR) with trainable parameters θR is created and used. This neural network receives an image I containing a visual marker and outputs an estimated sequence τ = {τ1, ..., τn}. The recognizing neural network interacts with the synthesizing neural network so as to satisfy the condition sign τi = bi, i.e. the signs of the numbers output by the recognizing neural network correspond to the bits encoded by the synthesizing neural network. In particular, recognition success can be measured using a simple loss function based on a sigmoid curve:
$$L(b, \tau) = -\frac{1}{n}\sum_{i=1}^{n} \sigma(b_i \tau_i), \qquad (1)$$

where σ is the sigmoid function, so that the loss lies between −1 (perfect recognition) and 0.
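A minimal PyTorch sketch of loss (1) may clarify the computation; the function and tensor names here are illustrative assumptions, not taken from the patent:

```python
import torch

def recognition_loss(bits: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # bits: target bit sequences with values in {-1, +1}, shape (B, n)
    # scores: raw outputs tau of the recognizing network, shape (B, n)
    # Each term -sigmoid(b_i * tau_i) approaches -1 for a confident correct
    # prediction and 0 for a confident error, so the mean lies in (-1, 0).
    return -torch.sigmoid(bits * scores).mean()
```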
In real life, marker recognition algorithms do not receive the marker images directly. Instead, the visual markers are embedded in the environment (for example, by printing them and placing them on environmental objects, or by showing them on electronic displays), after which their images are captured by some camera operated by a human or a robot.
Therefore, during the training of the recognizing and synthesizing neural networks, the transformation between the visual marker created by the synthesizing neural network and the image of this marker is simulated using a special feedforward network (the rendering neural network) T(M; φ), where the parameters φ of the rendering network are sampled during training and correspond to the variability of the background, variability of lighting, oblique perspective, the blur kernel, color changes / white balance of the camera, etc. During training, φ is sampled from some distribution Φ, which should model the variability of the above effects under the conditions in which the visual markers are intended to be used.
In the case when the only goal is reliable marker recognition, the learning process can be implemented as the minimization of the following functional:
$$\min_{\theta_S,\, \theta_R} \; \mathbb{E}_{b \sim U(n),\; \varphi \sim \Phi}\; L\bigl(b,\, R(T(S(b; \theta_S); \varphi); \theta_R)\bigr) \qquad (2)$$
Here the bit sequence b is sampled uniformly from U(n) = {−1, +1}^n and passed through the synthesizing, rendering and recognizing neural networks, and the loss function (1) is used to measure recognition success. The parameters of the synthesizing neural network and of the recognizing neural network are optimized so as to minimize the expectation of the loss. The minimization of expression (2) can then be performed using a stochastic gradient descent algorithm, such as ADAM [1]. Each iteration of the algorithm samples a mini-batch of different bit sequences together with a set of random parameters of the rendering network layers, and updates the parameters of the synthesizing and recognizing neural networks so as to decrease the loss (1) on these samples.
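The following PyTorch sketch illustrates one way such a training loop could look; the network architectures, sizes, and the noise model standing in for T are assumptions made for the example, not the patent's actual configuration:

```python
import torch
import torch.nn as nn

n, k, batch = 32, 16, 64                                   # bits, marker side, batch size

S = nn.Sequential(nn.Linear(n, 3 * k * k), nn.Sigmoid())   # toy synthesizer
R = nn.Sequential(nn.Flatten(), nn.Linear(3 * k * k, n))   # toy recognizer

def T(markers: torch.Tensor) -> torch.Tensor:
    # Stand-in rendering network: additive noise plays the role of the
    # nuisance parameters phi; a real T would warp and composite (see the
    # render() sketch further below).
    return (markers + 0.1 * torch.randn_like(markers)).clamp(0, 1)

opt = torch.optim.Adam(list(S.parameters()) + list(R.parameters()), lr=1e-3)

for step in range(1000):
    # sample b uniformly from the vertices of the Boolean cube {-1, +1}^n
    b = torch.randint(0, 2, (batch, n)).float() * 2 - 1
    markers = S(b).view(batch, 3, k, k)      # synthesize marker images
    photos = T(markers)                      # simulate photographing
    tau = R(photos)                          # per-bit scores
    loss = -torch.sigmoid(b * tau).mean()    # loss (1)
    opt.zero_grad(); loss.backward(); opt.step()
```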
In some embodiments, a localizing neural network is also added to the learning process (Fig. 8); it detects instances of markers in the video stream and determines their position in the frame (for example, finds the coordinates of their corners). The coordinates are converted into binary maps whose dimensions equal the shape of the input images. A binary map is zero everywhere except at the corner locations, where its value is one. The localizing network is trained to predict these binary maps, which can then be used to rectify the marker before feeding it to the recognizing neural network (Fig. 10), or to estimate the position of the camera relative to the marker in applications where such an estimate is needed. When such a localizing neural network is added to training, the synthesizing neural network adapts to create markers that stand out from the background and have easily identifiable corners.
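A minimal sketch of such a localizer is given below; the text specifies three layers and four corner point maps, while the channel counts and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class Localizer(nn.Module):
    # Predicts 4 point maps, one per marker corner, at the input resolution.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),   # one output map per corner
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> corner maps: (B, 4, H, W); trained against
        # the binary corner maps, e.g. with nn.BCEWithLogitsLoss.
        return self.net(image)
```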
In some embodiments, a single marker or a small number of markers is created, substantially smaller than the number of bit sequences of any substantial length. In such embodiments, a synthesizing network is not used; in the optimization, the parameters of the synthesizing network are replaced directly by the pixel values of the markers (or marker). In these cases, as a rule, a localizing neural network is used, and the recognizing neural network is either implemented as a classifier over a number of classes equal to the number of markers, or is not used at all (in the single-marker variant). An example of markers trained in this embodiment is shown in FIG. 9.
As noted above, the components of the architecture, namely the synthesizing neural network, the rendering neural network, the recognizing neural network and the localizing neural network, can be implemented, for example, as feedforward networks or as other architectures that can be trained with the error backpropagation method. The recognizing network can be implemented as a convolutional neural network [2] with n outputs. The synthesizing neural network may also have a convolutional architecture (being a convolutional neural network), and so may the localizing neural network.
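For instance, a convolutional recognizing network with n outputs could be sketched as follows; the depths and kernel sizes are illustrative assumptions:

```python
import torch.nn as nn

def make_recognizer(n_bits: int) -> nn.Module:
    # Convolutional recognizer producing one raw score tau_i per encoded bit.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, n_bits),
    )
```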
To implement the rendering neural network T(M; φ) shown in FIG. 2, custom layers are required. The rendering neural network is implemented as a chain of layers, each of which introduces some "interfering" transformation. A special layer is also implemented that superimposes the input image (pattern) over a background image taken from a random set of images simulating the appearance of the surfaces onto which the trained markers may be applied in use. To implement geometric distortions, a spatial transformer layer is used [5]. Color or intensity changes can be realized by differentiable elementwise transformations (linear, multiplicative, gamma conversion). The interfering transformation layers can be applied sequentially, forming a rendering neural network that can simulate complex geometric and photometric transformations (Fig. 2).
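A sketch of such a chain in PyTorch is given below: pasting the pattern onto a random background, a random affine warp through a spatial transformer, and a crude blur. The parameter ranges and the box blur are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def render(marker: torch.Tensor, background: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # marker: (B, 3, k, k); background: (B, 3, s, s) with s >= k
    B, _, k, _ = marker.shape
    s = background.shape[-1]
    # superimpose the marker over the centre of the background image
    canvas = background.clone()
    off = (s - k) // 2
    canvas[:, :, off:off + k, off:off + k] = marker
    # random affine warp around the identity (cf. [1,0,0,0,1,0] + N(0, sigma))
    theta = torch.eye(2, 3).repeat(B, 1, 1) + sigma * torch.randn(B, 2, 3)
    grid = F.affine_grid(theta, list(canvas.shape), align_corners=False)
    warped = F.grid_sample(canvas, grid, align_corners=False)
    # crude blur: a depthwise 3x3 box filter; every step stays differentiable
    box = torch.full((3, 1, 3, 3), 1.0 / 9)
    return F.conv2d(warped, box, padding=1, groups=3)
```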
Interestingly, under variable conditions, optimizing objective (2) leads to markers that have a consistent and interesting visual texture (Fig. 3). Despite this visual "interestingness", it is desirable to control the appearance of the resulting markers more specifically, for example, through the use of sample images.
For such control, in some embodiments, the training objective (2) is supplemented with a loss function that measures the difference between the textures of the obtained markers and the texture of a sample image [6]. We briefly describe this loss function, introduced in [6]. Consider a feedforward network C(M; γ) that computes the output of the t-th convolutional layer of a network trained for large-scale image classification, such as VGGNet [7]. For an image M, the output of the network C(M; γ) contains k two-dimensional channels (maps). The network C uses parameters γ that are pre-trained on a large dataset and are not part of this learning process. The style of an image M is then characterized by the following Gram matrix G(M; γ) of size k × k, each element of which is defined as:
$$G_{ij}(M; \gamma) = \langle C_i(M; \gamma),\, C_j(M; \gamma) \rangle, \qquad (3)$$
where Ci and Cj are the i-th and j-th maps, and the scalar product is taken over all spatial positions. Given a texture prototype M0, the training objective can be supplemented with the following expression:
$$L_{\text{style}} = \bigl\| G(S(b; \theta_S); \gamma) - G(M_0; \gamma) \bigr\|_F^2 \qquad (4)$$
The inclusion of expression (4) makes the markers S(b; θS) created by the synthesizing neural network visually similar to the texture instances defined by the prototype M0 [6].
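A sketch of this texture term in PyTorch, using a truncated pre-trained VGG-16 as the fixed network C(M; γ); the choice of layer (relu2_2) is an assumption, and the mechanism follows [6]:

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

C = vgg16(weights=VGG16_Weights.DEFAULT).features[:9].eval()  # up to relu2_2
for p in C.parameters():
    p.requires_grad_(False)        # gamma is pre-trained and frozen

def gram(x: torch.Tensor) -> torch.Tensor:
    # x: (B, C, H, W) feature maps -> (B, C, C) Gram matrices, eq. (3);
    # the scalar product runs over all spatial positions.
    B, ch, H, W = x.shape
    f = x.view(B, ch, H * W)
    return f @ f.transpose(1, 2) / (H * W)

def style_loss(markers: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    # Squared Frobenius distance between Gram matrices, eq. (4).
    return (gram(C(markers)) - gram(C(prototype))).pow(2).mean()
```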
For longer bit sequences, some embodiments use an error-correcting coding method. Since the recognizing neural network returns a confidence value for each bit of the reconstructed signal, the claimed technical solution is suitable for any probabilistic error-correcting coding scheme.
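As an illustration of how soft per-bit scores combine with error correction, here is a sketch using a simple repetition code as a stand-in for an arbitrary probabilistic code; the synthesizer would then encode the expanded n·r-bit codeword:

```python
import torch

def encode_repetition(bits: torch.Tensor, r: int = 3) -> torch.Tensor:
    # bits: (B, n) in {-1, +1} -> (B, n*r) codeword, each bit repeated r times
    return bits.repeat_interleave(r, dim=1)

def decode_repetition(scores: torch.Tensor, r: int = 3) -> torch.Tensor:
    # scores: (B, n*r) raw recognizer outputs -> (B, n) decoded bits.
    # Summing the scores within each group is soft-decision majority voting.
    B, nr = scores.shape
    return scores.view(B, nr // r, r).sum(dim=2).sign()
```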
In some embodiments, for the experiments without a texture loss, a simple synthesizing neural network is used, which consists of a single linear layer (with a 3m² × n matrix and a bias vector) followed by an element-wise sigmoid function. In other embodiments, the synthesizing neural network has a convolutional form, taking the binary code as input and transforming it with one or more multiplicative layers and sets of convolutional layers. In the latter case, convergence during training benefits greatly from adding batch normalization [8] after each convolutional layer.
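A sketch of this simplest synthesizer as a PyTorch module (the class name is illustrative):

```python
import torch.nn as nn

class LinearSynthesizer(nn.Module):
    # One linear layer (a 3*m^2 x n weight matrix plus a bias vector)
    # followed by an element-wise sigmoid, as described above.
    def __init__(self, n_bits: int, m: int):
        super().__init__()
        self.m = m
        self.fc = nn.Linear(n_bits, 3 * m * m)

    def forward(self, bits):
        # bits: (B, n) in {-1, +1} -> marker images in [0, 1], shape (B, 3, m, m)
        return self.fc(bits).sigmoid().view(-1, 3, self.m, self.m)
```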
In some embodiments, the parameters of the rendering network can be chosen as follows. The spatial transformation is performed as an affine transformation whose 6 affine parameters are sampled from [1, 0, 0, 0, 1, 0] + N(0, σ) (assuming the origin at the center of the marker). An example for σ = 1 is shown in FIG. 2. Given an image x, the color conversion layer can then be implemented, for example, as the elementwise affine transform
$$x \mapsto (1 + \alpha)\,x + \beta, \qquad (5)$$

where the parameters α and β are sampled from the uniform distribution U[−δ, δ]. Since it has been observed that printed visual markers tend to have reduced contrast, a contrast reduction layer is added that converts each value x to k·x + (1 − k)·0.5 for a random k.
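The photometric layers can be sketched as follows; the affine form of the colour change mirrors the elementwise transform (5) above, and all operations remain differentiable in x:

```python
import torch

def color_jitter(x: torch.Tensor, delta: float = 0.2) -> torch.Tensor:
    # Elementwise affine colour change, per-image parameters from U[-delta, delta].
    B = x.shape[0]
    a = 1 + (torch.rand(B, 1, 1, 1) * 2 - 1) * delta   # multiplicative term
    c = (torch.rand(B, 1, 1, 1) * 2 - 1) * delta       # additive term
    return (a * x + c).clamp(0, 1)

def reduce_contrast(x: torch.Tensor) -> torch.Tensor:
    # k*x + (1 - k)*0.5 for a random k, modelling the contrast loss of printing.
    k = torch.rand(x.shape[0], 1, 1, 1)
    return k * x + (1 - k) * 0.5
```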
In some embodiments of the technical solution, the recognizing and localizing neural networks may be convolutional.
From the results of this technical solution shown in FIG. 4 it can be seen that the technical solution successfully recovers the encoded signals with a small number of errors. The number of errors can be reduced further by applying a set (ensemble) of recognizing neural networks, or by applying the recognizing neural network to several distorted versions of the image (test-time data augmentation).
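Test-time augmentation of this kind can be sketched in a few lines; `distort` stands for any mild random perturbation (e.g. a small-sigma version of the render sketch above), which is an assumption of this example:

```python
import torch

def recognize_tta(recognizer, image: torch.Tensor, distort, n_aug: int = 8) -> torch.Tensor:
    # Average the per-bit scores over several randomly distorted copies of
    # the image, then take the sign to decode the bits.
    scores = torch.stack([recognizer(distort(image)) for _ in range(n_aug)])
    return scores.mean(dim=0).sign()
```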
In some embodiments, to improve accuracy, the marker may be aligned with a predefined square (shown as part of the user interface in FIG. 5). As can be seen, the results deteriorate as the alignment error increases.
USED INFORMATION SOURCES
1. D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
2. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
3. A. Dosovitskiy, J.T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
4. M.D. Zeiler, G.W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. Int. Conf. on Computer Vision (ICCV), pp. 2018-2025, 2011.
5. M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems (NIPS), pp. 2017-2025, 2015.
6. L. Gatys, A.S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 262-270, 2015.
7. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
8. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conference on Machine Learning (ICML), pp. 448-456, 2015.
9. E. Olson. AprilTag: A robust and flexible visual fiducial system. IEEE International Conference on Robotics and Automation (ICRA), pp. 3400-3407, 2011.

Claims (42)

1. A method of producing a family of visual markers encoding information, comprising the following steps:
- forming a synthesizing neural network that translates a sequence of bits into images of visual markers;
- forming a rendering neural network that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a recognizing neural network that translates images containing visual markers into a sequence of bits;
- jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correct recognition of random bit sequences;
- synthesizing visual markers by passing bit sequences through the trained synthesizing neural network;
- receiving a set of images of visual markers from a video source;
- extracting the encoded bit sequences from the obtained set of images of visual markers by means of the recognizing neural network.
2. The method according to claim 1, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
3. The method according to claim 1, characterized in that the synthesizing neural network consists of one linear layer followed by an element-wise sigmoid function.
4. The method according to claim 1, characterized in that the synthesizing and/or recognizing neural network has a convolutional form.
5. The method according to claim 1, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
6. The method according to claim 1, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
7. The method according to claim 1, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
8. The method according to claim 1, characterized in that during the learning process the bit sequences are sampled uniformly from the set of vertices of the Boolean cube.
9. The method according to claim 1, characterized in that the synthesizing, rendering and recognizing neural networks are feedforward networks.
10. A method of producing a family of visual markers encoding information, comprising the following steps:
- creating variables corresponding to the pixel values of the created visual markers;
- forming a rendering neural network that converts the pixel values of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a recognizing neural network that translates images containing visual markers into a sequence of bits;
- jointly training the synthesizing, rendering and recognizing neural networks by minimizing a loss function that reflects the probability of correct recognition of random bit sequences;
- synthesizing visual markers by creating raster images with the pixel values found as a result of training;
- receiving a set of images of visual markers from a video source;
- retrieving marker class numbers from the resulting set of visual marker images.
11. The method according to claim 10, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
12. The method according to claim 10, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
13. The method according to claim 10, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
14. The method according to claim 10, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
15. The method according to claim 10, characterized in that the rendering and recognizing neural networks are feedforward networks.
16. A method of producing a family of visual markers encoding information, comprising the following steps:
- creating variables corresponding to the pixel values of the created visual marker;
- forming a rendering neural network that converts input images of visual markers into images containing visual markers by means of geometric and photometric transformations;
- forming a localizing neural network that translates images containing the marker into marker position parameters;
- jointly training the synthesizing, rendering and localizing neural networks by minimizing a loss function that reflects the probability of correctly finding the marker position in the image;
- synthesizing visual markers by creating raster images with the pixel values found as a result of training;
- receiving a set of images of visual markers from a video source;
- extracting the encoded bit sequences from the obtained set of images of visual markers by means of a recognizing neural network.
17. The method according to claim 16, characterized in that the rendering neural network converts input images of visual markers into images containing visual markers placed in the center of the background image.
18. The method according to claim 16, characterized in that during the learning process a term characterizing the aesthetic acceptability of the markers is added to the optimization functional.
19. The method according to claim 16, characterized in that during the learning process a term is added to the optimization functional that measures the correspondence of the markers to a visual style specified in the form of a sample image.
20. The method according to claim 16, characterized in that the minimization of the loss function is performed using a stochastic gradient descent algorithm.
21. The method according to claim 16, characterized in that the localizing, rendering and recognizing neural networks are feedforward networks.
RU2016122082A 2016-06-03 2016-06-03 Trained visual markers and the method of their production RU2665273C2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production
PCT/RU2017/050048 WO2017209660A1 (en) 2016-06-03 2017-06-05 Learnable visual markers and method of their production

Publications (3)

Publication Number Publication Date
RU2016122082A RU2016122082A (en) 2017-12-07
RU2016122082A3 RU2016122082A3 (en) 2018-07-13
RU2665273C2 true RU2665273C2 (en) 2018-08-28

Family

ID=60478901

Family Applications (1)

Application Number Title Priority Date Filing Date
RU2016122082A RU2665273C2 (en) 2016-06-03 2016-06-03 Trained visual markers and the method of their production

Country Status (2)

Country Link
RU (1) RU2665273C2 (en)
WO (1) WO2017209660A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2721190C1 (en) * 2018-12-25 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Training neural networks using loss functions reflecting relationships between neighbouring tokens
WO2021038227A1 (en) * 2019-08-27 2021-03-04 Zeta Motion Ltd Determining object pose from image data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5373566A (en) * 1992-12-24 1994-12-13 Motorola, Inc. Neural network-based diacritical marker recognition system and method
US20100251169A1 (en) * 2009-03-31 2010-09-30 Microsoft Corporation Automatic generation of markers based on social interaction
RU139520U1 (en) * 2013-10-28 2014-04-20 Арташес Валерьевич Икономов Device for creating a graphic code
US20140355861A1 (en) * 2011-08-25 2014-12-04 Cornell University Retinal encoder for machine vision
US20150161522A1 (en) * 2013-12-06 2015-06-11 International Business Machines Corporation Method and system for joint training of hybrid neural networks for acoustic modeling in automatic speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003140B2 (en) * 2003-11-13 2006-02-21 Iq Biometrix System and method of searching for image data in a storage medium
JP4479478B2 (en) * 2004-11-22 2010-06-09 株式会社日立製作所 Pattern recognition method and apparatus
US20090003646A1 (en) * 2007-06-29 2009-01-01 The Hong Kong University Of Science And Technology Lossless visible watermarking
US8370759B2 (en) * 2008-09-29 2013-02-05 Ancestry.com Operations Inc Visualizing, creating and editing blending modes methods and systems
US20160098633A1 (en) * 2014-10-02 2016-04-07 Nec Laboratories America, Inc. Deep learning model for structured outputs with high-order interaction

Also Published As

Publication number Publication date
RU2016122082A3 (en) 2018-07-13
WO2017209660A1 (en) 2017-12-07
RU2016122082A (en) 2017-12-07
