RU2803287C1

RU2803287C1 - Method for obtaining a set of objects of a three-dimensional scene

Info

Publication number: RU2803287C1
Application number: RU2022117661A
Authority: RU
Inventors: Андрей Владимирович Новиков; Владимир Николаевич Герасимов; Роман Александрович Горбачев; Никита Евгеньевич Швиндт; Владимир Иванович Новиков; Андрей Евгеньевич Ефременко; Дмитрий Леонидович Шишков; Михаил Нилович Зарипов; Филипп Александрович Козин; Алексей Михайлович Старостенко
Original assignee: Общество С Ограниченной Ответственностью "Нейроассистивные Технологии"
Filing date: 2019-12-10
Publication date: 2023-09-12

Abstract

FIELD: systems and methods for recognizing objects in a three-dimensional scene.

SUBSTANCE: the invention relates to determining the true dimensions of objects in a three-dimensional scene from its two-dimensional images. A method for obtaining a set of objects of a three-dimensional scene includes simultaneously acquiring frames from the left and right cameras of a stereo camera, generating a disparity map for each image point with pixel coordinates, determining the true coordinates of that point, generating a depth map of the points in true coordinates, generating a two-dimensional grey-scale image in which the brightness of a point depends on the true distance to the point, and detecting and identifying objects in that image by one of the Viola-Jones method, the SSD-mobilenet neural network method and the Mask R-CNN neural network method, thereby obtaining a set of objects of the three-dimensional scene. The disparity map is generated by semi-global stereo matching. The true coordinates of a point are determined from the focal lengths of the cameras of the stereo camera and the distance between them. The brightness of a point is taken to be zero if the true distance to it is outside the specified range.

EFFECT: increased accuracy of recognition of objects with complex and random colouring, transparent objects, and complexly coloured objects against a complexly coloured background, including patterns and colourings that were not and could not have been in the training set.

10 cl, 1 dwg

Description

The invention relates to systems and methods for recognizing objects in a three-dimensional scene, in particular to determining the true dimensions of objects in a three-dimensional scene from its two-dimensional images. It can be used in machine vision systems in robotics and other fields of technology, including object manipulation systems intended to assist users with limited mobility.

Many methods are known for constructing three-dimensional scenes, in particular for obtaining three-dimensional information from multiple two-dimensional images of a scene. This problem is one of the most difficult in computer image analysis and has so far been solved only for a number of special cases. Solving it requires first constructing a disparity map.

A disparity map is a visual representation of the shifts between corresponding fragments of the images from the left and right cameras (the closer a scene point is, the larger the shift). This disparity can be represented as a numeric array whose elements give the difference, in pixels, between the positions of matching points in the right and left images, referenced to one of the images. Rectification of the two views (aligning the right and left images horizontally) reduces the dimensionality of the array to two. For ease of perception, this matrix is displayed graphically: the greater the disparity between the images, the lighter the corresponding pixels.

A number of algorithms are used to construct disparity maps. They fall broadly into three classes: local, global, and semi-global (partially global).

Local algorithms compute the disparity for each pixel separately, using information only from a narrow neighborhood of that pixel. They mostly use square or rectangular windows of fixed size and compare, under some metric, sums of absolute brightness values inside these windows. Such algorithms are fast and computationally efficient. However, acceptable quality is achieved only where the pixel intensity function is smooth. At object boundaries, where the intensity function is discontinuous, the algorithms make a significant number of errors. Further development led to multi-window algorithms and windows with adaptive structure, which improved the quality of the disparity computation. The price was a significant increase in running time, which often makes real-time image analysis impossible.
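The fixed-window comparison described above can be illustrated with a minimal sketch. It assumes the sum of absolute differences (SAD) as the window metric, which is only one possible choice; the function name `sad_block_matching` and its parameters are hypothetical and are not taken from the cited methods.

```python
import numpy as np


def sad_block_matching(left, right, max_disp, half_window=1):
    """Local stereo matching: per pixel, pick the disparity with the lowest SAD cost.

    Assumes rectified grayscale images of equal shape (illustrative sketch only).
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    k = 2 * half_window + 1
    costs = np.empty((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        # Compare left pixel (x, y) with right pixel (x - d, y);
        # columns with x < d have no counterpart and get an infinite cost.
        diff = np.full((h, w), np.inf)
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        # Sum the differences over a k x k window around each pixel.
        padded = np.pad(diff, half_window, mode="edge")
        window_sum = np.zeros((h, w))
        for dy in range(k):
            for dx in range(k):
                window_sum += padded[dy:dy + h, dx:dx + w]
        costs[d] = window_sum
    # Winner-takes-all: each pixel is decided independently of all others.
    return np.argmin(costs, axis=0)
```

Because each pixel is decided from its own window alone, the sketch is fast but has no mechanism for resolving ambiguous windows or object boundaries, which is the weakness noted above.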

Global algorithms compute the disparity for the entire image at once, so that every pixel influences the solution at all other pixels. Global algorithms differ in the form of the unary and pairwise potentials, in the minimization algorithm, and in the graph structure. Although global algorithms generally outperform local ones in accuracy, the resulting disparity maps are not free from errors caused by the simplifications built into the formula for the energy functional. Global algorithms are also slower.

Semi-global, or partially global, methods are a reasonable compromise between fast but inaccurate local methods and more accurate but slow global ones, combining the strengths of both. The idea is to solve for each pixel independently while taking into account the influence of all the other pixels of the image, or of a subset of them that is not restricted to a local neighborhood.

One of the best-known implementations of partially global stereo matching is the Semi-Global Matching method (hereinafter also SGM), described, for example, in Heiko Hirschmuller. Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, June 20-26, 2005. The graph in the algorithm contains no cycles and is a tree of fixed shape: a set of rays emanating from a single point. Such a graph is built for each pixel, and several passes are then made along all the rays emanating from that pixel. The global minimum is computed by dynamic programming.
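The dynamic-programming pass along a single ray can be sketched as follows. This is a simplified illustration, not the full method: it aggregates costs along one direction only (left to right), whereas SGM sums such aggregated costs over several directions before choosing the disparity. The penalty values `p1` and `p2` and the function name are assumptions made for the example.

```python
import numpy as np


def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate a matching-cost volume of shape (h, w, n_disp) along one ray.

    p1 penalizes a disparity change of one level between neighboring pixels,
    p2 penalizes any larger jump (both values are illustrative).
    """
    cost = np.asarray(cost, dtype=np.float64)
    h, w, n_disp = cost.shape
    agg = np.empty_like(cost)
    agg[:, 0, :] = cost[:, 0, :]
    for x in range(1, w):
        prev = agg[:, x - 1, :]                     # costs at the previous pixel
        prev_min = prev.min(axis=1, keepdims=True)  # cheapest previous disparity
        # Previous costs at d - 1 and d + 1; +inf where no such level exists.
        pad = np.pad(prev, ((0, 0), (1, 1)), mode="constant",
                     constant_values=np.inf)
        from_below = pad[:, :-2] + p1
        from_above = pad[:, 2:] + p1
        best = np.minimum(np.minimum(prev, prev_min + p2),
                          np.minimum(from_below, from_above))
        # Subtracting prev_min keeps the values from growing along the ray.
        agg[:, x, :] = cost[:, x, :] + best - prev_min
    return agg
```

In this scheme a pixel whose own matching cost is ambiguous tends to inherit the disparity of its predecessor on the ray, which is how information from outside the local neighborhood reaches each pixel.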

The SGM method is considered the most practical method for use in real-time systems. It provides both a high-quality depth map and, compared with most other algorithms, low demands on computing power and memory.

The disparity map is constructed as follows:

1) two images are obtained from the left and right mono cameras of the stereo camera;

2) the SGM method is applied to the resulting pair of images (the stereo pair): for each point with pixel coordinates (x, y) in the left image, the corresponding point in the right image is found, yielding the disparity d(x, y), which specifies by how many pixels the point lies further to the left in the right image than in the left image; that is, in the right image the point has coordinates (x - d, y). Assigning to each point (x, y) of the left image its disparity d gives the disparity map.

Next, knowing the coordinates (x, y) of a point and its disparity d, the true coordinates (X, Y, Z) of the point in space are obtained from the following formulas:

X = (x⋅Q00 + Q03)/W,

Y = (y⋅Q11 + Q13)/W,

Z = Q23/W,

where W = d⋅Q32 + Q33, and Q00, Q03, Q11, Q13, Q23 are constants computed from the focal lengths of the mono cameras of the stereo camera and from the distance between the mono cameras. These constants are computed once and do not change afterwards.
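As a worked illustration of these formulas, the sketch below fills the constants for an idealized rectified stereo pair and reprojects one point. The layout of the matrix (unit Q00 and Q11, Q03 and Q13 equal to the negated principal-point coordinates, Q23 equal to the focal length in pixels, Q32 equal to the reciprocal of the baseline, Q33 equal to zero) is an assumption that follows a common reprojection-matrix convention; the text itself does not give the values. The helper names `make_q` and `reproject_point` and all numeric values are hypothetical.

```python
import numpy as np


def make_q(focal_px, baseline_m, cx, cy):
    """Build an assumed 4x4 reprojection matrix for an ideal rectified pair."""
    q = np.zeros((4, 4))
    q[0, 0] = 1.0
    q[0, 3] = -cx               # Q03: shift x to the principal point
    q[1, 1] = 1.0
    q[1, 3] = -cy               # Q13: shift y to the principal point
    q[2, 3] = focal_px          # Q23: focal length in pixels
    q[3, 2] = 1.0 / baseline_m  # Q32: reciprocal of the camera spacing
    q[3, 3] = 0.0               # Q33: zero when both principal points coincide
    return q


def reproject_point(x, y, d, q):
    """Apply X = (x*Q00 + Q03)/W, Y = (y*Q11 + Q13)/W, Z = Q23/W."""
    w = d * q[3, 2] + q[3, 3]
    return ((x * q[0, 0] + q[0, 3]) / w,
            (y * q[1, 1] + q[1, 3]) / w,
            q[2, 3] / w)


# Assumed example values: 700 px focal length, 0.1 m baseline, 640x480 image.
q = make_q(focal_px=700.0, baseline_m=0.1, cx=320.0, cy=240.0)
X, Y, Z = reproject_point(420.0, 240.0, 35.0, q)
print(X, Y, Z)
```

Under these assumptions Z reduces to the familiar relation Z = f⋅B/d, so for f = 700 px, B = 0.1 m and d = 35 px the point lies about 2 m from the camera.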

An example of the use of the SGM method is the method of determining a depth map from pairs of stereo images disclosed in US patent US 10223802, in which the disparity for at least one pixel of one of the stereo image pairs is determined from a discrete number of predetermined disparity values that are distributed over the entire predetermined range of disparity values, with a distribution that has at least two different intervals between different adjacent disparity values. In another embodiment, the method for determining a depth map comprises the steps of obtaining one pair of stereo images; providing in the estimator a predetermined set of discrete disparity values that span a range of disparity values, wherein the intervals between successive disparity values include first intervals and second intervals, the first intervals being smaller than the second intervals; determining a disparity for a pixel of the reference image of at least one pair of stereo images by selecting it from among the discrete disparity values in the predetermined set; and determining a depth value for that pixel by calculation from the disparity determined for it.

To further reduce computational cost, work has been carried out on optimizing the SGM method. For example, US patent US 9704253 proposes determining disparity at doubled resolution for objects far from the camera plane, and at normal resolution for objects close to the camera plane. This yields a more accurate depth map, including for objects far from the camera plane.

Machine learning methods are used to identify an object in the two-dimensional depth map generated by the machine vision module.

A known object recognition method is the one developed by P. Viola and M.J. Jones (P. Viola, M.J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision 57 (2), 137-154, 2004) (hereinafter also the Viola-Jones method), also known as Haar cascades, which offers relatively high speed and comparatively low computing power requirements. Its disadvantage is increased sensitivity to the training data, which can make it impossible to identify an object located in conditions that differ strongly from those of the training set (for example, a dimly lit scene or aperiodic interference in the form of shadows).

Another known approach to object recognition is the use of neural networks. For example, Chinese patent application CN 109398688 discloses the use of a neural network with the SSD-mobilenet architecture for real-time recognition of an object, with the resulting data passed to the manipulator of a vehicle. The publication Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick. Mask R-CNN (published 24.01.2018, available at https://arxiv.org/pdf/1703.06870.pdf) proposes a neural network with the Mask Region-Based Convolutional Neural Network architecture (abbreviated Mask R-CNN), which provides high object recognition accuracy even in unfavorable scene conditions. However, compared with the Viola-Jones method and the SSD-mobilenet network, the Mask R-CNN network requires, according to its developers, roughly 20 times more computing time at equal computing power.

The methods described above are widely used in very different fields of technology. One such field is robotic systems designed to assist users, including users with low or limited mobility.

For example, the invention of US patent application US 2007016425 aims to improve the quality of life of a user suffering from paralysis. It consists in real-time recognition of a three-dimensional scene captured by the stereoscopic module of a user assistance system, with subsequent transfer of the resulting data to the manipulation module of that system. Recognition includes identifying an object located within the scene. The assistance system contains a manipulation module, a machine vision module, and a data processing and storage module. The machine vision module, which includes a module for tracking the position of the user's eyes, captures the scene in which the presumed object of the user's interest is located. The data obtained when capturing the scene are processed and passed to the manipulation module. The manipulation module may include at least one manipulator for manipulating the object.

The use of a machine vision system to assist a visually impaired user is disclosed in US patent application US 2007016425. It proposes recognizing the position of objects in space and then converting these data into signals that provide tactile sensations to the user, allowing the user to sense the space and the arrangement of objects in it. A stereo camera is used as the means of determining distance, and its images are used to obtain a depth map. For this purpose a disparity map is built and then converted into a depth map. Data from the depth map are then sent to a tactile interface to generate tactile sensations for the patient. A disadvantage of this known solution is the lack of means and methods for recognizing the objects located in the space. Another disadvantage is the recommended algorithm for computing the disparity and depth maps, which requires considerable computing power.

The methods described above, as well as other known object recognition methods, have disadvantages. First, powerful computing systems are required to improve accuracy and efficiency. Second, known methods either produce large errors or do not work at all with complex scene objects, such as objects with complex and random coloring, transparent objects, complexly colored objects against a complexly colored background, and the like.

There is therefore a need for a method of recognizing objects in a three-dimensional scene that works reliably with complex objects such as those listed above without requiring exceptional computing resources.

The technical result of the claimed invention is increased accuracy of recognition of objects with complex and random coloring, transparent objects, and complexly colored objects against a complexly colored background, including patterns and colorings that were not and could not have been in the training set.

The stated problem is solved, and the stated technical result is achieved, by the claimed method for obtaining a set of objects of a three-dimensional scene. In this method, frames are acquired simultaneously from the left camera and the right camera of a stereo camera; a disparity map is generated for each image point with pixel coordinates by semi-global stereo matching; the true coordinates of each point are determined from the disparity map; a depth map of the points in true coordinates is generated; a two-dimensional grayscale image is generated in which the brightness of a point depends on the true distance to the point; and objects are detected and identified in the resulting two-dimensional grayscale image by one of the Viola-Jones method, the SSD-mobilenet neural network method and the Mask R-CNN neural network method, yielding a set of objects of the three-dimensional scene. The disparity map is generated by semi-global stereo matching. The true coordinates of a point are determined from the focal lengths of the cameras of the stereo camera and the distance between them. The brightness of a point is taken to be zero if the true distance to it is outside the specified range.

In particular, the claimed method for obtaining a set of objects of a three-dimensional scene includes the following steps.

A left frame from the left camera and a right frame from the right camera are acquired substantially simultaneously while the scene is being captured.

A disparity map is generated by semi-global stereo matching, giving the disparity d(x, y) for each image point with pixel coordinates (x, y).

The true coordinates (X, Y, Z) of the point with pixel coordinates (x, y) are determined from the formulas:

X = (x⋅Q00 + Q03)/W,

Y = (y⋅Q11 + Q13)/W,

Z = Q23/W,

where W = d⋅Q32 + Q33, and Q00, Q03, Q11, Q13, Q23 are constants determined by the focal lengths of the left camera and the right camera and by the distance between the left camera and the right camera.

A depth map D(x, y) is generated, where D is the true distance from the left camera or the right camera to the point with pixel coordinates (x, y).
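A whole-image version of this step might look like the sketch below. Two points are assumptions rather than statements of the method: D is taken to be the Euclidean distance from the camera to the point (X, Y, Z), and pixels whose W is not positive (no valid disparity) are assigned an infinite distance. The function name `depth_map_from_disparity` and the example matrix values are hypothetical.

```python
import numpy as np


def depth_map_from_disparity(disparity, q):
    """Convert a disparity map into a map of distances D(x, y).

    q is a 4x4 array holding the constants Q00, Q03, Q11, Q13, Q23, Q32, Q33.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)  # pixel coordinate grids
    w_hom = disparity * q[3, 2] + q[3, 3]
    valid = w_hom > 0
    safe_w = np.where(valid, w_hom, 1.0)            # avoid division by zero
    x_true = (xs * q[0, 0] + q[0, 3]) / safe_w
    y_true = (ys * q[1, 1] + q[1, 3]) / safe_w
    z_true = q[2, 3] / safe_w
    distance = np.sqrt(x_true ** 2 + y_true ** 2 + z_true ** 2)
    # Assumption: pixels without a usable disparity are treated as infinitely far.
    return np.where(valid, distance, np.inf)


# Assumed constants: 700 px focal length, 0.1 m baseline, principal point (0, 0).
q = np.zeros((4, 4))
q[0, 0] = q[1, 1] = 1.0
q[2, 3] = 700.0
q[3, 2] = 1.0 / 0.1
depth = depth_map_from_disparity([[35.0, 0.0], [70.0, 35.0]], q)
print(depth)
```

If D is instead understood as the Z coordinate alone, the last step reduces to returning `z_true`; the text leaves this choice open.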

A two-dimensional grayscale image is generated in which the brightness Φ(x, y) of the point with pixel coordinates (x, y) is determined by the formulas:

Φ(x, y) = 0 if D(x, y) < Dmin,

Φ(x, y) = 255 if D(x, y) > Dmax,

Φ(x, y) = 255⋅(D(x, y) - Dmin)/(Dmax - Dmin) otherwise,

where Dmin and Dmax are the specified minimum and maximum depth values, respectively, chosen according to the context in which the claimed method is applied. For example, if the stereo camera serves a manipulator for gripping and moving objects, the manipulator's working area has a diameter of 3 m, and the camera is located 1.5 m from the center of the working area, one can take Dmin = 0.2 m, on the assumption that no manipulation is planned in the zone nearer to the stereo camera, and Dmax = 5 m, so that the manipulator's working area and its surroundings are guaranteed to be displayed, i.e. leaving a margin of approximately 0.3 m and 0.5 m from the near and far boundaries of the manipulator's working area, respectively.
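The brightness mapping can be sketched as a single clipped linear rescaling. The function name `depth_to_gray` is hypothetical, and truncating the result to an 8-bit integer is an assumption, since the formulas do not state how fractional brightness values are rounded.

```python
import numpy as np


def depth_to_gray(depth, d_min, d_max):
    """Map distances to 8-bit brightness: near points dark, far points light."""
    depth = np.asarray(depth, dtype=np.float64)
    scaled = 255.0 * (depth - d_min) / (d_max - d_min)
    # Clipping covers both boundary cases: D < Dmin gives 0, D > Dmax gives 255.
    return np.clip(scaled, 0.0, 255.0).astype(np.uint8)


# Example with the values from the text: Dmin = 0.2 m, Dmax = 5 m.
gray = depth_to_gray([0.1, 0.2, 2.6, 5.0, 6.0], d_min=0.2, d_max=5.0)
print(gray)
```

With these limits, a point 2.6 m away falls in the middle of the range and receives a mid-gray value, while everything nearer than 0.2 m is black and everything beyond 5 m is white.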

On the resulting two-dimensional grayscale image, objects are detected and identified using one of the methods selected from the Viola-Jones method, the SSD-mobilenet neural network method and the Mask R-CNN neural network method, thereby obtaining a set of objects of the three-dimensional scene.

The main feature of the claimed method, which distinguishes it from known analogues, is that objects are detected and identified not in an image of points in pixel coordinates, but in a two-dimensional grayscale image (preferably 8-bit) in which the brightness of a point depends on the true distance to the point, i.e. on the true coordinates of the point. What is detected and identified is therefore not patterns, drawings, inscriptions on objects and the like, but dark silhouettes of objects against a light background. Since the background is farther away than the objects, and there is some distance between the background and the objects, the background in the two-dimensional grayscale image is lighter than the objects, with a sharp boundary between them. Objects appear as compact, high-contrast dark silhouettes precisely because they are closer than the background, and the closer the object, the darker its silhouette. As a consequence, periodic, quasi-periodic and stochastic patterns, and more generally the transparency of the background and the objects, do not affect detection and identification: only the geometric silhouette obtained from the depth map is processed, and at this stage data on the color and optical characteristics of the object are no longer present, having been filtered out at the stereo reconstruction stage, where the visible two-dimensional image is replaced by a depth map that carries no information about the coloring of the object.

The robustness of the proposed method stems from the following. When images are analyzed directly, as is customary in the analogues, the interfering factors of coloring and transparency act directly on the 2D recognition algorithm, which is less resistant to errors. In the claimed method the images first undergo stereo reconstruction, whose result is incomparably more robust to these interfering factors, and the resulting depth map is not affected by them. In other words, stereo reconstruction is used as a filter that removes the interfering factors of coloring and transparency of objects and background, so that even a flat image of an object, such as a photograph, simply held up to the cameras will be recognized as exactly that, a flat photograph object. The effectiveness of the method is due to the fact that the result of stereo reconstruction is incomparably more robust to interfering factors than the object detection and identification stage, which makes the detection and identification of objects with complex coloring, full or partial transparency, etc. more robust and accurate.

Detection and identification of objects to obtain a set of objects of the three-dimensional scene is performed using one of the methods selected from the Viola-Jones method, the SSD-mobilenet neural network method and the Mask R-CNN neural network method.

When the Viola-Jones method is chosen, it is preferable to scan the image region using a sliding window procedure, since objects may be present anywhere in the image. A sliding window is a window whose size initially coincides with the scene image and is then reduced proportionally with a given step, for example by 0.1 of the window size at the previous step. For each window size, the window is placed successively over different parts of the scene image, and the presence of an object of interest in the window is checked. The sliding window is used in object detection tasks to cover all the areas that may be occupied by an object, after which the corresponding classifier checks whether an object is present in the window. It is also preferable to form a training sample and train the classifier before the object detection and identification stage. Training the classifier includes representing a test image as a feature vector, establishing whether the image belongs to a certain class of images, assessing the correctness of the classification, with at least one of the image class description and the object model being corrected in the case of an output error, and forming an averaged object belonging to the given class of images together with the rule by which classification is carried out most accurately. For example, a color image is treated as a set of numbers (features) by which the object is detected. A trained object detector consists of a description of what the input image must be (size and color); a description of how the input image is converted into a set of numeric features to be fed to the detector input (line-by-line reading and normalization); and the trained object detector itself, which produces either a binary judgment (Viola-Jones method) or a "correctness score", i.e. the weight of the object's membership in a given category, for example: the object is 97% cat, 2% dog, 1% brick. The category with the maximum weight is selected.
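The sliding window procedure described above can be illustrated with a small generator. It is only a sketch under stated assumptions: the name `sliding_windows`, the half-window stride (`stride_frac`) and the minimum window size (`min_size`) are hypothetical parameters that the text does not specify; only the starting size (the full image) and the proportional shrink step of 0.1 come from the description.

```python
def sliding_windows(img_w, img_h, shrink=0.1, min_size=24, stride_frac=0.5):
    """Yield windows (x, y, width, height) covering the image at several scales.

    The first window coincides with the whole image; each subsequent scale is
    smaller by `shrink` of the previous window size. `min_size` and
    `stride_frac` are assumptions of this sketch.
    """
    w, h = float(img_w), float(img_h)
    while min(w, h) >= min_size:
        win_w, win_h = int(round(w)), int(round(h))
        step_x = max(1, int(win_w * stride_frac))
        step_y = max(1, int(win_h * stride_frac))
        # Place the window over successive parts of the image at this scale.
        for y in range(0, img_h - win_h + 1, step_y):
            for x in range(0, img_w - win_w + 1, step_x):
                yield (x, y, win_w, win_h)
        w *= 1.0 - shrink
        h *= 1.0 - shrink


# Windows for the 640x480 video mode mentioned in the test description.
windows = list(sliding_windows(640, 480))
```

Each yielded window would then be cropped from the grayscale depth image and passed to the trained classifier, which decides whether an object of interest is present in it.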

To implement the Viola-Jones method, one can use the cvHaarDetectObjects() function of the open-source OpenCV library.

When the SSD-mobilenet neural network method or the Mask R-CNN neural network method is chosen, it is likewise preferable to form a training sample and train the classifier before the object detection and identification stage. Forming the training sample includes selecting objects in a flat color image and, for each object, forming a first sample of the object from the flat color image and a second sample of the object from the corresponding region of the disparity map. The training sample is applied until the recognition accuracy reaches a specified value, at which, in particular, the probability of an error of the first kind (failing to detect an object that is present) and the probability of an error of the second kind (detecting an object that is actually absent) are below a specified value (usually between 0.001 and 0.01), and the relative positioning error (the ratio of the area of the difference of the object frames to the area of the union of the frames) is, for example, less than 0.1.

The specific neural network methods SSD-mobilenet and Mask R-CNN were chosen because, in this class of problems, SSD-mobilenet optimally combines recognition quality and speed when the object is marked with a rectangular frame, while Mask R-CNN optimally combines recognition quality and speed when constructing a binary mask that covers the object as accurately as possible, i.e. when the relative difference between the region bounded by the object's contour and the region covered by the mask is minimal. Here the relative difference of regions is the ratio of the area of the difference of the regions to the area of their union. These neural networks can be implemented, for example, in the TensorFlow framework as a Python application.
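The relative difference of regions can be computed directly on binary masks. The sketch below assumes that both regions are given as equally sized lists of rows of 0/1 values and that "difference" means the symmetric difference (pixels belonging to exactly one of the two regions); the function name `relative_mask_difference` is hypothetical.

```python
def relative_mask_difference(object_mask, predicted_mask):
    """Ratio of the symmetric-difference area to the union area of two masks.

    Both masks are lists of rows of 0/1 values with identical dimensions.
    Returns 0.0 for a perfect match and 1.0 for masks that do not overlap.
    """
    sym_diff = 0
    union = 0
    for row_a, row_b in zip(object_mask, predicted_mask):
        for a, b in zip(row_a, row_b):
            a, b = bool(a), bool(b)
            if a or b:
                union += 1
            if a != b:
                sym_diff += 1
    # Two empty masks are treated as identical (an assumption of this sketch).
    return sym_diff / union if union else 0.0


true_region = [[0, 1, 1],
               [0, 1, 1]]
mask = [[0, 1, 1],
        [0, 1, 0]]
difference = relative_mask_difference(true_region, mask)
```

In this example the mask misses one of the four object pixels, so the relative difference is 1/4.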

Since training for each of the object detection and identification methods does not take place in real time (i.e. the time spent on it is only weakly constrained), and a trained classifier can be replicated as many times as needed, it is advisable to train all three classifiers corresponding to the above methods of object detection and identification.

It then becomes possible to use the classifier that provides the maximum recognition quality according to the following criteria:

- robustness, i.e. minimal dependence on the type of lighting and on background objects in the scene;

- minimization of errors of the first kind, when an object present in the scene is not detected, i.e. not recognized;

- minimization of errors of the second kind, when an object is detected that is actually absent;

- minimization of form-factor estimation errors, when the generated object frame differs from the "true" frame bounding the object. A universal relative criterion for the closeness of two frames is used here: the ratio of the area of the symmetric difference of the frames (that is, the regions lying inside one frame but outside the other) to the area of the union of the two frames.
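The frame closeness criterion from the last item can be written out for axis-aligned rectangles. This is a minimal sketch: the `(x1, y1, x2, y2)` corner representation and the function names are assumptions made for illustration. The area of the symmetric difference equals the union area minus the intersection area, so the criterion equals one minus the familiar intersection-over-union ratio.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2); empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def relative_frame_error(box_a, box_b):
    """Symmetric-difference area divided by union area of two frames."""
    # Intersection rectangle (may be empty if the frames do not overlap).
    inter_box = (max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                 min(box_a[2], box_b[2]), min(box_a[3], box_b[3]))
    inter = box_area(inter_box)
    union = box_area(box_a) + box_area(box_b) - inter
    if union == 0.0:
        return 0.0  # two degenerate frames are treated as identical
    return (union - inter) / union


detected = (5, 0, 15, 10)
reference = (0, 0, 10, 10)
error = relative_frame_error(detected, reference)
```

Here the two 10 by 10 frames overlap by half, giving an error of 100/150, well above the example threshold of 0.1 mentioned for the positioning error.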

In each case, the object detection and identification method that ensures the maximum detection quality is applied. The method is selected on the basis of scene analysis, background analysis and environment analysis, in particular on the basis of empirical data on the best detection method for a given scene structure. For example, if an object is classified as an object of a fixed, known shape (for example, a round apple or a cylindrical glass), the classifier of the Viola-Jones method or the classifier of the SSD-mobilenet neural network method is sufficient, since either will determine the class of the object and draw the frame bounding the object with sufficient accuracy. If the shape of the object can vary greatly (protrusions, bends and depressions may appear in fairly arbitrary places), it is preferable to use the Mask R-CNN neural network method, which, among other things, makes it possible to determine the shape of the object by indicating its true current shape with a binary mask.

The claimed method was tested repeatedly on various objects of complex shape and texture, including the recognition of transparent objects and objects of unknown (random) coloring, as well as against complex backgrounds.

In tests of the claimed method for obtaining a set of objects of a three-dimensional scene, the 640x480 video mode was used for the left and right cameras of the stereo camera. The distance to the objects varied from 1 to 5 m, and the characteristic sizes of the objects ranged from 0.03 to 0.5 m. The objects used were papier-mâché apples approximately 0.1 m in diameter, cardboard and plastic glasses with a capacity of 0.25-0.5 l, glass and plastic bottles of the same capacity, and other items. The papier-mâché apples had a monochrome green, yellow or red coloring; the glasses and bottles were transparent, monochrome, or decorated with various colored patterns and designs on their side surfaces.

An example of the implementation of the claimed method is shown in the figure. The left frame shows the image from the stereo camera, and the right frame shows the corresponding depth map (color images were converted to grayscale). Rectangular frames mark the output of the classifier, which jointly processes color and depth data (the frames in the left and right images are identical). The classifier captures transparent objects (plastic bottles) precisely because they are clearly visible in the depth map.

With analogous methods based on recognition of the two-dimensional image, the transparent plastic bottles were not detected.

In addition, a pattern applied to an object can act as camouflage, that is, hinder recognition of the object or cause the applied two-dimensional image to be recognized instead of the actual object. The claimed method is free of this drawback as well.

Thus, the claimed method for obtaining a set of objects of a three-dimensional scene recognizes objects of complex and random coloring, transparent objects, and complexly colored objects against a complexly colored background, including patterns and colorings that were not and could not have been present in the training sample. The method makes it possible to search not only for objects of a given purpose, but also for objects of a form factor that is convenient for packaging, convenient for handling by a given model of manipulator, etc. The implementation of the method places no special demands on hardware resources, since it reduces to stereo reconstruction and to object detection and identification methods such as the Viola-Jones method and the SSD-mobilenet and Mask R-CNN neural networks, which means that the method is fast and easy to use.

Claims

1. A method for obtaining a set of three-dimensional scene objects, in which the following steps are performed:

a) a left frame from the left camera and a right frame from the right camera are acquired substantially simultaneously when filming a scene,

b) a disparity map is formed by semi-global stereo matching to obtain the disparity d(x, y) for each image point with pixel coordinates (x, y),

c) the true coordinates (X, Y, Z) of the point with pixel coordinates (x, y) are determined using the formulas

X=(x⋅Q00+Q03)/W,

Y=(y⋅Q11+Q13)/W,

Z=Q23/W,

where W=d⋅Q32+Q33, and Q00, Q03, Q11, Q13, Q23 are constants determined by the focal lengths of the left camera and the right camera and the distance between the left camera and the right camera,

d) a depth map D(x, y) is formed, where D is the true distance from the left camera or the right camera to the point with pixel coordinates (x, y),

e) a two-dimensional grayscale image is formed, in which the brightness Ф(x, y) of the point with pixel coordinates (x, y) is determined by the formulas:

Ф(x, y)=0, if D(x, y)<Dmin,

Ф(x, y)=255, if D(x, y)>Dmax,

Ф(x, y)=255⋅(D(x, y)-Dmin)/(Dmax-Dmin) - in other cases,

where Dmin and Dmax are the specified minimum and maximum depth values, respectively,

f) on the resulting two-dimensional gray scale image, objects are detected and identified using one of the methods selected from the Viola-Jones method, the SSD-mobilenet neural network method and the Mask R-CNN neural network method, obtaining a set of three-dimensional scene objects.

2. The method according to claim 1, in which step f) is performed by the Viola-Jones method, the image region being scanned using a sliding window procedure.

3. The method according to claim 2, in which, before the start of step f), a training sample is formed and the classifier is trained.

4. The method according to claim 3, in which training the classifier includes:

- representation of the test image by a feature vector,

- establishing whether an image belongs to a certain class of images,

- assessing the correctness of the classification, and in the case of an output error, at least one of the description of the image class and the object model is corrected, and

- formation of an average object belonging to a given class of images, and the rule by which classification is carried out most accurately.

5. The method according to claim 1, in which step f) is performed using the SSD-mobilenet neural network method.

6. The method according to claim 1, in which step f) is performed using the Mask R-CNN neural network method.

7. The method according to claim 5 or 6, in which, before the start of step f), a training sample is formed and the classifier is trained.

8. The method according to claim 7, in which the formation of a training sample includes:

- highlighting objects on a flat color image,

- formation for each object of the first sample of an object from a flat color image and the second sample of an object from the corresponding section of the disparity map.

9. The method according to claim 8, in which training the classifier includes applying a training sample until the recognition accuracy reaches a specified value.

10. The method according to claim 1, in which at step f) the selection of a method for detecting and identifying objects is carried out based on scene analysis, background analysis and environment analysis.