RU2816047C1

RU2816047C1 - Method and device for marking sign language gestures

Info

Publication number: RU2816047C1
Application number: RU2023117837A
Authority: RU
Inventors: Алексей Леонидович Приходько; Михаил Геннадьевич Гриф
Filing date: 2023-07-06
Publication date: 2024-03-26

Abstract

FIELD: data processing.

SUBSTANCE: the group of inventions relates to information technology, in particular to a device and a method for marking sign language gestures. The method, and a device implementing it, comprise the steps of: displaying in a user interface a video fragment containing one gesture component, a gesture or a sign language phrase, wherein the frame contains one speaker in frontal view performing at least one sign language gesture during the video, and the frame fully covers the region of space in which the gesture is performed and the parts of the speaker's body involved in it; providing an interface element for indicating the right hand and the left hand; displaying in the user interface a set of visual representations characterizing gesture components included in a predefined sign language notation system, wherein the set comprises at least a block of hand configuration visual representations and a block of hand orientation visual representations, each visual representation being an image, an animated image or a short video that characterizes the corresponding gesture component in simplified form, and at least one block comprises sub-blocks of visual representations for each hand, formed according to gesture component parameters that include the hand configuration type or the hand orientation type; receiving from the user, for a gesture contained in the video fragment and for each hand participating in the gesture, an indication of the hand and an indication of one or more visual representations from each block, both made using the on-screen pointer; generating, for each hand participating in the gesture, a gesture identifier corresponding to the notation of this gesture in the notation system, according to the indications received from the user for this hand; and creating an entry in a markup file containing an indication of the video fragment and, for each hand participating in the gesture, its gesture identifier. The dataset comprising the markup file and the video fragments specified in it is used for machine learning of a gesture recognition model, with the video fragments as training data and the gesture identifiers from the markup file as labels.

EFFECT: the group of inventions increases the speed of marking by simplifying the interpretation of gesture components.

11 cl, 10 dwg

Description

Field of the invention

The invention relates to information technology, in particular to a device and a method for marking sign language gestures.

Prior art

Spoken languages are the traditional and familiar means of communication for hearing people; non-verbal means supplement them only as an auxiliary function or to enhance self-expression. Unlike spoken languages, the sign language of the deaf is a natural language in which, conversely, the gesture is the main unit of meaning. There is a language barrier between deaf and hearing people, as well as between users of different sign and spoken languages. In addition, many speech recognition and speech-to-text systems have been developed for spoken languages, whereas sign languages do not yet have such widely available tools. One important social task is to remove such barriers by creating technical means of communication that allow rapid translation from sign language to spoken language and back, as well as from sign language to text and back. This requires a sign language recognition system. The present invention aims to facilitate the creation of such a system based on machine learning techniques. Since sign language is visual, recognition requires a video or image of the speaker. Accordingly, training a recognition model requires a dataset of video recordings or images annotated with the sign language gestures they contain.

RU 2737600 C1 discloses a method for generating training data for a neural network at a retail point of sale. For each customer, images of the set of purchased goods are dynamically generated by an image recording device, which records, stores and transmits the captured image of the goods with a timestamp, together with a cash register capable of forming a data set. The data may include both the items registered by the cash register and listed in the receipts and the registration data for those items corresponding to objects placed in the field of view of the image recording device, which are sent to a data-processing server. This solution makes it possible to generate training data, but it is not applicable to marking sign language gestures.

RU 2754095 C1 discloses a technique for preparing sets of photographs for machine analysis in order to identify individual animals by their muzzle. This solution makes it possible to form a set of photographs for machine analysis, but it is not applicable to marking sign language gestures.

WO 2022/226642 A1 discloses a method for generating a labeled dataset in which a person is video-recorded, after which the user selects a timestamp within the video and enters a label corresponding to that timestamp. The solution is stated to be applicable to marking sign language gestures, but it does not take the features of sign language into account and does not address recording conditions. In particular, sign language gestures can be performed with the right hand alone, the left hand alone, or both hands together, and this solution does not explain how to cover all these cases during marking without complicating the process. In addition, gesture recognition faces the problem of segmenting different parts of the body; the user performing the marking faces similar difficulties, and this solution offers no way to simplify the interpretation of gestures.

Moreover, several notation systems have been developed for the symbolic representation of sign language, in which a gesture is written by describing the combination of its components with special symbols. For example, one of the best-known notation systems, SignWriting, contains more than 38,000 symbols. Simply listing these symbols on a computer screen so that the user can pick the right one as the label would require a high level of language proficiency from the user and, even then, would take an extremely long time.

Thus, there is a need in the art for technical means that simplify the process of marking sign language gestures.

Summary of the invention

The present invention is directed to devices and systems that eliminate at least some of the above disadvantages of the prior art.

In particular, a method for marking sign language gestures is proposed, implemented by a computer device comprising a processor, memory, a screen and a means for controlling an on-screen pointer, the method comprising the steps of:

- displaying in a user interface a video fragment containing one gesture component, a gesture or a sign language phrase, wherein the frame contains one speaker in frontal view who performs at least one sign language gesture during the video, and the frame fully covers the region of space in which the gesture is performed and the parts of the speaker's body involved in performing it,

- providing an interface element for indicating the right hand and the left hand,

- displaying in the user interface a set of visual representations characterizing the gesture components included in a predefined sign language notation system, wherein the set of visual representations contains at least the following blocks: a block of hand configuration visual representations and a block of hand orientation visual representations,

wherein each visual representation is an image, an animated image or a short video that characterizes the corresponding gesture component in simplified form,

wherein at least one block of visual representations contains sub-blocks of visual representations for each hand, the sub-blocks being formed according to the parameters of the gesture components,

wherein the parameters of the gesture components include the hand configuration type or the hand orientation type,

- receiving from the user, for a gesture contained in the video fragment and for each hand participating in the gesture, an indication of the hand and an indication of one or more visual representations from each block of visual representations, both made using the on-screen pointer,

- generating, for each hand participating in the gesture, a gesture identifier corresponding to the notation of this gesture in the notation system, according to the indications received from the user for this hand, and

- creating a record in a markup file containing an indication of the video fragment and, for each hand participating in the gesture, the gesture identifier,

wherein a dataset containing the markup file and the video fragments specified in it is used for machine learning of a sign language gesture recognition model, the video fragments serving as training data and the gesture identifiers from the markup file serving as labels.
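The identifier-forming and record-forming steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation behind the invention: the component codes, the "+" and "|" separators used to build an identifier, and the JSON Lines layout of the markup file are all hypothetical, since the text does not fix a concrete encoding.

```python
import json
from dataclasses import dataclass


@dataclass
class HandSelection:
    """Visual representations chosen by the annotator for one hand.

    The component codes are hypothetical placeholders, not symbols of a real
    notation system.
    """
    hand: str                 # "right" or "left", from the hand-indication element
    configuration: list[str]  # codes picked in the hand configuration block
    orientation: list[str]    # codes picked in the hand orientation block


def gesture_identifier(sel: HandSelection) -> str:
    """Join the chosen codes into one identifier (separator scheme assumed)."""
    if not sel.configuration or not sel.orientation:
        raise ValueError("select at least one representation in every block")
    return "+".join(sel.configuration) + "|" + "+".join(sel.orientation)


def markup_record(fragment: str, selections: list[HandSelection]) -> dict:
    """Build one markup entry: the fragment plus one identifier per hand used."""
    hands: dict[str, str] = {}
    for sel in selections:
        if sel.hand not in ("right", "left"):
            raise ValueError(f"unknown hand: {sel.hand!r}")
        if sel.hand in hands:
            raise ValueError(f"hand {sel.hand!r} selected twice")
        hands[sel.hand] = gesture_identifier(sel)
    return {"fragment": fragment, "hands": hands}


def append_record(path: str, record: dict) -> None:
    """Append one entry to a JSON Lines markup file (layout assumed)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For a one-handed gesture only one selection is passed: `markup_record("clip_0001.mp4", [HandSelection("right", ["cfg_07"], ["palm_in", "tips_up"])])` yields `{"fragment": "clip_0001.mp4", "hands": {"right": "cfg_07|palm_in+tips_up"}}`, while a two-handed gesture passes one selection per hand. Keeping a separate identifier per hand is what lets a single record format cover right-hand, left-hand and two-handed gestures.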

In one embodiment, the method further comprises the step of:

- making a video recording of a person speaking sign language.

In one embodiment, the speaker wears clothing whose color contrasts with the background, the face and the hands or gloves. The clothing fully covers the body up to the neck and the arms at least in the shoulder area, has no elements projecting above the neck, and closely follows the shape of the speaker's body. The color of the hands or gloves contrasts with the clothing, and the background color contrasts with the face and the hands or gloves.

In one embodiment, the method further comprises the steps of:

- receiving and displaying in the user interface an input video containing at least one sign language gesture,

- receiving input from the user indicating one or two points in time within the duration of the input video for selecting a video fragment,

- extracting from the input video a video fragment containing one gesture component, a gesture or a phrase, the video fragment comprising one or more consecutive video frames.
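The fragment-selection steps above amount to mapping the one or two chosen moments onto a run of consecutive frames. A minimal sketch, assuming times in seconds, a constant frame rate and rounding to the nearest frame; the function name is hypothetical, and actually decoding and writing the frames is left to whatever video library is used.

```python
def fragment_frames(times: list[float], fps: float, total_frames: int) -> range:
    """Return the consecutive frame indices selected by one or two time points.

    One point selects the single frame at that moment; two points select every
    frame between them, inclusive, in either order. Indices are clamped to the
    video length. Python's round() sends exact halves to the even neighbour.
    """
    if fps <= 0 or total_frames <= 0:
        raise ValueError("fps and total_frames must be positive")
    if len(times) not in (1, 2):
        raise ValueError("expected one or two points in time")
    idx = [min(max(round(t * fps), 0), total_frames - 1) for t in times]
    return range(min(idx), max(idx) + 1)
```

For example, at 25 fps the points 1.0 s and 2.0 s select frames 25 through 50, i.e. `range(25, 51)`, and a single point yields a one-frame fragment.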

In one embodiment, the set of visual representations further contains a block of gesture location visual representations, a block of hand movement visual representations and a block of visual representations of non-manual gesture components, and the parameters of the gesture components further include the place where the gesture is performed, the type of movement or the non-manual object involved in the gesture.
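One way to organize the blocks and per-hand sub-blocks described above is a nested mapping from block to hand to parameter type. This is a hypothetical sketch: the block names, parameter types and media paths are placeholders, and splitting every block except the non-manual one into right-hand and left-hand sub-blocks is an assumption consistent with, but not stated by, the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Representation:
    """One visual representation (image, animation or short clip) of a component."""
    code: str        # notation code it stands for (hypothetical)
    block: str       # "configuration", "orientation", "location", "movement" or "non_manual"
    param_type: str  # grouping parameter, e.g. hand configuration type or movement type
    media: str       # path to the image, animation or video file (hypothetical)


PER_HAND_BLOCKS = ("configuration", "orientation", "location", "movement")


def build_sub_blocks(catalog: Iterable[Representation],
                     per_hand_blocks: tuple[str, ...] = PER_HAND_BLOCKS):
    """Group representations as block -> hand -> parameter type -> [representations].

    Per-hand blocks get separate "right" and "left" sub-blocks so that each
    hand's choices are recorded independently; other blocks go under "shared".
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rep in catalog:
        hands = ("right", "left") if rep.block in per_hand_blocks else ("shared",)
        for hand in hands:
            tree[rep.block][hand][rep.param_type].append(rep)
    return tree
```

The interface can then render each `tree[block][hand][param_type]` list as one sub-block, so the annotator browses a short group of one type instead of the full symbol inventory of the notation system.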

In one embodiment, the method further comprises the step of:

- forming and storing in memory a dataset containing the markup file and the video fragments specified in the markup file.
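Once the dataset is stored, preparing it for training means pairing each fragment listed in the markup file with its per-hand gesture identifiers. A minimal sketch, assuming a JSON Lines markup file in which every entry has hypothetical `fragment` and `hands` keys and all fragments live in one directory; the function name is hypothetical too.

```python
import json
from pathlib import Path
from typing import Iterable


def load_training_pairs(markup_lines: Iterable[str], video_dir: str):
    """Pair every listed video fragment with its labels.

    Returns (pairs, missing): pairs are (fragment_path, {"right": id, ...})
    tuples usable as (training input, label); missing lists fragments named in
    the markup file but absent from video_dir, so gaps are reported rather
    than silently dropped.
    """
    root = Path(video_dir)
    pairs, missing = [], []
    for line in markup_lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        clip = root / entry["fragment"]
        if clip.is_file():
            pairs.append((clip, entry["hands"]))
        else:
            missing.append(entry["fragment"])
    return pairs, missing
```

An open text file can be passed directly, since iterating it yields its lines: inside `with open("markup.jsonl", encoding="utf-8") as f:` call `load_training_pairs(f, "fragments")`.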

In addition, a device for marking sign language gestures is proposed, comprising a processor, memory, a screen and a means for controlling an on-screen pointer, the device being configured to:

- display in a user interface a video fragment containing one gesture component, a gesture or a sign language phrase, wherein the frame contains one speaker in frontal view who performs at least one sign language gesture during the video, and the frame fully covers the region of space in which the gesture is performed and the parts of the speaker's body involved in performing it,

- provide an interface element for indicating the right hand and the left hand,

- display in the user interface a set of visual representations characterizing the gesture components included in a predefined sign language notation system, wherein the set of visual representations contains at least the following blocks: a block of hand configuration visual representations and a block of hand orientation visual representations,

- receive from the user, for a gesture contained in the video fragment and for each hand participating in the gesture, an indication of the hand and an indication of one or more visual representations from each block of visual representations, both made using the on-screen pointer,

- generate, for each hand participating in the gesture, a gesture identifier corresponding to the notation of this gesture in the notation system, according to the indications received from the user for this hand,

- create a record in a markup file containing an indication of the video fragment and, for each hand participating in the gesture, the gesture identifier,

wherein a dataset containing the markup file and the video fragments is used for machine learning of a sign language gesture recognition model, the video fragments serving as training data and the gesture identifiers from the markup file serving as labels.

In one embodiment, the device further comprises a camera configured to:

- make a video recording of a person speaking sign language.

In one embodiment, the device is further configured to:

- receive and display in the user interface an input video containing at least one sign language gesture,

- receive input from the user indicating one or two points in time within the duration of the input video for selecting a video fragment,

- extract from the input video a video fragment containing one gesture component, a gesture or a phrase, the video fragment comprising one or more consecutive video frames,

- form and store in memory a dataset containing the markup file and the video fragments specified in the markup file.

Technical result

The present invention increases the speed and accuracy of marking by simplifying the interpretation of gesture components.

These and other advantages of the present invention will become apparent upon reading the following detailed description with reference to the accompanying drawings.

Brief description of the drawings

FIG. 1 shows an example of a system for marking gestures, training a model and recognizing gestures.

FIG. 2 shows a block diagram of a device for marking gestures.

FIG. 3 shows examples of several gestures written in different notation systems.

FIG. 4 shows a flowchart of a method for marking gestures.

FIGS. 5-6 show examples of the interface of the device for marking gestures.

FIG. 7 shows an example of a markup file.

FIGS. 8A-8B show further examples of a block of visual representations.

FIG. 9 shows the test results of the present invention.

It should be understood that the figures may be schematic and not to scale and are intended primarily to aid understanding of the present invention.

Detailed description

In the structure of sign language, attention centers on a focal area known as the "sign space"; essentially all gestures are concentrated in this area. The region of space around the signer's body forms a "bubble". The sign space extends forward from the signer's chest and spans from the waist to the top of the head and the full width of the shoulders. It is within this area that signers move their hands when communicating. Signers know that they must keep this space clear and not block it from the people they are talking to. The sign space is essentially an extension of the signer's body. The space around a person's body is actively created independently of the body and of language. Sign space theory emphasizes that space is a necessary component of communication in sign language.

The sign space can change size depending on the situation. When a person is "shouting", their sign space covers a much larger area to accommodate the larger movements of the signs. Conversely, when a signer wants to "whisper", his or her sign space shrinks and often shifts, sometimes to the abdomen, to keep the gestures hidden from view. Whatever its size, the space used is an important extension of the body and is necessary for communication. The sign space grows or shrinks depending on how much attention the signer wishes to attract.
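The frame requirement stated in the method (the frame must fully cover the region in which the gesture is performed) can be checked against a 2D approximation of the sign space just described. A rough sketch under stated assumptions: body landmarks (top of head, waist, both shoulders) are taken as already available in pixel coordinates, for example from a pose estimator, with y growing downward; depth (the forward extent from the chest) is ignored; and the `scale` factor standing in for "shouting" or "whispering" is a hypothetical knob, not a value from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def sign_space(head_top_y: float, waist_y: float,
               shoulder_a_x: float, shoulder_b_x: float,
               scale: float = 1.0) -> Box:
    """Shoulder-width by waist-to-head-top box, optionally scaled about its centre."""
    if waist_y <= head_top_y:
        raise ValueError("waist must lie below the top of the head")
    if scale <= 0:
        raise ValueError("scale must be positive")
    x0, x1 = sorted((shoulder_a_x, shoulder_b_x))
    cx, cy = (x0 + x1) / 2, (head_top_y + waist_y) / 2
    half_w = (x1 - x0) / 2 * scale
    half_h = (waist_y - head_top_y) / 2 * scale
    return Box(cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def frame_covers(frame_w: int, frame_h: int, space: Box) -> bool:
    """True if the whole (possibly scaled) sign space lies inside the frame."""
    return Box(0, 0, frame_w, frame_h).contains(space)
```

With shoulders at x = 200 and 440 px, the head top at y = 50 and the waist at y = 450, the box is 240 × 400 px and fits a 640 × 480 frame; scaling it by 1.5 pushes it past the top and bottom edges, flagging footage that would crop a "shouted" gesture.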

In linguistics, a gesture is an element of sign language that consists of five components: hand configuration, hand orientation, location (the place where the gesture is performed), movement, and the non-manual component. The present invention proposes using this model for marking gestures.

Changing one component of a gesture changes its meaning. For example, changing just one component can change the lexical meaning (direction, top-to-bottom versus right-to-left: DAD – MOM; orientation and movement: MOSCOW – GRANDMOTHER – OLD) or the morphological (grammatical) meaning (direction of movement: (I) GIVE – (they) GIVE (me); single versus repeated movement: RECALL (once) – RECALL (repeatedly)).

Important evidence that a given difference is phonological (meaning-distinguishing) in a particular language is the existence of a minimal pair. A minimal pair is two different morphemes or word forms that differ in only one component in the same position.
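The five-component model and the definition of a minimal pair translate directly into a small comparison routine. A minimal sketch, assuming each component has already been reduced to a single label; the labels in the demo are hypothetical placeholders except for the configuration letters, which follow the "Siberia"/"Novosibirsk" example given in section 1 below.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Gesture:
    """A gesture described by the five components (each reduced to one label)."""
    configuration: str
    orientation: str
    location: str
    movement: str
    non_manual: str


def differing_components(a: Gesture, b: Gesture) -> list[str]:
    """Names of the components in which two gestures differ."""
    return [f.name for f in fields(Gesture) if getattr(a, f.name) != getattr(b, f.name)]


def is_minimal_pair(a: Gesture, b: Gesture) -> bool:
    """Minimal pair: the gestures differ in exactly one component."""
    return len(differing_components(a, b)) == 1


if __name__ == "__main__":
    # Placeholder labels; only the configurations "С" and "Н" come from the text.
    siberia = Gesture("С", "orient_1", "loc_1", "move_1", "none")
    novosibirsk = replace(siberia, configuration="Н")
    print(differing_components(siberia, novosibirsk))  # ['configuration']
    print(is_minimal_pair(siberia, novosibirsk))       # True
```

Returning the names of the differing components, rather than only a yes/no answer, also shows which block of visual representations separates two gestures.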

The components of a gesture are described in more detail below.

1. Hand configuration

The configuration is the shape of the hand when performing a gesture. SIBERIA and NOVOSIBIRSK are an example of a minimal pair in Russian Sign Language whose gestures differ only in configuration. In SIBERIA the hands are in the "С" configuration (C-shaped hand), while in NOVOSIBIRSK they are in the "Н" (Cyrillic N) configuration. The place of articulation, the hand orientation and the movement are the same in both gestures.

2. Hand orientation

Orientation is the position of the palm and fingers in space relative to the signer's torso, and the position of the hands relative to each other. The palm can face up, down, right, left, toward the signer or away from the signer. It can also be turned edge-on to the signer. The fingertips can point up, down, right, left, toward the signer, away from the signer, diagonally, etc. In two-handed gestures, the hand orientations can be symmetrical: the hands can be parallel to each other, one hand can be above, behind or in front of the other, the hands can be crossed, the palms can touch each other with their edges, etc. For example, in the Russian Sign Language gesture HOT the hand is turned with the fingertips up and the palm toward the signer. In TOPIC the hands are turned with the fingertips up and the palms away from the signer. In SIT the hands are turned palms down, parallel to the floor.
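One possible encoding treats orientation as a pair of directions: where the palm faces and where the fingertips point. The sketch below is deliberately reduced. The names `Direction` and `Orientation` are hypothetical. It omits diagonal directions, the edge-on palm and the two-handed relations listed above. The HOT, TOPIC and SIT values follow the examples in the text.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Direction(Enum):
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"
    TOWARD_SIGNER = "toward signer"
    AWAY_FROM_SIGNER = "away from signer"


class Orientation(NamedTuple):
    palm: Direction                # where the palm faces
    fingers: Optional[Direction]   # where the fingertips point; None if unspecified


# Examples from the text (Russian Sign Language).
HOT = Orientation(palm=Direction.TOWARD_SIGNER, fingers=Direction.UP)
TOPIC = Orientation(palm=Direction.AWAY_FROM_SIGNER, fingers=Direction.UP)
SIT = Orientation(palm=Direction.DOWN, fingers=None)  # the text gives only the palm direction

# HOT and TOPIC share the fingertip direction but differ in palm direction.
print(HOT.fingers == TOPIC.fingers, HOT.palm == TOPIC.palm)  # True False
```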

3. Localization

Gesture localization includes two features: the place of articulation and the setting. The place of articulation is one of several large areas within the gestural space in which a gesture can be produced: the head, the torso, the neutral gestural space and the passive hand. The setting specifies the location of the gesture within that area. For example, within the place of articulation designated "head", settings such as "forehead", "chin", "nose" and "ear" can be distinguished. The setting alone can distinguish a minimal pair of gestures. BOY and GIRL are both performed at the level of the signer's head, but BOY is performed at forehead level and GIRL at cheek level.
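The two-level structure (place of articulation, then setting within it) maps naturally onto a lookup table that rejects settings not belonging to the chosen place. The sketch below is a minimal illustration under assumptions. The inventory is partial and lists only the settings named in the text, including "cheek" from the GIRL example. `SETTINGS` and `make_localization` are hypothetical names.

```python
from typing import Dict, Optional, Set, Tuple

# Partial, illustrative inventory: only the settings named in the text are listed.
SETTINGS: Dict[str, Set[str]] = {
    "head": {"forehead", "chin", "nose", "ear", "cheek"},
    "torso": set(),
    "neutral space": set(),
    "passive hand": set(),
}


def make_localization(place: str, setting: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Validate a (place of articulation, setting) pair against the inventory."""
    if place not in SETTINGS:
        raise ValueError(f"unknown place of articulation: {place!r}")
    if setting is not None and setting not in SETTINGS[place]:
        raise ValueError(f"setting {setting!r} is not defined for {place!r}")
    return place, setting


BOY = make_localization("head", "forehead")
GIRL = make_localization("head", "cheek")
print(BOY[0] == GIRL[0], BOY[1] == GIRL[1])  # True False
```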

4. Movement

Movement is the most complex and internally heterogeneous parameter in the structure of a gesture. There are two main types: trajectory movement and local movement.

Trajectory movement is the movement of the hand from one location to another. Two of its features are relevant:
- direction: movement of the hand relative to the vertical, horizontal and sagittal axes;
- character: straight, arc, zigzag, spiral, sharp, smooth, etc.

For example, in the Russian Sign Language gesture FATHER the hand moves vertically from top to bottom, from the forehead to the chin. In MOUNTAIN the hand makes a smooth arcing movement along an imaginary plane parallel to the signer's body. In RIVER the hand moves forward in a horizontal plane in a zigzag.
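A trajectory movement can be recorded as a path shape plus optional direction, manner, and start and end localizations. The sketch below is a minimal illustration of that idea and covers trajectory movement only, not local movement. `Shape`, `TrajectoryMovement` and their fields are hypothetical names. Treating FATHER's vertical path as STRAIGHT is an assumption; the text does not name its shape.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Shape(Enum):
    STRAIGHT = "straight"
    ARC = "arc"
    ZIGZAG = "zigzag"
    SPIRAL = "spiral"


@dataclass(frozen=True)
class TrajectoryMovement:
    shape: Shape
    direction: Optional[str] = None   # e.g. "down", "forward"; None if not specified
    manner: Optional[str] = None      # e.g. "smooth", "sharp"
    start: Optional[str] = None       # starting localization, if known
    end: Optional[str] = None         # ending localization, if known


# Examples from the text; treating FATHER's path as STRAIGHT is an assumption.
FATHER = TrajectoryMovement(Shape.STRAIGHT, direction="down", start="forehead", end="chin")
MOUNTAIN = TrajectoryMovement(Shape.ARC, manner="smooth")  # in a plane parallel to the body
RIVER = TrajectoryMovement(Shape.ZIGZAG, direction="forward")
```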

5. Non-manual component

A common misconception about sign languages is that they are "languages of the hands", i.e., that their linguistic units are articulated only with the hands. The manual articulators, the hands, play an important role in signed speech. The other articulators (the torso, head, shoulders and parts of the face) are no less important. Linguistically significant components of signed speech that are not produced with the hands are called non-manual components or non-manual markers.

In addition to the manual components discussed above, many gestures include a non-manual component as an obligatory part. This can be a particular movement of the head and/or torso, a facial expression, mouthing or a mouth gesture. Such gestures are called combined gestures, and many Russian Sign Language gestures are combined. For example, BE SICK is usually performed with pursed lips. DO NOT BELIEVE is accompanied by slight turns of the head left and right. In WAKE UP the eyes are closed in the initial phase of the gesture and open in the final phase. In TEASE the eyes are slightly narrowed.

Mouthing is the silent articulation of the corresponding word (or part of it) of a spoken language. For example, in Dutch Sign Language the manual gesture FLOWER is accompanied by articulation of the corresponding Dutch word BLOEM, and MOTHER by articulation of the word MOEDER. Russian Sign Language also has gestures that are usually accompanied by articulation of the corresponding Russian word or part of it. For example, CHILDREN and HOUSE usually include silent articulation of the corresponding word, and STUDY is accompanied by the articulation "uchi" (from учиться).

Mouth gestures include various movements or positions of the lips and tongue, the articulation of certain sounds or sound combinations, and inhaling or exhaling through the mouth. Unlike mouthing, mouth gestures are not related to the spoken language and do not result from its influence. Russian Sign Language also has mouth gestures that accompany hand movements:
- In KISS and HIGH-QUALITY the lips are pushed forward.
- NO WAY is performed with the tip of the tongue sticking out. In this gesture the index finger makes a sharp, repeated horizontal movement along the chin.
- In NOTHING AT ALL, the hand in the "O" configuration is brought to the forehead and [u] is silently articulated. SPEAK is accompanied by [a], and TALK by [vavava].
- One of the gestures meaning 'cannot' is accompanied by a lip position resembling silent articulation of [u] and a sharp exhalation. In this gesture both hands are in configuration "1", and the index finger of the active hand sharply strikes the index finger of the passive hand.

Russian Sign Language also has minimal pairs that differ only in the non-manual component. For example, FREE (of charge) and BE IDLE have the same manual parameters, but FREE is accompanied by silent articulation of the syllable "bes". MEANING and DREAM differ in that, in DREAM, the gaze is usually directed upward, the mouth is slightly open and the head is tilted slightly back.
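The non-manual component can be broken into the channels listed above: head/torso, face, mouthing and mouth gesture. The sketch below is a minimal illustration under assumptions. `NonManual` and its field names are hypothetical. For FREE / BE IDLE and DREAM / MEANING, the text only describes what one member adds, so the other member is modeled as having no non-manual component (an assumption). Filing "mouth slightly open" under `mouth_gesture` is likewise a judgment call.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class NonManual:
    head_torso: Optional[str] = None      # head and/or torso movement or position
    face: Optional[str] = None            # facial expression, gaze, eyes
    mouthing: Optional[str] = None        # silent articulation of a spoken word or part of it
    mouth_gesture: Optional[str] = None   # lip/tongue action not derived from the spoken language

    def is_empty(self) -> bool:
        """True if no non-manual channel is specified."""
        return all(getattr(self, f.name) is None for f in fields(self))


# Minimal pairs from the text: identical manual parameters, different non-manual component.
# Modeling BE_IDLE and MEANING as empty is an assumption.
FREE = NonManual(mouthing="bes")
BE_IDLE = NonManual()
DREAM = NonManual(head_torso="head tilted slightly back", face="gaze upward",
                  mouth_gesture="mouth slightly open")
MEANING = NonManual()
```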

The present invention is aimed at creating a device and a method (a tool) for marking sign language gestures in video recordings. On that basis, it creates a training dataset with which a gesture recognition system can be trained.

An example of a system for gesture marking, model training and gesture recognition is shown in Fig. 1.

The gesture marking, training and recognition system comprises a gesture marking device 100, a model training device 200 and a gesture recognition device 300.

The gesture recognition device 300 comprises a computing device 310 and a camera 320. For example, this can be a PC with an external webcam, or a laptop, tablet or smartphone with a built-in camera. The camera 320 captures images of a person (user) communicating in sign language. The computer 310 uses a gesture recognition algorithm to determine which gesture the signer produced, and outputs text or sound corresponding to that gesture. The gesture recognition algorithm can be based on a machine learning model, for example a neural network.

The model is trained in advance in the model training device 200. Training such a model, however, requires a labeled dataset containing many video files or images of various gestures.

The labeled dataset is generated, and gestures are marked, using the gesture marking device 100 according to the present invention. An exemplary block diagram of this device is shown in Fig. 2.

The gesture marking device 100 comprises a marking tool 110, a camera 120 and a dataset storage device 130. The marking tool 110 is a computing device, such as a desktop computer, laptop or tablet, comprising a processor 111, a memory 112, a screen 113 and an on-screen pointer control 114. For example, on a desktop computer the screen may be a monitor and the pointer control may be a mouse or keyboard. On a tablet, the screen may be a touch screen and the pointer control may be the touch input built into that screen.

In an optional embodiment, the marking tool 110 may include a video camera 115, which can be used instead of or in addition to the camera 120. In yet another optional embodiment, the memory 112 of the marking tool 110 may store the dataset, in which case the dataset storage device 130 can be regarded as part of the marking tool 110.

Rapidly generating a labeled dataset of sign language gestures is a non-trivial task. The component model of sign language described above can cover all possible combinations of gesture components without losing meaningful data. Several notation systems are based on these principles, such as SignWriting and HamNoSys. In a notation system, a gesture is recorded by describing the combination of its components with special symbols. Fig. 3 shows examples of several gestures written in different notation systems. The examples demonstrate that such entries cannot be made with standard input means, so the annotator needs a tool for entering the required symbols. At the same time, SignWriting, one of the best-known notation systems, has more than 38,000 symbols. A tool could simply list them all on a computer screen, in every possible variation, for the user to pick the right one as markup. That approach would demand a high level of language proficiency from the user, and even then it would take an extremely long time.

To avoid unnecessary time consumption and difficulties of interpretation, the present invention proposes performing the method for marking sign language gestures as follows (see the flowchart of the method in Fig. 4).

First, a video recording of a person communicating in sign language is made. The frame contains a single signer facing the camera (full face), who performs one or more sign language gestures during the video. The frame completely covers the gestural space, i.e., the region of space in which the gesture is performed, as well as the parts of the signer's body involved in the gesture. For example, the video may show the signer's body from head to waist, only the head, or the entire body from head to toe.

To normalize the resulting database, the video can be recorded so that only the necessary area is in the frame, for example only the signer's body from head to waist. The camera can be framed so that the distance from the head to the top edge of the frame is minimal. The orientation and aspect ratio of the frame are chosen according to the dimensions of the gestural space. This ensures maximum coverage of the various sign language gestures while keeping the boundaries of the gestural space as close as possible to the frame boundaries. That prevents data loss and makes the gestures easier to interpret, both for the user performing the marking and, later, for the machine learning model trained on this dataset.

In the preferred embodiment, a landscape frame orientation is used and the signer is shown from the waist up in the center of the frame. The distance from the top of the signer's head to the top edge of the frame is minimized. Another embodiment uses a landscape frame orientation with the signer's head in the center of the frame. In that case, the distances from the top of the head to the top edge of the frame, and from the chin to the bottom edge of the frame, are minimized.
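The preferred framing (landscape, waist-up, minimal space above the head) can also be applied after the fact by cropping. The sketch below is a minimal illustration under assumptions. The head-top and waist coordinates are assumed to come from some pose or keypoint detector, which is not part of the text. The 16:9 aspect ratio and the 4-pixel margin are example values. `waist_up_crop` is a hypothetical helper, not part of the described device. The crop width follows from the aspect ratio; a real gestural space may reach wider than the shoulders, which is one reason a landscape frame is used.

```python
from typing import Tuple


def waist_up_crop(head_top_y: int, waist_y: int, center_x: int,
                  frame_w: int, frame_h: int,
                  aspect: float = 16 / 9, margin: int = 4) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of a landscape, waist-up crop.

    The top edge sits `margin` pixels above the head, the bottom edge at the
    waist, and the width follows from the aspect ratio. The box is centered
    on the signer and shifted back inside the frame if it crosses a side edge.
    """
    top = max(0, head_top_y - margin)
    bottom = min(frame_h, waist_y)
    height = bottom - top
    if height <= 0:
        raise ValueError("waist must be below the top of the head")
    width = round(height * aspect)
    if width > frame_w:
        raise ValueError("frame is too narrow for the requested aspect ratio")
    left = round(center_x - width / 2)
    left = min(max(0, left), frame_w - width)  # keep the box inside the frame
    return left, top, width, height


print(waist_up_crop(head_top_y=100, waist_y=820, center_x=960, frame_w=1920, frame_h=1080))
# (316, 96, 1287, 724)
```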

To make the gestures easier to interpret, both for the annotator and for the machine learning model trained on this dataset, the signer can follow these clothing guidelines:
- The color of the clothing contrasts with the background, with the face, and with the hands or gloves.
- The clothing fully covers the body up to the neck, and the arms at least at the shoulders (preferably down to the wrists).
- The clothing has no elements protruding above the neck (and preferably none extending beyond the wrists toward the hands).
- The clothing closely follows the shape of the signer's body.
- The color of the hands or gloves contrasts with the clothing.
- The background color contrasts with the face and with the hands or gloves.
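The text requires "contrasting" colors but gives no metric. One way to check a recording setup is the WCAG 2.x contrast ratio between sampled colors. This is a swapped-in measure, not one named in the text. The threshold of 3.0 is an arbitrary example. The pair list encodes the clothing, face, hand and background requirements above. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Pairs that the recording guidelines ask to contrast with each other.
PAIRS = [("clothing", "background"), ("clothing", "face"), ("clothing", "hands"),
         ("background", "face"), ("background", "hands")]


def relative_luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an sRGB color given as 0..255 ints."""
    def linear(c: int) -> float:
        s = c / 255
        return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4
    r, g, b = (linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black vs. white)."""
    la, lb = relative_luminance(a), relative_luminance(b)
    return (max(la, lb) + 0.05) / (min(la, lb) + 0.05)


def low_contrast_pairs(colors: Dict[str, RGB], threshold: float = 3.0) -> List[Tuple[str, str]]:
    """Pairs whose contrast ratio falls below `threshold` (an illustrative cut-off)."""
    return [(a, b) for a, b in PAIRS if contrast_ratio(colors[a], colors[b]) < threshold]


# Dark clothing in front of a dark background: only that pair is flagged.
sample = {"clothing": (20, 20, 20), "background": (30, 30, 30),
          "face": (224, 180, 150), "hands": (224, 180, 150)}
print(low_contrast_pairs(sample))  # [('clothing', 'background')]
```

In practice the sampled colors would be averaged over regions of a test frame. How those regions are found is outside this sketch.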

The recorded video must contain at least one sign language gesture.

The recorded video is received by the computer and displayed in the user interface, a general view of which is shown in Fig. 5.

The user watches the video (in other words, the marking tool 110 plays it on the screen 113). Using the interface, the user specifies the time boundaries within the video that enclose a video fragment. The fragment contains a single gesture component, a gesture, or a phrase (if the phrase can be marked).

If necessary, the fragment can be a single frame, in which case the user specifies one point in time corresponding to that frame within the duration of the input video. This can be done by pausing the video at the required point and pressing a button that records the time point. In another embodiment, the gesture spans more than one frame. The user then specifies the start and end of the desired fragment, i.e., the first and last frames, by entering a time, selecting the corresponding points on the playback bar, etc.

Note that the structure of the training input data differs depending on whether the fragment is a single frame or a sequence of frames. Different models may therefore be required to process different fragments.
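The two annotation modes (a single time point vs. a start/end pair) can share one representation that is converted to frame indices before extraction. The sketch below is a minimal illustration under assumptions. `Segment` and `_to_frame` are hypothetical names. Times are in seconds. A frame index is taken as floor(t × fps), clamped to the last frame of the video.

```python
from dataclasses import dataclass
from typing import Optional


def _to_frame(t: float, fps: float, n_frames: int) -> int:
    """Frame index shown at time t (floor), clamped to the last frame of the video."""
    if t < 0:
        raise ValueError("time must be non-negative")
    # The small epsilon absorbs float error when t * fps lands just below an integer.
    return min(int(t * fps + 1e-9), n_frames - 1)


@dataclass(frozen=True)
class Segment:
    """A marked fragment: one time point (single frame) or a start/end pair, in seconds."""
    start_s: float
    end_s: Optional[float] = None   # None means a single-frame annotation

    @property
    def is_single_frame(self) -> bool:
        return self.end_s is None

    def frame_range(self, fps: float, n_frames: int) -> range:
        """Indices of the frames covered by the fragment, end frame inclusive."""
        first = _to_frame(self.start_s, fps, n_frames)
        last = first if self.end_s is None else _to_frame(self.end_s, fps, n_frames)
        if last < first:
            raise ValueError("fragment ends before it starts")
        return range(first, last + 1)


print(list(Segment(2.0).frame_range(fps=25, n_frames=1000)))     # [50]
print(len(Segment(1.0, 2.0).frame_range(fps=25, n_frames=1000)))  # 26
```

`is_single_frame` lets downstream code route a fragment to an image model or a sequence model. This reflects the note above that the two cases need differently structured training inputs.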

Based on the user's indication, the marking tool 110 (specifically, the processor 111) extracts the video fragment from the input video.

The user interface also displays a window with a set of visual representations. These characterize the gesture components included in a predefined sign language notation system, such as SignWriting. The set of visual representations comprises at least the following units, shown one after another or all at once:
- a unit of hand configuration visual representations;
- a unit of hand orientation visual representations.

Marking these components makes it possible to train a high-quality machine learning model for gesture recognition. To further improve the training metrics of the gesture recognition model, additional embodiments may add the following units to the set:
- a unit of gesture localization visual representations;
- a unit of hand movement visual representations;
- a unit of non-manual component visual representations.

If necessary, the units can overlap to save screen space, which allows the visual representations to be larger. In another embodiment, the units do not overlap, so that the required visual representation remains easy to find. This matters, for example, when an incorrect visual representation was selected earlier and the selection in a previous unit must be changed without closing the current unit.
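The units described above can be modeled as named containers whose visual representations are grouped into sub-units per hand and per component parameter type (for example, hand configuration type). This grouping is the one stated in the abstract of this patent. The sketch below is a minimal illustration under assumptions. The class names, the `symbol_id` values ("cfg-001", ...), the parameter types ("type-A", "type-B") and the media paths are all hypothetical placeholders. They are not real SignWriting symbol identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VisualRepresentation:
    symbol_id: str      # hypothetical id of the notation symbol this picture stands for
    media: str          # path to an image, animated image or short video
    caption: str = ""   # optional explanatory text


@dataclass
class Unit:
    """A unit of visual representations (e.g. hand configuration), split into
    sub-units keyed by (hand, parameter type)."""
    name: str
    sub_units: Dict[Tuple[str, str], List[VisualRepresentation]] = field(
        default_factory=lambda: defaultdict(list))

    def add(self, hand: str, param_type: str, rep: VisualRepresentation) -> None:
        self.sub_units[(hand, param_type)].append(rep)

    def for_hand(self, hand: str) -> Dict[str, List[VisualRepresentation]]:
        """Sub-units to display once the user has indicated `hand`."""
        return {ptype: reps for (h, ptype), reps in self.sub_units.items() if h == hand}


configuration = Unit("hand configuration")
configuration.add("right", "type-A", VisualRepresentation("cfg-001", "img/cfg-001.png", "C-shaped hand"))
configuration.add("right", "type-B", VisualRepresentation("cfg-002", "img/cfg-002.png"))
configuration.add("left", "type-A", VisualRepresentation("cfg-001", "img/cfg-001-left.png"))
print(sorted(configuration.for_hand("right")))  # ['type-A', 'type-B']
```

Filtering by the indicated hand first keeps each displayed list short. That is the point of splitting a symbol set of tens of thousands of entries into units and sub-units.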

A visual representation is an image, an animated image or a short video that characterizes the corresponding gesture component in simplified form. It can take several forms:
- an image or outline of the body part involved in the gesture, in the shape or orientation corresponding to a given option of that gesture component;
- a single image depicting the movement made by that body part;
- a single image depicting the place where the gesture is performed.

This simplifies finding and interpreting the options needed to mark the gesture components correctly.

In one additional embodiment, a visual representation may be drawn as outlines. When the pointer hovers over it, its elements may be colored so that it can be identified more easily and quickly.

In another additional embodiment, a visual representation may be a static image or outline. When the pointer hovers over it, an animation of that gesture component may play, again for easier and faster identification.

In one additional embodiment, each visual representation within a block may be accompanied directly by explanatory text for easier and faster identification.

In yet another additional embodiment, a text hint may appear only when the pointer hovers over the visual representation. This aids identification without using up screen space.

From each block of visual representations, the user selects the visual representation that corresponds to the given gesture. The selection may be made, for example, by clicking directly on the representation or on its accompanying text. In another embodiment, the selection is registered only if the press on the required representation lasts at least a predetermined time, for example 1 second. In yet another embodiment, the selection may be made by clicking an indication field (e.g., a checkbox) next to the visual representation.
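The three selection variants can be expressed as one decision rule. The following is a minimal sketch, not the device's actual event handling. It assumes the hypothetical names `SelectionMode`, `PointerEvent` and `is_selection`, and it uses hypothetical target labels for what the pointer landed on. The 1-second threshold is the example value from the text, passed in as a parameter.

```python
from dataclasses import dataclass
from enum import Enum


class SelectionMode(Enum):
    CLICK = "click"            # a click on the representation or its caption selects it
    LONG_PRESS = "long_press"  # the press on the representation must last a minimum time
    CHECKBOX = "checkbox"      # only a click on the adjacent checkbox selects


@dataclass
class PointerEvent:
    target: str        # hypothetical target names: "representation", "caption", "checkbox"
    duration_s: float  # how long the pointer button was held down, in seconds


def is_selection(event: PointerEvent, mode: SelectionMode, min_press_s: float = 1.0) -> bool:
    """Decide whether a pointer event selects a visual representation."""
    if mode is SelectionMode.CLICK:
        return event.target in ("representation", "caption")
    if mode is SelectionMode.LONG_PRESS:
        # 1 second is the example threshold from the text; it is configurable here.
        return event.target == "representation" and event.duration_s >= min_press_s
    return event.target == "checkbox"
```

Keeping the threshold as a parameter reflects that the text only calls it "predetermined". It does not fix the value.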

The user interface also provides an interface element for indicating the right hand and the left hand. Each hand may act during a gesture, and the actions may be asymmetrical, so it is necessary to indicate separately which hand performs which actions. The user marks up one hand and then indicates that hand by clicking the corresponding button, for example “Save Right”. The markup for that hand is then stored in memory. If two hands are involved in a gesture, the user marks up first one hand and then the other.

The gesture marking device 100 does not impose a marking order. The user may freely decide which components and which hands to indicate, and in what order. The user can therefore mark up the gesture as he interprets its components, starting with the most obvious and understandable ones. This simplifies and speeds up the marking while taking the features of sign language into account.
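The order-independent, per-hand workflow can be modeled as follows:
- component choices go into a draft in any order;
- pressing “Save Right” or “Save Left” assigns the draft to that hand.

This is a minimal sketch under that assumption. The class `GestureMarkup`, the component names and the representation identifiers are hypothetical and are not taken from the patent.

```python
from typing import Dict

HANDS = ("right", "left")


class GestureMarkup:
    """Hypothetical in-memory markup of one gesture, stored per hand."""

    def __init__(self) -> None:
        self._draft: Dict[str, str] = {}            # component -> representation id, no hand yet
        self.saved: Dict[str, Dict[str, str]] = {}  # hand -> {component -> representation id}

    def select(self, component: str, representation_id: str) -> None:
        # Components may be chosen in any order; a later choice replaces an earlier one.
        self._draft[component] = representation_id

    def save_hand(self, hand: str) -> None:
        """Equivalent of pressing "Save Right" or "Save Left"."""
        if hand not in HANDS:
            raise ValueError(f"unknown hand: {hand!r}")
        self.saved[hand] = dict(self._draft)  # store a copy, then start a fresh draft
        self._draft.clear()
```

For example, a user can pick the orientation before the configuration and save the left hand first. The stored result is the same as with any other order.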

In another embodiment, the user does not need to click a separate “Save Right” or “Save Left” button. Instead, the interface displays a separate sub-block of visual representations for the right hand and another for the left hand.

This option suits, for example, hand orientation visual representations. After a hand configuration visual representation is selected, the interface displays a hand orientation sub-block for the right hand and one for the left hand. The hand orientation visual representations in each sub-block depict the selected hand configuration, taking into account whether it is the right or the left hand. As shown in the example of FIG. 6, the representations in the right-hand sub-block (on the right side of the interface) differ from those in the left-hand sub-block (on the left side). This is because the selected hand configuration looks different depending on which hand performs it. The user (annotator) therefore finds it easier to choose, and makes the choice directly for the required hand. Marking is thus simplified and accelerated while taking the features of sign language into account.

Once the selection has been made for the gesture in the video fragment, for each hand involved in the gesture, the marking tool 110 receives the corresponding user input via the on-screen pointer control 114.

For each hand involved in the gesture, the marking tool 110, using the processor 111, generates a gesture identifier from the user's indications for that hand. This identifier corresponds to the notation of the gesture in the notation system. The steps are as follows:
1. The marking tool 110 maps each indicated visual representation to an (alphanumeric) representation identifier. In various embodiments, the identifier may consist of letters, digits, other characters, or any combination thereof.
2. The marking tool 110 forms the gesture identifier by concatenating the representation identifiers for the given hand. The concatenation may be done with or without a delimiter, such as a space, a hyphen or an underscore.
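The concatenation step can be illustrated with a short function. This is a minimal sketch under two assumptions:
- the component identifiers are concatenated in a fixed order, here the hypothetical `COMPONENT_ORDER`, since the text does not state the order;
- unmarked components are simply skipped.

The identifiers "C12" and "O3" are hypothetical placeholders and are not symbols of any real notation system.

```python
from typing import Mapping, Sequence

# Hypothetical fixed order in which representation identifiers are concatenated.
COMPONENT_ORDER: Sequence[str] = ("configuration", "orientation", "localization", "movement")


def gesture_id(selected: Mapping[str, str], delimiter: str = "") -> str:
    """Concatenate the representation identifiers chosen for one hand.

    Components the annotator did not mark are skipped. The delimiter may be
    empty or, for example, a space, hyphen, or underscore.
    """
    parts = [selected[c] for c in COMPONENT_ORDER if c in selected]
    return delimiter.join(parts)
```

A fixed order keeps identical selections mapping to identical identifiers, whatever order the annotator clicked in.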

The marking tool 110, using the processor 111 and the memory 112, then creates a record in the markup file. The record contains an indication of the video fragment and, for each hand involved in the gesture, the gesture identifier.

An example markup file is shown in FIG. 7. It is a CSV file with three columns:
- “filename” identifies the video fragment by its file name in the folder where the video fragments are stored;
- “lefthand” contains the gesture identifier for the left hand;
- “righthand” contains the gesture identifier for the right hand.

A separate record (row) is created for each video fragment. For example, for the video fragment “401-5.mp4”, row 6 of the markup file indicates a gesture performed with the right hand, whose identifier is in the “righthand” column. An empty value in the “lefthand” column means that the left hand was not used to perform this gesture.
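A markup file with the three columns of FIG. 7 can be produced with Python's standard `csv` module. This is a minimal sketch. The function name `markup_csv` and the identifier "C12_O3" are hypothetical. Writing `None` as an empty cell follows the convention described above.

```python
import csv
import io
from typing import Iterable, Optional, Tuple

FIELDS = ["filename", "lefthand", "righthand"]


def markup_csv(rows: Iterable[Tuple[str, Optional[str], Optional[str]]]) -> str:
    """Render markup records as CSV text with the three columns of FIG. 7.

    Each row is (video file name, left-hand gesture id, right-hand gesture id).
    None means the hand was not used and is written as an empty cell.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for filename, left, right in rows:
        writer.writerow({"filename": filename, "lefthand": left or "", "righthand": right or ""})
    return buf.getvalue()
```

To save the result, write the returned text to a file opened with `open(path, "w", newline="", encoding="utf-8")`. This is the mode the `csv` documentation recommends, so that line endings are not translated twice.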

The markup file, together with the corresponding video fragments, constitutes a dataset that is stored in the dataset storage device 130.

The dataset obtained by marking up many video fragments is used to train a sign language gesture recognition model. The video fragments serve as training data, and the gesture identifiers serve as labels (markup).
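On the training side, each markup row becomes a pair of video path and per-hand labels. This is a minimal sketch assuming the CSV layout of FIG. 7 and videos stored in a single folder. The `Sample` type and `load_samples` function are hypothetical, and the text does not say how a particular model consumes them.

```python
import csv
from pathlib import Path
from typing import Iterable, List, NamedTuple, Optional


class Sample(NamedTuple):
    video: Path            # training input: the video fragment
    left: Optional[str]    # left-hand gesture id (label), None if the hand was not used
    right: Optional[str]   # right-hand gesture id (label), None if the hand was not used


def load_samples(lines: Iterable[str], video_dir: Path) -> List[Sample]:
    """Turn markup-file lines (header included) into (video, labels) training samples."""
    return [
        Sample(video_dir / row["filename"], row["lefthand"] or None, row["righthand"] or None)
        for row in csv.DictReader(lines)
    ]
```

In practice, pass an open file object, for example `with open("markup.csv", newline="", encoding="utf-8") as f: load_samples(f, Path("videos"))`.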

In a further embodiment of the present invention, the user interface may also include an element that indicates the hand and moves to the next video in a single action. For this purpose, the interface may contain a button such as “Save and Next” (save, then move on to the next video). In the example of FIG. 6, the block of hand orientation visual representations may contain an “OK and Next” button next to the “OK” and “Cancel” buttons. With a single click, the user can record the markup result for the current gesture and move on to the next video. Likewise, the user interface may include an element that indicates the hand and moves to the previous video.

In yet another embodiment, the marking tool 110 may include a setting so that clicking the “OK” button automatically advances to the next gesture. This speeds up the marking process without cluttering the interface with extra elements.
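The “OK”, “OK and Next” and previous-video actions, together with the optional auto-advance setting, amount to a small navigation state machine. This is a minimal sketch. The class `MarkupSession`, its field names, and the clamping at the first and last video are all assumptions; the text does not specify these.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MarkupSession:
    """Hypothetical navigation state for a marking session over a list of videos."""
    videos: List[str]
    auto_advance_on_ok: bool = False  # the optional "OK advances to next gesture" setting
    index: int = 0
    results: Dict[str, Dict[str, str]] = field(default_factory=dict)  # video -> {hand -> gesture id}

    def _record(self, hand_markup: Dict[str, str]) -> None:
        self.results[self.videos[self.index]] = dict(hand_markup)

    def _move(self, step: int) -> None:
        # Clamp so that navigation never runs past the first or last video.
        self.index = max(0, min(len(self.videos) - 1, self.index + step))

    def ok(self, hand_markup: Dict[str, str]) -> None:
        self._record(hand_markup)
        if self.auto_advance_on_ok:
            self._move(+1)

    def ok_and_next(self, hand_markup: Dict[str, str]) -> None:
        self._record(hand_markup)
        self._move(+1)

    def ok_and_previous(self, hand_markup: Dict[str, str]) -> None:
        self._record(hand_markup)
        self._move(-1)
```

Keeping auto-advance as a flag rather than a separate button mirrors the point above: the behavior changes without adding elements to the interface.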

FIGS. 8A-8B show further examples of a visual representation block. Here, hand orientation is displayed relative to the human body, taking into account which body part performs the gesture. The body parts are drawn schematically in the previously selected gesture configuration. The orientations may be displayed directly around the body part performing the gesture (FIG. 8A) or next to it (FIG. 8B). Gesture identification and marking are thus simplified and accelerated while taking the features of sign language into account.

The blocks of visual representations may be displayed sequentially in a tree structure. In particular, at least one block may contain several groups of visual representations, and the groups may be formed according to the parameters of the gesture components.
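The tree display can be modeled as nested nodes: a block, then groups, then the selectable representations. The user descends one level per click. This is a minimal sketch with hypothetical names (`Node`, `descend`, `options`). The example groups and labels are illustrative only and are not taken from any real notation system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Node:
    label: str
    rep_id: Optional[str] = None                          # set only on leaves (selectable representations)
    children: List["Node"] = field(default_factory=list)


def descend(root: Node, path: Sequence[str]) -> Node:
    """Follow the labels the user clicked, one tree level per click."""
    node = root
    for label in path:
        match = next((child for child in node.children if child.label == label), None)
        if match is None:
            raise KeyError(label)
        node = match
    return node


def options(root: Node, path: Sequence[str]) -> List[str]:
    """Labels displayed at the current level of the tree."""
    return [child.label for child in descend(root, path).children]


# Illustrative hand configuration block, grouped by a hypothetical configuration type.
HAND_CONFIGURATION = Node("Hand configuration", children=[
    Node("Fist-based", children=[Node("Fist", "C1"), Node("Fist, thumb extended", "C2")]),
    Node("Flat hand", children=[Node("Flat, fingers together", "C12")]),
])
```

Showing only one level at a time keeps each screen short, which is the space-saving motivation behind grouping.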

In one embodiment, the interface may further include elements for copying the markup from the previously marked gesture (see the lower left of FIG. 5). When the user moves to the next gesture, the markup of the previous gesture may be displayed in the corresponding right-hand and left-hand fields. Clicking the “Copy” button copies these results into the right-hand and left-hand fields of the current gesture. Gesture identification and marking are thus simplified and accelerated.

In a further embodiment, similar to the save-and-advance function described above, the marking tool 110 may include a “Copy and Next” button. Alternatively, it may include a setting so that clicking the “Copy” button automatically advances to the next gesture. This speeds up the marking process.
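Copying the previous gesture's markup, and the combined “Copy and Next” action, can be sketched over a list of per-gesture results. This is a minimal sketch. It assumes:
- "previous" means the preceding gesture in the list;
- unmarked gestures are stored as `None`.

The function names are hypothetical.

```python
from typing import Dict, List, Optional

Markup = Dict[str, str]  # hand -> gesture identifier


def copy_previous(results: List[Optional[Markup]], index: int) -> Markup:
    """Markup of the previous gesture, copied so later edits do not change the original."""
    prev = results[index - 1] if index > 0 else None
    return dict(prev) if prev is not None else {}


def copy_and_next(results: List[Optional[Markup]], index: int) -> int:
    """Hypothetical "Copy and Next": copy into the current gesture, return the next index."""
    results[index] = copy_previous(results, index)
    return min(index + 1, len(results) - 1)
```

Returning a fresh dictionary matters. Otherwise, editing the copied markup of the current gesture would silently change the previous gesture's record.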

This increases marking speed by making the gesture components easier to interpret, both for the user and for the machine learning model. In addition, the machine learning model will train faster and will need less training data, so the proposed method of forming the dataset also reduces the required dataset size. The method also lowers the qualification requirements for the operator/user who performs the marking. Even a person with only initial familiarity with sign language, after being instructed in using the device, can mark up gestures with sufficient accuracy.

Example

The present invention was tested in practice with several annotators of different qualifications and different levels of sign language proficiency. FIG. 9 shows the test results. Each column gives the daily average number of gestures marked up per hour:
- the three leftmost columns (blue) show annotators whose marking tool was a plain list of the notation system's symbols on a computer screen;
- the remaining columns on the right (green) show annotators using the marking tool of the present invention, implemented in the embodiment shown in FIGS. 5-6.

As can be seen, the proposed invention measurably increases marking speed.

Application

The devices and methods of the present invention can be used to create a database for storing and sorting data on sign language gestures and their components. Such a database can then be used to develop sign language recognition systems that facilitate communication between deaf and hearing people.

Additional implementation features

The various illustrative blocks and modules described in connection with this disclosure may be implemented or performed by any of the following, or any combination thereof, designed to perform the functions described herein:
- a general-purpose processor;
- a digital signal processor (DSP);
- an application-specific integrated circuit (ASIC);
- a field-programmable gate array (FPGA) or other programmable logic device (PLD);
- discrete gate or transistor logic;
- discrete hardware components.

A general-purpose processor may be a microprocessor, but alternatively the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a DSP and a microprocessor, multiple microprocessors, one or more microprocessors together with a DSP core, or any other such configuration.

Some blocks or modules, individually or together, may be, for example, a processor configured to retrieve and execute computer programs from memory. The processor then performs the method steps or the functions of the blocks or modules according to embodiments of the present invention. According to embodiments, the device may further include memory, from which the processor retrieves and executes the computer programs that perform the method. The memory may be a separate device, independent of the processor, or may be integrated into the processor. It may store code, instructions, commands, and/or data for execution on a set of one or more processors of the described device. The code, instructions, or commands may direct the processor to perform the method steps or device functions.

The functions described herein may be implemented in hardware, in software executed by one or more processors, in firmware, or in any combination thereof, as applicable. The hardware and software implementing the functions may also be physically located at various positions. They may be distributed so that portions of the functions run at different physical locations, i.e., distributed processing or distributed computing may be used.

The memory mentioned above may be volatile memory, non-volatile memory, or both. A person skilled in the art will also understand that references to memory, and to storage of data, programs, code, instructions, commands, etc., imply a machine-readable (computer-readable, processor-readable) storage medium. Such a medium may be any available medium that meets two conditions:
- it can carry or store the desired program code in the form of instructions or data structures;
- it can be accessed by a general-purpose or special-purpose computer, processor, or other processing device.

Terms such as “first”, “second”, “third”, and the like may be used herein to describe various elements, components, regions, layers, and/or sections. It should be understood that these elements, components, regions, layers, and/or sections are not limited by these terms. The terms serve only to distinguish one element, component, region, layer, or section from another. A first element, component, region, layer, or section could therefore be termed a second one without departing from the scope of the present invention. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Elements referred to in the singular do not exclude a plurality of such elements unless expressly stated otherwise.

The functionality of an element described in the specification or claims as a single element may in practice be implemented by several components of the device. Conversely, the functionality of elements described as several separate elements may in practice be implemented by a single component.

Exemplary embodiments have been described in detail and shown in the accompanying drawings. It should be understood that such embodiments are merely illustrative and are not intended to limit the present invention. The invention is not limited to the specific arrangements and structures shown and described. Various other modifications and embodiments may be apparent to a person skilled in the art, based on this specification and knowledge of the prior art, without departing from the spirit and scope of the invention.

Claims

1. A method for marking sign language gestures, implemented using a computer device containing a processor, memory, a screen and a screen pointer control device, containing the steps of:

- display in the user interface a video fragment containing one component of a gesture, a gesture or a phrase of sign language, while the frame contains one speaker in frontal view, performing at least one sign language gesture during the video, and the frame completely covers the area of space in which the gesture is being performed, and the parts of the speaker's body involved in performing the gesture are

- provide an interface element for indicating the right hand and the left hand, for performing the corresponding markup,

- display in the user interface a set of visual representations characterizing the gesture components included in a predefined sign language notation system, for performing the corresponding markup, wherein the set of visual representations contains at least the following blocks: a block of hand configuration visual representations and a block of hand orientation visual representations,

wherein each visual representation is an image or an animated image that characterizes the corresponding gesture component in simplified form,

wherein at least one block of visual representations contains sub-blocks of visual representations for each hand, and the sub-blocks are formed separately for each parameter of the components of the gesture,

wherein the parameters of the gesture components include a hand configuration type or a hand orientation type,

- receive from the user, as markup, for a gesture contained in the video fragment and for each hand participating in the gesture, an indication of the hand, made using the on-screen pointer, and an indication of one or more visual representations from each block of visual representations,

- for each hand participating in the gesture, generate a gesture identifier corresponding to the notation of this gesture in the notation system, based on the indications received from the user for this hand, and

- create a record in the markup file containing an indication of the video fragment and, for each hand involved in the gesture, the gesture identifier,

wherein the dataset containing the markup file and the video fragments specified in the markup file is used for machine learning of a sign language gesture recognition model, the video fragments being used as training data and the gesture identifiers from the markup file being used as markup.
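The identifier-generation and record-forming steps of claim 1 can be illustrated with a minimal sketch. It assumes a HYPOTHETICAL notation in which every visual representation maps to a short symbol code, a fixed block order, and a JSON Lines markup file; the claims do not specify the notation system or the file format, so the names `NOTATION`, `BLOCK_ORDER`, `HandMarkup`, `gesture_id` and `markup_record` are illustrative only.

```python
import json
from dataclasses import dataclass, field

# HYPOTHETICAL lookup tables: visual representation ID -> notation symbol.
# The notation system is not disclosed; these codes are placeholders.
NOTATION = {
    "configuration": {"fist": "C01", "flat": "C02", "index": "C03"},
    "orientation": {"palm_up": "O01", "palm_down": "O02", "palm_in": "O03"},
}

# Fixed block order, so the same selections always produce the same identifier.
# Further blocks (location, movement, non-manual) could be appended here.
BLOCK_ORDER = ["configuration", "orientation"]


@dataclass
class HandMarkup:
    hand: str  # "right" or "left", as chosen with the on-screen pointer
    selections: dict = field(default_factory=dict)  # block -> representation IDs


def gesture_id(markup: HandMarkup) -> str:
    """Build one hand's gesture identifier from the representations selected for it."""
    parts = []
    for block in BLOCK_ORDER:
        chosen = markup.selections.get(block, [])
        if not chosen:
            # Every block needs at least one selected representation.
            raise ValueError(f"{markup.hand} hand: nothing selected in block {block!r}")
        # Several representations may be picked in one block; sort for determinism.
        parts.append("+".join(sorted(NOTATION[block][c] for c in chosen)))
    return "|".join(parts)


def markup_record(fragment: str, hands: list) -> str:
    """Serialize one markup-file line: the fragment reference plus one ID per hand."""
    return json.dumps(
        {"fragment": fragment, "hands": {h.hand: gesture_id(h) for h in hands}},
        sort_keys=True,
    )


if __name__ == "__main__":
    right = HandMarkup("right", {"configuration": ["flat"], "orientation": ["palm_down"]})
    print(markup_record("clip_0001.mp4", [right]))
    # {"fragment": "clip_0001.mp4", "hands": {"right": "C02|O02"}}
```

Sorting the symbols inside a block and keeping the block order fixed means two annotators who pick the same representations for a hand obtain byte-identical identifiers, which keeps the labels consistent across the dataset.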

2. The method according to claim 1, further comprising the step of:

- make a video recording of a person speaking sign language.

3. The method according to claim 1, wherein the speaker wears clothing whose color contrasts with the background color, with the color of the face and with the color of the hands or gloves; the clothing completely covers the body up to the speaker's neck and the arms at least in the shoulder area, has no elements protruding above the neck, and follows the shape of the speaker's body; the color of the hands or gloves contrasts with the color of the clothing; and the background color contrasts with the color of the face and with the color of the hands or gloves.

4. The method according to claim 1, further comprising the steps of:

- receive and display in the user interface an input video containing at least one sign language gesture,

- receive input from the user indicating one or two points in time within the duration of the input video to select a video fragment,

- extract from the input video a video fragment containing one gesture component, a gesture or a phrase, wherein the video fragment contains one or more consecutive video frames.
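Claim 4 leaves open how one or two user-selected time points become a fragment of consecutive frames. A minimal sketch, assuming a constant frame rate and ASSUMING that a single point selects only the frame on screen at that moment, maps the points to a range of frame indices; `fragment_frames` is a hypothetical helper, and decoding the selected frames (for example with OpenCV's `cv2.VideoCapture`) is left out.

```python
import math
from typing import Optional


def fragment_frames(fps: float, total_frames: int,
                    t_first: float, t_second: Optional[float] = None) -> range:
    """Return the consecutive frame indices covered by the user's time point(s).

    One point selects the single frame shown at that moment; two points select
    every frame from the earlier point through the later one (order-insensitive).
    Assumes a constant frame rate.
    """
    if fps <= 0 or total_frames <= 0:
        raise ValueError("fps and total_frames must be positive")
    duration = total_frames / fps
    points = [t_first] if t_second is None else sorted([t_first, t_second])
    for t in points:
        if not 0 <= t <= duration:
            raise ValueError(f"time {t}s is outside the video (0..{duration}s)")
    # Frame i is on screen during [i / fps, (i + 1) / fps); clamp the final instant.
    first = min(math.floor(points[0] * fps), total_frames - 1)
    last = min(math.floor(points[-1] * fps), total_frames - 1)
    return range(first, last + 1)
```

Returning a `range` keeps the fragment guaranteed contiguous, matching the requirement that the fragment consist of one or more consecutive frames.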

5. The method according to claim 1, wherein the set of visual representations further comprises a block of gesture location visual representations, a block of hand movement visual representations and a block of visual representations of non-manual gesture components,

and the parameters of the gesture components further include the gesture location, the movement type or the non-manual component involved in the gesture.

6. The method according to claim 1, further comprising the step of:

- generate and store in memory a dataset containing the markup file and the video fragments specified in the markup file.
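How such a dataset might be turned into training pairs can be sketched as follows. The sketch ASSUMES a hypothetical JSON Lines markup file in which each line holds a `fragment` path (relative to a video directory) and a `hands` object mapping "right"/"left" to gesture identifiers; the claims do not prescribe this format, and `load_dataset` and `label_vocab` are illustrative names. Each video fragment becomes a training input and its per-hand identifiers become the label.

```python
import json
from pathlib import Path


def load_dataset(markup_path, video_dir):
    """Pair each video fragment (training input) with its gesture IDs (labels)."""
    samples = []
    with open(markup_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the markup file
            record = json.loads(line)
            video = Path(video_dir) / record["fragment"]
            if not video.is_file():
                # Every fragment referenced in the markup must exist in the dataset.
                raise FileNotFoundError(f"line {line_no}: missing fragment {video}")
            samples.append((video, record["hands"]))
    return samples


def label_vocab(samples):
    """Assign an integer class index to every distinct (right, left) identifier pair."""
    keys = sorted({(h.get("right", ""), h.get("left", "")) for _, h in samples})
    return {key: index for index, key in enumerate(keys)}
```

Treating each (right, left) pair as one class is only one possible labeling choice; a model could instead predict each hand's identifier separately.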

7. A device for marking gestures in a sign language, comprising a processor, memory, a screen and a means for controlling a screen pointer, wherein the device is configured to:

- display in the user interface a video fragment containing one gesture component, a gesture or a sign language phrase, wherein the frame contains one speaker in frontal view who performs at least one sign language gesture during the video, and the frame completely covers both the area of space in which the gesture is performed and the parts of the speaker's body involved in performing the gesture,

- receive from the user, as markup, for a gesture contained in the video fragment and for each hand participating in the gesture, an indication of the hand, made using the on-screen pointer, and an indication of one or more visual representations from each block of visual representations,

- for each hand participating in the gesture, generate a gesture identifier corresponding to the notation of this gesture in the notation system, based on the indications received from the user for this hand,

- generate a record in a markup file containing an indication of the video fragment and, for each hand involved in the gesture, the gesture identifier,

wherein the dataset containing the markup file and the video fragments is used for machine learning of a sign language gesture recognition model, the video fragments being used as training data and the gesture identifiers from the markup file being used as markup.

8. The device according to claim 7, further comprising a camera configured to:

- make a video recording of a person speaking sign language.

9. The device according to claim 7, further configured to:

- receive and display in the user interface an input video containing at least one sign language gesture,

- accept input from the user indicating one or two points in time within the duration of the input video to select a video fragment,

- extract from the input video a video fragment containing one component of a gesture, a gesture or a phrase, wherein the video fragment contains one or more consecutive video frames.

10. The device according to claim 7, wherein the set of visual representations further comprises a block of gesture location visual representations, a block of hand movement visual representations and a block of visual representations of non-manual gesture components.

11. The device according to claim 7, further configured to:

- generate and save in memory a data set containing a markup file and video fragments specified in the markup file.