RU2340941C2

RU2340941C2 - Method of handwriting specimen similarity estimation and methods of identity verification and handwriting identification using this estimation method

Info

Publication number: RU2340941C2
Application number: RU2006133724/09A
Authority: RU
Inventors: Сергей Олегович Новиков (RU)
Original assignee: Институт проблем информатики Российской академии наук
Priority date: 2006-09-21
Filing date: 2006-09-21
Publication date: 2008-12-10
Also published as: RU2006133724A

Abstract

FIELD: information technologies.

SUBSTANCE: the invention relates to image recognition, specifically to the automatic analysis of handwriting specimens stored in digital form. The similarity of handwriting specimens is estimated quantitatively by evaluating a quantitative measure of proximity between the vector templates of the compared specimens. Forming each vector template includes obtaining the handwriting specimen in digital form, extracting a set of graphemes, and processing this set to produce a set of vector descriptions of graphemes, which is then transformed into the vector template. The graphemes are normalized in position and orientation using an estimate of the line slope in the original handwriting specimen, and a fixed number of grapheme points together with some of its metric characteristics are used when transforming a grapheme into a vector. Using a representative training sample of handwriting specimen description vectors, an operator that reduces these vectors to principal components is found, which makes it possible to analyze vectors of smaller dimension and thereby substantially simplifies practical implementation of the method.

EFFECT: the invention allows simple and reliable comparison of handwriting specimens, as well as identity verification and personal identification by handwriting.

18 cl, 2 dwg

Description

Technical field

The present invention relates to the field of data recognition and digital data processing by means of electrical devices, and more particularly to the automatic analysis of handwriting samples presented in digital form, in particular for verifying a person's identity by handwriting or for identifying a handwriting and the person to whom it belongs.

State of the art

With the development of computer technology and electronic communications, various methods of verifying identity from digital handwriting samples (for example, signature samples) are becoming increasingly widespread. Any verification method can also be used for identification (obtaining a list of the most similar samples) when searching a database of templates.

Known verification methods used in various electronic transactions are described, for example, in RU 2000114185, IPC G06T 7/00, 2002 and RU 2002119571, IPC G06F 17/60, 2004. The most effective handwriting verification methods rely on determining a quantitative estimate of the similarity of handwriting samples. In this case one of the samples is stored as a pre-formed template (a description in digital form) and is linked to the identifying data of a specific person. The second sample is presented for recognition, and verification is considered successful if the similarity estimate of the compared samples is high enough to recognize the handwriting as identical in both cases. One known method for analyzing the similarity of handwriting samples in the form of signatures, used for personal identification, is described in RU 2148274, IPC G06K 9/22, G06K 9/62, G06F 15/18, 2000. However, this and similar methods require special equipment, such as special graphics tablets, and consequently have a limited field of application.

Various methods are also known for obtaining a quantitative estimate of the similarity of handwriting samples for verification or identification that use standard equipment to convert handwritten samples into digital form. Such methods are based, in particular, on automatic clustering and hidden Markov models (see, for example, A. Schlappbach, H. Bunke. Off-Line Handwriting Identification Using HMM Based Recognizers. IEEE, 2004 (2), pp. 654-658) or on the use of a specific set of features (see, for example, G. Leedham, S. Chachra. Writer Identification Using Innovative Binarised Features of Handwritten Numerals. IEEE, ICDAR 2003, pp. 413-417). The known methods use normalization, with all the attendant inconveniences, and transcription; that is, they are very cumbersome and require substantial human involvement at the template formation stage.

The closest analogue of the proposed method is the method for determining a quantitative estimate of the similarity of handwriting samples presented in A. Bensefia, T. Paquet, L. Heutte. Handwritten Document Analysis for Automatic Writer Recognition. Electronic Letters on Computer Vision and Image Analysis, 2005, 5(2), pp. 72-86. Like the other methods mentioned above, this method relies on a database of templates that correspond to different handwritings and are formed from a collection of handwritten documents written in different hands (i.e. by different authors). The preparation of each template includes obtaining a handwriting sample in digital binarized form and preprocessing it, which involves, in particular, segmenting each handwriting sample to extract a set of graphemes and filtering out noise regions. Further processing of each set of graphemes (using an automatic clustering procedure) yields sets of vector descriptions of graphemes, which form the basis of the vector templates of the handwriting samples. As in other similar methods, a quantitative measure of the proximity of the vector templates is used as the measure of proximity of the compared handwriting samples. The cited work by Bensefia et al. also describes the use of this similarity estimation method for carrying out verification and identification.

The use of automatic clustering operations in the known method substantially complicates its implementation. In addition, the results of the statistical analysis performed during verification or identification by the known method depend on the specific database (when a handwriting sample of a new author is entered, the list of all states for all authors changes). Moreover, the verification or identification decision is made using a statistical mutual-information criterion on data of very high dimension (400-500 states).

Disclosure of invention

Thus, there is a need for a simple-to-implement and effective method for the quantitative comparison of handwriting samples that can be digitized by standard digital input devices of moderate resolution. At the same time, high reliability of the resulting estimates must be ensured without requiring handwriting samples that contain a large number of characters.

Another problem solved by the invention is to enable text-independent handwriting recognition (for verification and/or identification), i.e. with the compared handwriting samples obtained from texts of differing content.

These problems are solved by a method for determining a quantitative estimate of the similarity of handwriting samples, which includes the following operations:

- (a) obtaining each handwriting sample in digital binarized form;

- (b) segmenting each handwriting sample to extract a set of graphemes and filtering out noise regions;

- (c) processing each set of graphemes to obtain a set of vector descriptions of graphemes;

- (d) forming a vector template of the handwriting sample from each obtained set of vector descriptions of graphemes;

- (e) obtaining a quantitative measure of the proximity of the vector templates of the compared handwriting samples; and

- (g) determining a quantitative measure of the proximity of the compared handwriting samples using the quantitative measure obtained in operation (e).

The distinguishing features of the method according to the invention are that the segmentation operation includes finding an estimate α of the line slope angle, skeletonizing the character lines in the handwriting sample, and removing the branch points of the lines, thereby dividing the handwriting sample into graphemes.

In addition, the operation of processing each set of graphemes includes:

obtaining a description of each grapheme as a set of coordinates of its points, where n_i is the number of points of the grapheme and (x_j, y_j) are the coordinates of its j-th point;

transforming each grapheme, using the found estimate α of the line slope angle, into a grapheme normalized in position and orientation, where (x_c, y_c) are the coordinates of the normalization reference point;

determining the metric characteristics of each grapheme and excluding graphemes with atypical metric characteristics; and

transforming each grapheme into a vector using a fixed number n_f of grapheme points, where n_f < n_i.
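The normalization and vectorization steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's own formula: the names `normalize_grapheme`, `resample` and `grapheme_to_vector` are hypothetical, the orientation normalization is assumed to be a plain rotation by -α about the reference point, the default reference point is assumed to be the centroid, and the fixed number of points is obtained by linear interpolation along the line rather than by selecting existing points.

```python
import math

def normalize_grapheme(points, alpha, ref=None):
    """Shift a grapheme to a reference point and rotate it by -alpha (radians)."""
    if ref is None:
        # Assumed default reference point: centroid of the grapheme's points.
        xc = sum(x for x, _ in points) / len(points)
        yc = sum(y for _, y in points) / len(points)
    else:
        xc, yc = ref
    c, s = math.cos(alpha), math.sin(alpha)
    return [((x - xc) * c + (y - yc) * s,
             -(x - xc) * s + (y - yc) * c) for x, y in points]

def resample(points, n_f):
    """Return n_f points spaced evenly by arc length (needs n_f >= 2, >= 2 points)."""
    cum = [0.0]  # cumulative arc length at every original point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out, k = [], 0
    for i in range(n_f):
        t = total * i / (n_f - 1)
        while k < len(points) - 2 and cum[k + 1] < t:
            k += 1
        seg = cum[k + 1] - cum[k]
        u = 0.0 if seg == 0 else (t - cum[k]) / seg
        (x0, y0), (x1, y1) = points[k], points[k + 1]
        out.append((x0 + u * (x1 - x0), y0 + u * (y1 - y0)))
    return out

def grapheme_to_vector(points, alpha, n_f=8):
    """Flatten the normalized, resampled grapheme into [x1, y1, x2, y2, ...]."""
    pts = resample(normalize_grapheme(points, alpha), n_f)
    return [coord for p in pts for coord in p]
```

For example, `grapheme_to_vector([(0, 0), (1, 1), (2, 2)], math.pi / 4, n_f=3)` gives a vector whose points lie on the horizontal axis, because the 45-degree stroke has been rotated back by the assumed slope angle.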

Before the template formation operation is performed for the handwriting samples being compared, operations (a)-(c) listed above are performed for each sample of a previously created collection of samples of different handwritings, forming a representative training sample of handwriting sample description vectors; from an analysis of this training sample, an operator that reduces the description vectors to principal components is determined.

A further feature is that operation (d) of the method according to the invention includes transforming, by means of this operator, each vector obtained in operation (c) for the compared handwriting samples into a vector of lower dimension, and using as the template of a handwriting sample the set of lower-dimension vectors corresponding to that sample.

Preferred embodiments of the method according to the invention are also proposed, characterized by corresponding additional features.

Thus, to ensure highly reliable identification of the analyzed handwriting sample, the length of this sample (like the length of every sample in the collection of handwriting samples used) is chosen so that the set of graphemes formed from each handwriting sample contains at least 300 graphemes. An important useful feature of the method is that the compared handwriting samples can be formed from texts of differing content. This considerably simplifies building the database of handwriting samples and makes it possible to analyze any available text of sufficient length, rather than only text prepared according to specific rules, for example text containing given words.
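As an illustration of how such a reduction operator might be obtained and applied, the sketch below estimates the leading principal directions of a training sample by power iteration with deflation, using only the standard library. The function names `fit_reduction_operator` and `reduce_vector`, the iteration count and the start vector are assumptions made for this example; a production implementation would more likely use a linear algebra library.

```python
import math

def fit_reduction_operator(vectors, k, iters=200):
    """Return (mean, components): the top-k principal directions of `vectors`."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(k):
        # Deterministic start vector; it must not be orthogonal to the
        # sought direction, which is assumed here.
        w = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            new = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in new))
            if norm == 0.0:
                break
            w = [x / norm for x in new]
        lam = sum(w[a] * sum(cov[a][b] * w[b] for b in range(d)) for a in range(d))
        components.append(w)
        # Deflation: remove the direction just found from the covariance matrix.
        cov = [[cov[a][b] - lam * w[a] * w[b] for b in range(d)] for a in range(d)]
    return mean, components

def reduce_vector(v, mean, components):
    """Project one description vector onto the retained principal directions."""
    return [sum((v[j] - mean[j]) * c[j] for j in range(len(v))) for c in components]
```

With this split, the operator is fitted once on the training sample, and a template is then simply the list `[reduce_vector(v, mean, components) for v in grapheme_vectors]` for one handwriting sample.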

Further, before the metric parameters of a grapheme are determined, it is recommended to select as its starting point whichever of its two end points (x₁, y₁) and (x_n, y_n) has the smaller x value or, when x₁ = x_n, the smaller y value. If the point (x_n, y_n) is selected as the starting point, the points in the description of the grapheme must be rearranged in reverse order.
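The start-point rule just described can be expressed in a few lines. The function name `orient_grapheme` is hypothetical, and a grapheme is assumed to be a list of `(x, y)` tuples in traversal order.

```python
def orient_grapheme(points):
    """Order the points so that the grapheme starts at its leftmost end point."""
    first, last = points[0], points[-1]
    # Tuples compare by x first and then by y, which matches the stated rule:
    # smallest x wins, and for equal x the smallest y wins.
    if (last[0], last[1]) < (first[0], first[1]):
        return points[::-1]
    return points
```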

Most of the operations of the method according to the invention can be implemented in various alternative ways. For example, when a grapheme is transformed into a normalized grapheme, the reference point may be the centroid of the original grapheme, its start or end, or its center of gravity.

Any suitable combination of the following parameters can be used as the metric characteristics of each grapheme: the cosine and sine of its slope angle, the length of the grapheme, and the dimensions of the smallest enclosing rectangle. Graphemes with atypical values of these characteristics (corresponding to various extraneous elements present in the handwriting sample, such as ruled lines, grid cells, etc.) are advisably excluded from the set of graphemes used by means of vector filtering. At the same time, these metric characteristics of the grapheme (or some of them) can be included among the components of the vector into which the grapheme is transformed.
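A minimal sketch of these metric characteristics follows. Several choices here are assumptions rather than statements from the text: the slope angle is taken as the angle of the chord from the first to the last point, the enclosing rectangle is taken as axis-aligned, and the filtering rule in the hypothetical `filter_atypical` (drop graphemes much longer than the median) is only one possible stand-in for the vector filtering mentioned above.

```python
import math

def metric_characteristics(points):
    """Cosine/sine of the chord angle, polyline length and bounding-box size."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    cos_a = (x1 - x0) / chord if chord else 1.0
    sin_a = (y1 - y0) / chord if chord else 0.0
    length = sum(math.hypot(bx - ax, by - ay)
                 for (ax, ay), (bx, by) in zip(points, points[1:]))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {"cos": cos_a, "sin": sin_a, "length": length,
            "width": max(xs) - min(xs), "height": max(ys) - min(ys)}

def filter_atypical(graphemes, max_ratio=5.0):
    """Keep graphemes no longer than max_ratio times the median length."""
    lengths = sorted(metric_characteristics(g)["length"] for g in graphemes)
    median = lengths[len(lengths) // 2]
    return [g for g in graphemes
            if metric_characteristics(g)["length"] <= max_ratio * median]
```

A long ruled line typically produces a grapheme far longer than any pen stroke, which is why a length-based rule is used in this example.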

The number of components in the lower-dimension vector can be determined from the results of processing the representative training sample with one of the standard dimensionality reduction methods, for example principal component analysis or independent component analysis. As the quantitative measure of proximity of the compared templates, it is preferable to use the product, over all components of this vector, of the p-values of one of the standard goodness-of-fit tests (for example, the Kolmogorov-Smirnov test or the chi-square test).

Among the advantages of the method according to the invention (which are discussed in more detail in the section "Implementation of the invention"), it avoids such stages of preprocessing the text samples as normalizing the entire image with respect to the size and slant of the handwritten characters. In addition, non-text elements (ruling, underlines, lines, noise, etc.) are removed automatically. No transcription stage, i.e. deciphering of the text itself, is required.

The invention also covers a method for verifying a person's identity from a handwriting sample under verification by determining the similarity between that sample and a pre-formed reference text sample linked to the identifying data of the person being verified, using a quantitative similarity estimate. To determine this estimate, the verification method according to the invention uses any of the above-described embodiments of the method for determining a quantitative estimate of the similarity of handwriting samples, and the reference handwriting sample is preferably stored as a template in a template database created for this purpose.

The invention further covers a method for identifying handwriting by determining the similarity between a sample of the handwriting to be identified and handwriting samples from a pre-formed database of identified handwriting samples, in which quantitative similarity estimates are used to compile a list of identified handwriting samples ranked by their similarity to the sample being identified. As proposed for the verification method above, the quantitative similarity estimates are determined using any of the above-described embodiments of the method for determining a quantitative estimate of the similarity of handwriting samples. The database of identified handwriting samples is preferably a database of templates of these samples.
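The proximity measure described in this paragraph can be illustrated with a two-sample Kolmogorov-Smirnov test applied to each component of the reduced vectors. The sketch below is an assumption-laden example: the names `ks_p_value` and `template_similarity` are hypothetical, a template is assumed to be a list of equal-length reduced vectors, and the p-value uses the common asymptotic series approximation rather than an exact distribution.

```python
import math

def ks_p_value(a, b):
    """Approximate (asymptotic) two-sample Kolmogorov-Smirnov p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0  # largest gap between the two empirical distribution functions
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # the series below is numerically useless this close to zero
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return min(1.0, max(0.0, p))

def template_similarity(t1, t2):
    """Product of per-component p-values for two templates."""
    score = 1.0
    for c in range(len(t1[0])):
        score *= ks_p_value([v[c] for v in t1], [v[c] for v in t2])
    return score
```

A score near 1 means that no component distribution differs detectably between the two templates, while a single strongly differing component drives the product toward 0.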

Brief description of the drawings

Fig. 1 shows a typical original handwriting sample suitable for carrying out the invention.

Fig. 2 shows the result of processing the handwriting sample shown in Fig. 1.

Implementation of the invention

The method for determining a quantitative estimate of the similarity of handwriting samples according to the invention can be divided into two stages: building a template and comparing templates. To implement the method, a database (DB) of templates built from a representative collection of samples of different handwritings is required. The procedure for building the templates entered into the DB is exactly the same as for the handwriting samples to be compared by the method according to the invention. To build a template for entry into the DB, it is desirable to have a text sample of at least 30 words. It is also desirable for the text to take the form of several (preferably at least three) handwritten lines. Text of this length and form is, in principle, sufficient for obtaining stable results, and one of the advantages of the method is that the original (handwritten) samples may correspond to texts of differing content. One of the real handwriting samples used in the experimental verification of the invention is shown in Fig. 1.

Building a template

The first operation performed at the image processing stage is obtaining the handwriting sample in digital binarized form. The original (handwritten) samples can be converted into electronic form by any suitable digital input device (preferably a standard flatbed scanner) at a resolution at which discretization effects do not degrade recognition quality (the recommended resolution is 300 dpi). The digitized image, or a selected region of it, is then converted to binary form; that is, the image pixels corresponding to character lines receive one of the two binary values, and the pixels corresponding to the background receive the other. Any automatic binarization method that separates the letter images well from the background can be used for this operation. In what follows it is assumed that the lines are white (value "1") and the background is black (value "0").
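Since the text leaves the choice of binarization method open, the following sketch uses Otsu's global threshold purely as one example of an automatic method. It assumes an 8-bit grayscale image given as a list of rows, with dark ink on light paper; the function names are hypothetical. Ink pixels are mapped to 1 and background pixels to 0, matching the convention stated above.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold for a 2-D list of gray levels in 0..255."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        score = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_t = score, t
    return best_t

def binarize(gray):
    """Map ink (dark pixels) to 1 and background (light pixels) to 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]
```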

After binarization of the handwriting sample, an estimate of the line slope angle (LSA) is found. Various methods of estimating the LSA are well known to those skilled in the art. One preferred option is a method based on summing the binary values along various directions in the monochrome raster image of the handwritten text (obtained by binarization). For each selected direction a one-dimensional sequence is obtained and an estimate of its variance is found. The value of the angle α that gives the largest variance is taken as the LSA estimate.
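The projection-based estimate described in this paragraph might be sketched as below. The function name `estimate_line_angle`, the search range of ±15 degrees, the 1-degree step and the sign convention (image rows growing downward, so a positive angle means the text line descends to the right) are all assumptions of this example.

```python
import math

def estimate_line_angle(binary, max_deg=15, step_deg=1):
    """Return the angle (radians) whose projection profile has maximal variance."""
    pts = [(x, y) for y, row in enumerate(binary)
           for x, v in enumerate(row) if v]
    if not pts:
        return 0.0
    height, width = len(binary), len(binary[0])
    # Fixed number of bins for every angle, so that variances are comparable.
    n_bins = 2 * math.ceil(math.hypot(width, height)) + 1
    mean = len(pts) / n_bins
    best_angle, best_var = 0.0, -1.0
    for deg in range(-max_deg, max_deg + 1, step_deg):
        a = math.radians(deg)
        c, s = math.cos(a), math.sin(a)
        profile = {}
        for x, y in pts:
            # Offset of the line through (x, y) that runs in direction a;
            # pixels on the same text line share (almost) the same offset.
            b = round(y * c - x * s)
            profile[b] = profile.get(b, 0) + 1
        var = sum(v * v for v in profile.values()) / n_bins - mean * mean
        if var > best_var:
            best_var, best_angle = var, a
    return best_angle
```

When the candidate direction matches the text lines, the foreground pixels pile up in a few bins and the profile becomes sharply peaked, which is what the variance criterion detects.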

Next, noise regions in the binary image of the handwriting sample are filtered out by one of the known methods, in particular with a low-pass filter, and the character strokes are skeletonized. After this transformation, every point of a character stroke in the handwriting sample (except at stroke endings and branchings) has exactly two neighbors with value 1.

In the next operation, the points corresponding to stroke branchings (i.e., points having more than two neighbors with value 1) are removed. This yields a set of disconnected lines in which every point has at most two neighbors of the same value, and exactly two points on each line (the end points) have only one such neighbor. Such lines are hereinafter called graphemes. A grapheme is a segment of a line (in discrete representation) without self-intersections, i.e., it is specified by the coordinates of its start point, its end point, and all the points of the line. Traversing each grapheme starting from an end point gives its description as a set of coordinates on a discrete grid.
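A minimal sketch of branch-point removal and grapheme tracing is given below. It assumes an ideal one-pixel-wide skeleton supplied as a set of (x, y) pixels and uses 8-connectivity; both are assumptions, as are the helper names. Closed loops and isolated pixels, which have no end point, are simply skipped in this sketch.

```python
def neighbours(p, pixels):
    """8-connected neighbours of pixel p = (x, y) that belong to `pixels`."""
    x, y = p
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0) and (x + dx, y + dy) in pixels]

def split_into_graphemes(skeleton):
    """skeleton: set of (x, y) skeleton pixels. Returns a list of graphemes,
    each an ordered list of points traced from one end point to the other."""
    # Remove branch points: pixels with more than two skeleton neighbours.
    pixels = {p for p in skeleton if len(neighbours(p, skeleton)) <= 2}
    graphemes, visited = [], set()
    # Every trace starts at an end point, i.e. a pixel with exactly one neighbour.
    for start in sorted(p for p in pixels if len(neighbours(p, pixels)) == 1):
        if start in visited:
            continue                      # already reached as the far end of a trace
        path, current = [start], start
        visited.add(start)
        while True:
            nxt = [q for q in neighbours(current, pixels) if q not in visited]
            if not nxt:
                break
            current = nxt[0]
            visited.add(current)
            path.append(current)
        graphemes.append(path)
    return graphemes
```

For a plus-shaped skeleton, the centre and its four immediate neighbours all have more than two neighbours under 8-connectivity, so they are removed and the four arms become separate graphemes.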

Let the handwriting sample being converted contain $n_g$ graphemes, and let the initial description of the i-th grapheme be given in the form

$$G_i = \{(x_j, y_j)\}, \quad j = 1, \dots, n_i, \qquad (1)$$

where $n_i$ is the number of points of the grapheme and $(x_j, y_j)$ are the coordinates of its j-th point.

Next, the graphemes are normalized by position by transformation (2), which uses the reference point $(x_c, y_c)$ and the estimate α of the line slope angle (the formula itself is not reproduced in the source text).

Any uniquely computable estimate can be taken as the reference point for normalization $(x_c, y_c)$: for example, the start point, the end point, or any function that maps the set of coordinates to a vector, such as the center of gravity (centroid):

$$x_c = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j, \qquad y_c = \frac{1}{n_i}\sum_{j=1}^{n_i} y_j.$$
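The position normalization can be illustrated by the sketch below, which shifts a grapheme to its reference point and rotates it to compensate for the line slope angle α. Because the transformation formula is not available in the text, the rotation direction (by -α, with α in radians) is an assumption, as are the function names; the centroid is used as the default reference point.

```python
import math

def centroid(points):
    """Center of gravity of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def normalize_position(points, alpha, ref=None):
    """Shift the points so that the reference point becomes the origin,
    then rotate them by -alpha (assumed sign convention)."""
    xc, yc = centroid(points) if ref is None else ref
    c, s = math.cos(alpha), math.sin(alpha)
    return [((x - xc) * c + (y - yc) * s,
             -(x - xc) * s + (y - yc) * c) for x, y in points]
```

For example, points lying on a line inclined at 45 degrees map onto the horizontal axis when `alpha = math.pi / 4`.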

If variability in character slant needs to be avoided, which may arise from the writer not keeping strictly to the line slope or from the writer's psychological state, an additional normalization by orientation can be performed, defined by any uniquely computable estimate, for example along the direction of the vector to the start of the grapheme $(x_1, y_1)^T$ or along the axis of inertia. After transformation (2), it is important to select the start point of the grapheme. This can be done, for example, as follows: of the two end points $(x_1, y_1)$ and $(x_{n_i}, y_{n_i})$, the one with the smaller abscissa is chosen; if the abscissas coincide, the one with the smaller ordinate is chosen. If the start point turns out to be $(x_{n_i}, y_{n_i})$, the points in the grapheme description are rearranged in reverse order. After the start point has been selected and the order of the points reversed where necessary, the next-level description of the grapheme is obtained (the formula is not reproduced in the source text).
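The start-point rule described above amounts to a lexicographic comparison of the two end points. A short sketch follows; the function name `orient_grapheme` is an assumption.

```python
def orient_grapheme(points):
    """Return the points ordered so that the grapheme starts at the end point
    with the smaller abscissa (ties broken by the smaller ordinate)."""
    first, last = points[0], points[-1]
    # Tuples compare by x first and then by y, which matches the rule.
    if (last[0], last[1]) < (first[0], first[1]):
        return points[::-1]
    return list(points)
```

With real-valued normalized coordinates, exactly equal abscissas are rare, so the tie-breaking branch matters mainly for coordinates on the discrete grid.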

After the graphemes have been converted to vector form, "vector" filtering is performed, i.e., graphemes with atypical metric characteristics are excluded. This filtering removes ruled lines, grid cells, and other non-text elements of the image.

Each such description is converted to a new description with a fixed number of points $n_f$ (the resampling formula is not reproduced in the source text; the square brackets in it denote rounding to the nearest integer). Thus the grapheme description is obtained as a vector $V_i$ with $2n_f$ components (the notation introduced for the individual components is not reproduced in the source text).
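One plausible reading of the fixed-length conversion is to pick $n_f$ points at evenly spaced positions along the grapheme, rounding each position to the nearest integer index. The sketch below follows that reading; the exact index formula and the interleaved layout (x1, y1, x2, y2, ...) of the resulting vector are assumptions, since the original formula is not available in the text.

```python
import math

def resample(points, n_f=16):
    """Select n_f points at evenly spaced, rounded indices (assumes n_f >= 2
    and, as in the claims, fewer selected points than original points)."""
    n_i = len(points)
    # floor(t + 0.5) rounds halves upward, avoiding Python's round-half-to-even.
    idx = [int(math.floor(k * (n_i - 1) / (n_f - 1) + 0.5)) for k in range(n_f)]
    return [points[j] for j in idx]

def to_vector(points, n_f=16):
    """Flatten the resampled points into a vector with 2 * n_f components."""
    v = []
    for x, y in resample(points, n_f):
        v.extend((x, y))
    return v
```

The first and last points of the grapheme are always retained, so the choice of start point made earlier carries over to the vector.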

Additional metric parameters are also computed for each grapheme; these may include, for example, the slope angle of the grapheme, its length, and/or the dimensions of its smallest enclosing rectangle. These features are further removed from the nature of the image (handwritten text) and instead describe texture parameters. Other texture features can also be used. In recognition (i.e., in verification or identification) they can be used to compute an additional criterion of agreement between texture characteristics.

A more preferred way of using the additional metric parameters is to append several further components to the $2n_f$ components of the grapheme, for example the length, the cosine and sine of the slope angle, and the dimensions of the smallest enclosing rectangle, giving a resulting vector of $n_0 = 2n_f + k$ components (k = 5 in this example). In this variant, all further processing is carried out on vectors of dimension $n_0$ (rather than $2n_f$).
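The five additional components of the example (k = 5) could be computed as sketched below. Two details are assumptions not fixed by the description: the slope angle is taken as the direction of the chord from the first to the last point, and the enclosing rectangle is taken to be axis-aligned. The function names are illustrative.

```python
import math

def metric_features(points):
    """Return [length, cos(slope), sin(slope), box width, box height]."""
    # Polyline length: sum of distances between consecutive points.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    dx = points[-1][0] - points[0][0]
    dy = points[-1][1] - points[0][1]
    chord = math.hypot(dx, dy)
    # Direction of the chord; a degenerate chord is treated as horizontal.
    cos_a, sin_a = (dx / chord, dy / chord) if chord > 0 else (1.0, 0.0)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [length, cos_a, sin_a, max(xs) - min(xs), max(ys) - min(ys)]

def extend_vector(v, points):
    """Append the k = 5 metric components to a 2 * n_f coordinate vector."""
    return list(v) + metric_features(points)
```

With $n_f = 16$ this gives vectors of dimension $n_0 = 2 \cdot 16 + 5 = 37$.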

If invariance to character size is required, the coordinates in description (2) can be normalized by dividing them by a metric parameter averaged over all graphemes.

The results of vectorizing the original handwriting sample (shown in Fig. 1) are presented in Fig. 2, where the light lines show the vector representation of the graphemes. The result in Fig. 2 was obtained with 16 points per grapheme ($n_f = 16$); to produce the graphical representation, the points are joined by straight line segments. As Fig. 2 shows, with this number of points the graphemes are rendered quite adequately in the vector representation. It can also be seen that non-text structures are filtered out automatically and that the graphemes are normalized by the integral value of the line slope angle.

Using a representative sample of handwriting description vectors obtained from a large set of texts by different authors, an operator is found that transforms the vectors $V_i$ into vectors $P_i$ of lower dimension $n_p$, with components indexed by $j = 1, \dots, n_p$, where $n_p < 2n_f$. The operator is found by one of the methods of reduction to principal components that are independent within some model, for example principal component analysis (PCA) or independent component analysis (ICA); see, for example, R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd ed.), New York: John Wiley Press, 2000, and L. I. Smith, A tutorial on principal components analysis, 2002, retrieved from www.cs.otago.ac.nz/cosc453/student.tutorials/principal_component.pdf. The dimension of the principal component vector is determined from the statistical properties of the sample under study.
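For the PCA variant, the reduction operator can be obtained once from the training matrix and then applied to any new vectors. The sketch below uses a singular value decomposition of the mean-centered data, which diagonalizes the covariance matrix; standardizing the columns first would correspond to the correlation matrix mentioned in the description. The names `fit_pca_operator` and `project` are assumptions.

```python
import numpy as np

def fit_pca_operator(V, n_p):
    """V: (N, n0) matrix of training vectors, one per row.
    Returns (mean, W), where the columns of W are the first n_p principal directions."""
    V = np.asarray(V, dtype=float)
    mean = V.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(V - mean, full_matrices=False)
    return mean, Vt[:n_p].T

def project(V, mean, W):
    """Map vectors V_i (rows of V) to lower-dimensional vectors P_i."""
    return (np.asarray(V, dtype=float) - mean) @ W
```

The pair `(mean, W)` is computed once from the training sample and reused unchanged for every handwriting sample that is later compared.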

The complete handwriting description template for a given portion of text is defined by the set of vectors $P_i$ obtained for the graphemes of that portion.

Method for comparing two templates

Since the principal components are assumed to be independent, two templates can be compared by computing, separately for each component k, the P-value of the statistic of one of the standard goodness-of-fit tests, for example the Kolmogorov-Smirnov or χ² (chi-square) test, applied to the two samples formed by the values of the k-th component in each of the two templates.

Denote this value by $p_k$. Since in the basis of the principal vectors of the correlation matrix (the principal components) the component values are treated as statistically independent of one another, a multidimensional estimate can be obtained simply by multiplying the estimates for the individual components. The quantity

$$S = f\left(\prod_{k=1}^{n_p} p_k\right)$$

can then be used as a measure of the similarity of two handwritings, where f(x) is any monotonically increasing function, usually chosen by normalizing to the error of falsely accepting an "impostor".
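The comparison can be sketched with a two-sample Kolmogorov-Smirnov test applied to each component and a product over the resulting P-values. The P-value below uses the standard asymptotic Kolmogorov series with a common small-sample correction; this particular approximation, the identity default for f, and the function names are assumptions rather than details given in the description.

```python
import math
from bisect import bisect_right

def ks_pvalue(a, b):
    """Asymptotic P-value of the two-sample Kolmogorov-Smirnov test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D = largest gap between the two empirical distribution functions.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0                        # the series is numerically 1 here
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return min(max(p, 0.0), 1.0)

def similarity(t1, t2, f=lambda x: x):
    """t1, t2: templates given as lists of equal-length component vectors.
    Returns f applied to the product of the per-component P-values."""
    prod = 1.0
    for k in range(len(t1[0])):
        prod *= ks_pvalue([v[k] for v in t1], [v[k] for v in t2])
    return f(prod)
```

For templates with many components the product can underflow; a monotonically increasing f such as the logarithm, applied as a sum of log P-values, would be a natural variation.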

The main distinguishing features and advantages of the method according to the invention are the following:

- graphemes in the analyzed samples are treated as characteristics of the writing, not as elements of letters or of other text-dependent structural units;

- graphemes are converted to vector form, which dispenses with such complex preprocessing steps as normalizing the whole image by the size and slant of the handwritten characters and removing non-text elements (ruling lines, underlines, lines, noise, etc.), whereas normalization and filtering of non-text structures in vector form are computationally much simpler;

- no segmentation into lines, individual words, letters, etc. is required;

- no transcription step, i.e., deciphering of the text itself, is required;

- the operator of transformation into the principal component space is computed once and thereafter does not depend on the data presented.

The listed advantages of the method for quantitatively assessing the similarity of handwriting samples make it highly effective as the basis of a method for verifying a person's identity from a handwriting sample submitted for verification. In this case, one of the two compared templates is the template of a reference handwriting sample stored in a database and linked to the identifying data of the person being verified. The template of the reference sample is formed in exactly the same way as described above, i.e., no additional restrictions are imposed on it regarding the content of the text, its division into lines, etc.

Moreover, as already noted, the verification criterion is an ordinary, theoretically grounded significance level. In addition, no additional non-robust settings are used of the kind that arise at the stage of automatic clustering and extraction of basic states in most known methods of similar purpose (including the methods mentioned in the "Background" section).

Using the proposed method of quantitatively assessing the similarity of handwriting samples for identification is essentially analogous to using it for verification. The identification method also relies on a database of identified handwriting samples, preferably organized as a database of templates of handwriting samples formed as described above. In this case, however, the handwriting sample to be identified is presented without any additional data on the identity of the person to whom it belongs, so the sample is compared not with a single reference sample but with a set of identified samples. From the results of this multiple comparison, a list of identified handwriting samples is compiled, ranked by the values of their similarity assessments with the sample to be identified. If a template of the same handwriting is present in the template database, it will with high probability come first in the ranked list, i.e., it will have the highest (and a very high) similarity assessment with the template of the sample to be identified. The analyzed handwriting sample can then be identified as the handwriting of the person associated with that template, which thereby identifies the person to whom the analyzed sample belongs.
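Verification and identification differ only in how the similarity scores are used, as the following sketch shows. The similarity function is passed in as a parameter; the dictionary layout of the database, the threshold parameter, and the function names are assumptions made for illustration.

```python
def verify(sample, reference, similarity, threshold):
    """Verification: accept when the similarity to the single reference
    template reaches the chosen threshold."""
    return similarity(sample, reference) >= threshold

def identify(sample, database, similarity):
    """Identification: score the sample against every stored template and
    return (name, score) pairs ranked from most to least similar.
    database: dict mapping a person's identifier to that person's template."""
    scores = [(name, similarity(sample, tpl)) for name, tpl in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The first entry of the ranked list is the candidate author; the threshold used in `verify` would be chosen from the acceptable false-acceptance rate.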

Experimental testing of the invention gave the following results, which indicate the high reliability of the verification and identification methods according to the invention: with at least 300 graphemes in the compared samples, the identity of the compared handwriting samples can be asserted with 90% confidence in 95% of cases, with 99% confidence in 70% of cases, and with 99.9% confidence in 60% of cases.

It will be apparent to those skilled in the art that numerous modifications and additions can be made to the specific embodiments presented in this description without departing from the scope of the proposed group of inventions. For example, various techniques for the quantitative comparison of handwriting samples in vector form can be applied, as can various goodness-of-fit tests, etc.

Claims

1. A method for determining a quantitative assessment of the similarity of handwriting samples containing characters located in at least one line, including:

(a) obtaining each sample of handwriting in digital binarized form,

(b) segmentation of each handwriting sample with a selection of a set of graphemes and with filtering of noise sections;

(c) processing each set of graphemes to obtain a set of vector descriptions of graphemes,

(d) forming, on the basis of each set of vector descriptions of graphemes, a vector template of a handwriting sample,

(e) obtaining a quantitative measure of the proximity of the vector templates of the compared handwriting samples and

(g) determining a quantitative measure of the proximity of the compared handwriting samples using the quantitative measure obtained in operation (e), characterized in that

operation (b) includes finding an estimate of the slope angle of the lines, skeletonizing the character lines in the handwriting sample, and removing the branch points of the lines so as to divide the handwriting sample into graphemes $G_i$;

operation (c) includes

obtaining a description of each grapheme $G_i$ as a set of coordinates $\{(x_j, y_j)\}$, $j = 1, \dots, n_i$, where $n_i$ is the number of points of the grapheme and $(x_j, y_j)$ are the coordinates of its j-th point,

transforming, using the found estimate α of the line slope angle, each grapheme $G_i$ into a grapheme normalized by position and orientation (the transformation formula is not reproduced in the source text), where $x_c$, $y_c$ are the coordinates of the normalization reference point,

determining the metric characteristics of each grapheme, excluding graphemes with atypical metric characteristics, and converting each grapheme into a vector using a fixed number $n_f$ of points of the grapheme, where $n_f < n_i$; wherein, before operation (d) is performed for the compared handwriting samples, a set of samples of different handwritings is created, operations (a) - (c) are performed for each sample of the set to form a representative training sample of handwriting sample description vectors, and an operator for reducing the handwriting sample description vectors to principal components is determined by analyzing the generated training sample, wherein

operation (d) includes the conversion by means of the specified operator of each vector obtained in operation (c) for the handwriting samples being compared into a vector of lower dimension and

the use of a set of vectors of lower dimension corresponding to the specified sample as a template for handwriting.

2. The method according to claim 1, characterized in that the length of the handwriting samples is selected so that the set of graphemes formed from each handwriting sample is at least 300 graphemes.

3. The method according to claim 1, characterized in that the compared handwriting samples are formed from texts whose content does not coincide.

4. The method according to claim 1, characterized in that the handwriting samples forming said set of samples are formed from texts whose content does not coincide.

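5. The method according to claim 1, characterized in that the reference point used in transforming the grapheme $G_i$ into the normalized grapheme is the centroid of the grapheme $G_i$, with coordinates $x_c = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j$, $y_c = \frac{1}{n_i}\sum_{j=1}^{n_i} y_j$.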

6. The method according to claim 1, characterized in that the reference point used in transforming the grapheme $G_i$ into the normalized grapheme is the start point or the end point of the grapheme $G_i$.

7. The method according to claim 6, characterized in that, before the metric parameters of the normalized grapheme are determined, one of its two extreme points $(x_1, y_1)$ and $(x_{n_i}, y_{n_i})$ is selected as its start point: the point with the smaller value of x or, if $x_1 = x_{n_i}$, the point with the smaller value of y; if the point $(x_{n_i}, y_{n_i})$ is selected as the start point, the points in the description of the grapheme are rearranged in reverse order, converting the grapheme into its next-level description.

8. The method according to claim 1, characterized in that the metric characteristics used for each grapheme are the cosine and sine of its slope angle, the length of the grapheme, and/or the dimensions of its smallest enclosing rectangle.

9. The method according to claim 8, characterized in that by vector filtering, graphemes with atypical characteristics are excluded from the set of graphemes.

10. The method according to claim 8, characterized in that, when the grapheme is converted into a vector, said metric characteristics of the grapheme are included among the components of that vector.

11. The method according to claim 1, characterized in that the number of components in a vector of lower dimension is determined by the result of processing a representative training sample using the principal component analysis method.

12. The method according to claim 1, characterized in that the number of components in the vector of lower dimension is determined by the processing of a representative training sample of handwriting sample description vectors using the independent component analysis method.

13. The method according to claim 12, characterized in that the quantitative measure of the proximity of the templates is the product of the P-values of the Kolmogorov-Smirnov goodness-of-fit test over all components of the vector of lower dimension.

14. The method according to claim 12, characterized in that the quantitative measure of the proximity of the templates is the product of the P-values of the chi-square goodness-of-fit test over all components of the vector of lower dimension.

15. A method of verifying a person's identity from a handwriting sample submitted for verification, by determining the similarity between the sample being verified and a previously formed reference sample linked to the identifying data of the person being verified, using a quantitative similarity assessment, characterized in that the quantitative assessment of the similarity of the verified and reference samples is determined by the method according to any one of claims 1 to 14.

16. The method according to claim 15, characterized in that the reference handwriting sample is stored in the form of a template.

17. A method of identifying handwriting by determining the similarity between a handwriting sample to be identified and handwriting samples from a previously formed database containing identified handwriting samples, wherein quantitative similarity assessments are used and a list of identified handwriting samples is compiled, ranked by the values of their similarity assessments with the sample to be identified, characterized in that the quantitative assessment of the similarity between the sample to be identified and each handwriting sample used from said database is determined by the method according to any one of claims 1 to 14.

18. The method according to claim 17, characterized in that the database of identified handwriting samples is a database of templates of these samples.