RU2718978C1 - Automated legal advice system control method

Info

Publication number: RU2718978C1
Application number: RU2019129976A
Authority: RU
Inventors: Валерий Викторович Мешков; Руслан Владимирович Хюрри; Ольга Викторовна Приходько
Original assignee: Общество с ограниченной ответственностью «ПРАВОВЕД.РУ ЛАБ»
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2020-04-15

Abstract

FIELD: computer technology.

SUBSTANCE: a method of controlling an automated legal advice system consists in forming a database of arbitrary questions; selecting questions from one area of law, for which the number of answer directions is determined in the form of question categories; vectorizing the question texts into binary question vectors; and classifying the questions into categories by vector proximity, so that classes are formed for the branches of law. When a user question enters the system, lexical parsing of the user question is performed to determine its lexical structure; based on the classes formed for the branches of law, the provision of a regulatory legal act contained in the knowledge database that is relevant to the user question is determined using the vectors of regulatory legal acts; a lexical structure of the answer is formed that corresponds to the lexical structure of the user question and to the provision of the regulatory legal act relevant to the answer to the user; and the lexical structure of the answer is provided to the user as a text message.

EFFECT: higher probability of generating a correct answer to a user question and reduced influence of unqualified experts on the system's results.

10 cl, 5 dwg, 1 tbl

Description

TECHNICAL FIELD

The invention relates to computer technology, in particular to automated expert systems, and can be used as an automated system for answering legal questions, for example when identifying grounds for bringing claims, establishing facts of offenses, and determining how to act lawfully in complex everyday situations.

BACKGROUND

Known systems use feedback to build decision methods based on the reactions of end users. Errors in those methods are usually eliminated as the database of questions and answers grows and as assessments of the correctness of expert decisions are collected through user surveys. The drawback of known systems is that, when applied to questions of legal practice, they do not ensure correct decisions or correct answers, for reasons of both an objective and a subjective nature.

One objective reason is that some legal conflicts have no correct answer, and the decision is reached subjectively, for example by a judge relying on a subjective notion of justice. A subjective reason may be the interaction between the user of the expert system and the department that decides on the matter of interest. For example, information submitted improperly or after the established deadline may not be accepted for consideration, which may lead the user to give negative feedback on the entire matter, even though the only shortcoming of the recommendation was that it did not stress the importance of observing the required form and deadlines for submitting the proper documents. In addition, interpretations of regulatory documents may change in the course of law enforcement.

The prior art includes a legal reference system for storing and retrieving data (see RU 2223537, 10.02.2004), comprising a search type selection unit, a query generator, one output of which is connected to a search unit whose input-output is connected via corresponding buses to the system's databases, a display unit, and a controller intended to control data retrieval, characterized in that the system additionally includes a search condition selection unit whose input is connected to the output of the search type selection unit, whose first output is connected to the input of the query generator, and whose second output is connected to the input of a filter attribute generator; the first input-output of the filter attribute generator is connected to the second input-output of the query generator, and its second input-output to a filter attribute memory unit; the first input-output of the query generator is connected to a query memory unit.

The drawbacks of this solution include, at least, the inability to generate answers to the question posed by the user and to ensure high adequacy of answers, that is, understanding the question posed by the user and generating an answer that corresponds to it.

The prior art also includes a system for collecting and analyzing legal information in order to identify problem areas and deficiencies in legal regulation (see RU115095, 20.04.2012), comprising databases, a unit for entering documents into those databases, a central server, and users' computerized devices, characterized in that the databases are combined into a block that includes at least the databases "Legal Acts", "Legal Consultations", "Judicial Practice", "Dissertations and Research", "Reviews, Articles, Publications", "Foreign Legislation", "Regional Legislation", and "State Legislation", each containing a selection of documents related to its subject and bearing identification attributes for locating and displaying them on a monitor in electronic form; a unit for entering these documents into the database, configured to convert a document on paper or machine-readable media into electronic form, assign identification attributes, and distribute it among the databases; a unit for loading documents from the database block, selected by a common criterion of matching identification attributes, which is connected to the central server, the server being able to connect to users' computerized devices to display loaded documents on their monitors, and which provides a search facility for forming a request in a request processing unit connected to the central server; a user registration unit, configured to store user data together with access codes to the database block and the scope of access rights, which is connected to the central server and to the request processing unit in order to allocate an authorized access channel to the user after identification upon connection to the central server; and a unit for collecting proposals sent in open-access mode from users' computerized devices, configured to register each proposal, identify its sender, and pass the proposal to a unit that groups proposals by topic or subject purpose and passes the grouped packets to an analytical processing unit, which compares the issues raised in the proposals with the legal norms governing them and with documents from the database block in order to form a conclusion.

Also known from the prior art are a method and system (see US 20100063925 A1, 11.03.2010) for the provision of certain types of legal services by licensed and practicing lawyers to clients and prospective clients of a law firm. The system implementing the method comprises a first computing device for transmitting a first set of client-requested information; a second computing device, remote from the first, for transmitting a set of legal advisory information; an intermediate server configured to operate remotely with the first and second devices; and at least one database associated with the intermediate server and capable of storing the client-requested information and the legal advisory information. The method includes entry by a lawyer of an answer to a client's question, storage of that answer in the database, and an interface through which the client receives the answer using software installed on the user's computing device.

The drawbacks of this solution are, at least, a low probability of generating a correct answer to the user's question, a strong influence of experts, including unqualified experts, on the system's results, and low adequacy of answers, that is, of understanding the question posed by the user and generating an answer that corresponds to it.

Thus, there is a need in the art for an automated expert system for resolving legal questions that maximizes the probability of correct decisions and issues recommendations complete enough for the user either to resolve the matter independently or to assess its complexity and entrust it to a qualified specialist. There is also a need in the art for a method of controlling an automated legal advice system.

SUMMARY OF THE INVENTION

The proposed invention solves the stated problem and achieves a technical result consisting in an increased probability of generating a correct answer to the user's question, keeping law enforcement practice up to date, reduced influence of unqualified experts on the system's results, and improved adequacy of answers. Here adequacy means correctly understanding the question posed by the user and generating an answer that corresponds to it.

According to one embodiment, a method is proposed for controlling an automated legal advice system comprising a knowledge database that contains the set of norms and rules of current legal documents, a question preprocessing and classification module configured to form a vector representation of the substance of questions, and a module that forms answers from the knowledge database in a lexical representation corresponding to the lexical structure of the question. In the method, a database of arbitrary questions is formed; questions from one area of law are selected, for which the number of answer directions is determined in the form of question categories; the question texts are vectorized into binary question vectors by representing unique words as a table that lists the unique words and indicates the presence or absence of each in the text of each question, the set of table values forming a vector; the questions are classified into categories by vector proximity, by grouping the questions using the norms and rules of current legal documents stored in the knowledge database and cluster analysis, during which clusters are formed whose number equals the number of groups formed by experts for each branch of law, that number being processed by cluster analysis together with the question vectors and the dictionary of unique words and word combinations, so that classes are formed for the branches of law. When a user question enters the system, lexical parsing of the user question is performed to determine its lexical structure; based on the classes formed for the branches of law, the provision of a regulatory legal act contained in the knowledge database that is relevant to the user question is determined using the vectors of regulatory legal acts; a lexical structure of the answer is formed that corresponds to the lexical structure of the user question and to the provision of the regulatory legal act relevant to the answer to the user; and the lexical structure of the answer is provided to the user as a text message.

In one particular embodiment, user questions are converted into a format suitable for automated analysis by automated processing of the questions, forming a vector representation for each question.

In one particular embodiment, the correspondence between the grouping of questions by vector proximity and the grouping of questions by answer direction is determined. If the two groupings do not correspond, the parameters used to convert user questions into a format suitable for automated analysis are changed. If they correspond, those parameters are used to convert user questions into a format suitable for automated analysis and to convert provisions of regulatory legal acts, forming vectors of regulatory legal acts.
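The correspondence check between the two groupings can be sketched as follows. The description does not say how correspondence is measured, so the purity score, the `threshold` value, and the function names below are assumptions made purely for illustration.

```python
from collections import Counter


def grouping_agreement(cluster_labels, expert_labels):
    """Purity: share of questions that fall in the majority expert
    category of their own cluster (an assumed measure, not from the patent)."""
    by_cluster = {}
    for cluster, category in zip(cluster_labels, expert_labels):
        by_cluster.setdefault(cluster, Counter())[category] += 1
    matched = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return matched / len(cluster_labels)


def groupings_correspond(cluster_labels, expert_labels, threshold=0.9):
    """True if the vector-based grouping matches the expert grouping
    closely enough; the threshold is a hypothetical tuning parameter."""
    return grouping_agreement(cluster_labels, expert_labels) >= threshold
```

When `groupings_correspond` returns `False`, a caller would adjust the conversion parameters (for example the stop-word list or the frequency cut-off) and repeat the vectorization and clustering.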

In one particular embodiment, the user questions converted into a format suitable for automated analysis are represented as query arguments and as answer descriptions, the latter represented as function values.

In one particular embodiment, questions are grouped by vector proximity and questions are grouped by answer direction.

In one particular embodiment, the question texts are processed as follows: irrelevant characters, including at least punctuation marks and HTML markup elements, are removed from the question text; the texts are tokenized by splitting sentences into separate symbols that serve as tokens; words are concatenated with prepositions and particles to preserve the emotional coloring of the texts; irrelevant words that carry no semantic load are removed, including pronouns, interrogative words, and words that occur in more than 80% of the questions; tokens are lemmatized, reducing them to dictionary form; and, if no dictionary form is found, stemming is performed by finding the stem of the given source word.
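The processing steps above can be sketched in Python as a minimal pipeline. The word lists `PARTICLES`, `STOP_WORDS`, `LEMMAS`, and `SUFFIXES` are toy English stand-ins invented for this example (a real system would rely on full linguistic dictionaries for the target language), and the suffix-stripping stemmer is deliberately naive.

```python
import re
from collections import Counter

# Hypothetical toy resources, not taken from the patent.
PARTICLES = {"not", "no", "without"}
STOP_WORDS = {"i", "my", "what", "how", "who", "can", "was", "are", "the", "a", "to"}
LEMMAS = {"fired": "fire", "wages": "wage", "paid": "pay"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Strip HTML tags and punctuation, then split into lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.lower().split()


def join_particles(tokens):
    """Glue a particle or preposition to the word that follows it."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in PARTICLES and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def normalize(token):
    """Lemmatize via the dictionary; fall back to naive suffix stemming."""
    head, sep, word = token.rpartition("_")  # keep a glued particle intact
    if word in LEMMAS:
        return head + sep + LEMMAS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return head + sep + word[: -len(suffix)]
    return token


def preprocess(questions, max_doc_share=0.8):
    """Run the full pipeline and drop tokens found in more than 80% of questions."""
    tokenized = [join_particles(tokenize(q)) for q in questions]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    limit = max_doc_share * len(questions)
    return [
        [normalize(t) for t in toks
         if t not in STOP_WORDS and doc_freq[t] <= limit]
        for toks in tokenized
    ]
```

Joining a particle to the next word keeps negation visible in the resulting term (for example `not_pay`), which a plain bag of words would otherwise lose.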

In one particular embodiment, the dictionary of unique words consists of the dictionary forms obtained by lemmatizing tokens and/or by stemming.

In one particular embodiment, for each phrase the presence of unique words and word combinations in the question phrase is determined and a table row is formed in which the number of cells equals the number of words and word combinations in the dictionary, so that each cell corresponds to one dictionary word or word combination and the value entered in a cell indicates the presence or absence in the phrase of the word or word combination corresponding to that cell's position or number; the set of values entered in the cells of the table is a vector and is used to analyze the corresponding phrase.
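The table row described above is a binary bag-of-words vector. A minimal sketch, assuming the questions have already been reduced to lists of normalized terms; the function names `build_vocabulary` and `binarize` are illustrative only.

```python
def build_vocabulary(token_lists):
    """Collect unique terms in first-seen order so cell positions stay stable."""
    vocab, seen = [], set()
    for tokens in token_lists:
        for term in tokens:
            if term not in seen:
                seen.add(term)
                vocab.append(term)
    return vocab


def binarize(tokens, vocab):
    """One table row: 1 if the dictionary term occurs in the phrase, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


token_lists = [["fire", "not_pay"], ["claim", "unpaid"], ["fire", "claim"]]
vocab = build_vocabulary(token_lists)
vectors = [binarize(tokens, vocab) for tokens in token_lists]
```

Each vector has as many cells as the dictionary has entries, so vectors of different questions can be compared position by position.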

In one particular embodiment, clustering uses the k-means algorithm, which sorts the questions labeled by specialists into clusters, each question being assigned to the single cluster located at the smallest distance from the question vector.
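A compact illustration of k-means over binary question vectors follows. It is a plain Lloyd-style loop using Euclidean distance; the initialization from the first `k` vectors and the iteration cap are simplifying assumptions, and `k` stands for the number of expert-defined groups.

```python
import math


def nearest(vector, centroids):
    """Index of the centroid at the smallest Euclidean distance from the vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def k_means(vectors, k, iterations=20):
    """Assign every vector to exactly one of k clusters."""
    # Naive initialization (assumption): the first k vectors seed the centroids.
    centroids = [[float(x) for x in v] for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                # Column-wise mean of the cluster members.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


vectors = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
labels, centroids = k_means(vectors, k=2)
```

A new user question can then be placed by calling `nearest` on its vector with the final centroids.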

In one particular embodiment, lexical parsing is performed using at least one semantic parser. The parsers search the questions for character strings predefined in dictionaries, where the character strings are words or word combinations united by a common property in the source text of the user question, so that classes of legal problems are created that are identified by the algorithm and share a common legislative basis, which determines the cluster to which the question belongs.
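The dictionary search performed by such a parser might look like the sketch below. The class names and dictionary entries in `PARSER_DICTIONARIES` are hypothetical examples, and whole-word regular-expression matching is one possible way to locate the predefined strings; the description does not prescribe a matching technique.

```python
import re

# Hypothetical dictionaries: legal problem class -> predefined words / word combinations.
PARSER_DICTIONARIES = {
    "unpaid_wages": ["wages", "salary", "not paid"],
    "dismissal": ["fired", "dismissed", "laid off"],
}


def parse_question(text, dictionaries=PARSER_DICTIONARIES):
    """Return, for each class, the dictionary strings found in the question text."""
    lowered = text.lower()
    found = {}
    for problem_class, entries in dictionaries.items():
        hits = [
            entry for entry in entries
            if re.search(r"\b" + re.escape(entry) + r"\b", lowered)
        ]
        if hits:
            found[problem_class] = hits
    return found
```

For the question "I was laid off and my salary was not paid." the function reports hits in both classes, which a later step could use to pick the relevant cluster.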

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a diagram of the configuration and operation of the proposed system, which implements the proposed method;

FIG. 2 illustrates the operation of the semantic parser;

FIG. 3 illustrates an exemplary embodiment of the present invention;

FIG. 4 illustrates a simplified example of a hardware implementation of the proposed invention;

FIG. 5 illustrates an example of a computing system suitable for implementing elements of the proposed invention.

DESCRIPTION OF EMBODIMENTS OF THE INVENTION

To solve the stated problem and achieve the technical result, a method is proposed that is implemented by a legal expert system whose configuration and operation are shown in FIG. 1.

In the course of controlling the system, a database of labeled data (the database of arbitrary questions) 128 is formed using data from the database of unlabeled information (data) 115, as described herein.

The database of unlabeled information 115 contains user questions and consultants' recommendations formulated in response to those questions, both expressed in natural language. A database created, in particular prepared, from the communication of users 111 with human consultants can be used as the database of unlabeled information 115. Subsequently, the database of unlabeled information 115 can also be supplemented with dialogs with the system, in particular with user questions (222, FIG. 2) containing source text (232, FIG. 2) and with answers (282, FIG. 2) generated by the system, in particular by the answer generation module.

The labeled-data database 128 is built by the question preprocessing and classification module from user questions (converted into a format suitable for automated analysis), in particular represented as query arguments, and from answer descriptions represented as function values.

The system shown in FIG. 1 also contains a knowledge database 147, which holds the set of norms and rules of the legal documents in force, represented as arguments of answer descriptions.

The legal expert system also contains:

- means 124 for labeling user questions expressed in natural language, which extract separate question topics from the user questions and group the topics into several independent groups by similarity. The user question labeling means 124 are implemented in an assessor block that provides the assessor functionality;

- label validation means 126 that implement cross-validation algorithms and are configured to select, for each separate question topic, the most probable group among those proposed;

- means for grouping labeled data, used by at least the question preprocessing and classification module (and, in a particular case, forming part of it), which classify question topics by formal features of similarity, where a specific answer is used for a specific user question;

- a dialog manager (dialog block, means of (automated) dialog with the user) 137, configured to receive a question from the user 111 in natural language, perform semantic analysis of the question text taking into account the previously determined topic, and form an answer to the user's question, where the final answer of the system (produced using the answer generation module) is based on the user's clarifications obtained during interaction with the dialog manager 137. The system's questions are formed first in a format that ensures the most concise possible answer and then in a format the user can understand given the previous clarification; the topic is the name of a separate group of questions, for example “Division of spousal property”. The dialog manager 137 can also convert questions (including parts of questions, for example phrases) into a format suitable for automated analysis by the described system, and extract the question topic from the converted question.

The question preprocessing and classification module uses (in a particular case, contains) the user question labeling means 124 and the label (cross-)validation means 126.

In additional embodiments of the invention, forming the answer requires clarifications, i.e. a description of the situation to which the question relates, and an additional clarification system is used that poses questions to the user. The questions are formed first in a format that ensures the most concise possible answer and then in a format the user can understand given the previous answer.

In another additional embodiment of the invention:

- concept sorting is duplicated for systematization, for which a human verification system is used in parallel with machine systematization;

- a quality parameter is set for each of the systematizers, in a particular case by voting, and the grouping results are accepted taking into account the probability of a correct answer;

- grouping is performed both directly and indirectly: for direct grouping the expert formulates the correct answer, and for indirect assessment the quality of a proposed answer corresponding to one of the topics is determined. In a particular case, the probability of assessment quality is determined for each expert by comparing that expert's answers with the answers of other experts and with the results produced by the expert system. The assessment quality of a given expert is determined taking into account influencing factors, for example time of day or topic.
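The comparison of one expert's answers with those of the other experts can be illustrated with a minimal Python sketch. It assumes that every question was labeled by several experts and uses the share of an expert's answers that match the majority of the other experts as a stand-in for the quality parameter. The function name `expert_agreement` and the data layout are illustrative assumptions, not part of the described system.

```python
from collections import Counter, defaultdict


def expert_agreement(labels):
    """Estimate per-expert quality as agreement with the other experts.

    labels: dict mapping question id -> dict mapping expert id -> category.
    Returns a dict mapping expert id -> share of answers that match the
    majority category chosen by the *other* experts on the same question.
    """
    agreed = defaultdict(int)
    total = defaultdict(int)
    for answers in labels.values():
        for expert, category in answers.items():
            others = [c for e, c in answers.items() if e != expert]
            if not others:
                continue  # nobody to compare with on this question
            # Majority of the other experts; the expert never votes on
            # their own answer. Ties resolve to the first category seen.
            consensus = Counter(others).most_common(1)[0][0]
            total[expert] += 1
            agreed[expert] += int(category == consensus)
    return {expert: agreed[expert] / total[expert] for expert in total}


# Hypothetical labeling results for four questions and four experts.
example = {
    "q1": {"a": "refund", "b": "refund", "c": "refund", "d": "refund"},
    "q2": {"a": "refund", "b": "refund", "c": "refund", "d": "repair"},
    "q3": {"a": "exchange", "b": "exchange", "c": "exchange", "d": "exchange"},
    "q4": {"a": "refund", "b": "refund", "c": "refund", "d": "exchange"},
}
quality = expert_agreement(example)
```

To account for influencing factors such as topic or time of day, the same computation could be run separately on the subset of questions that share the factor value.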

After the experts have initially labeled the questions, the automated system is trained.

To train the automated system, the question preprocessing and classification module converts the topics (or phrases) contained in user questions into a formalized form; in particular, vectorized texts of user questions are used, with the vectorization performed by the same module. A formalized form means a representation of the topics stated in the questions as vectors, that is, rows of a fixed length. An additional parameter of a topic can be a description of the lexical construction of the question. This description is used to formulate the answer automatically, taking into account the grammatical features of the question, the subject area, and the object under discussion. For example, for the question “within what period must ‘men's shoes’ of inadequate quality be returned?”, the answer will contain the term, in particular the phrase, “men's shoes”. For a question containing the term, in particular the word, “boots”, the answer will contain the term “boots”.

In the course of controlling the system, questions from one area of law are selected, and the number of answer directions (categories) is determined for that area. For the question topics, a (preliminary) grouping of the data is performed, which is then used to build the automated decision-making system. The initial choice of topics is determined by the area of law to which the questions belong, for example “copyright” or “consumer protection”. The grouping takes into account possible errors caused, for example, by ambiguous interpretation of a question or by inattention on the part of the specialist performing the sorting.

Additionally, during grouping the specialists are shown possible answers to questions related to the phrase or topic, in order to assess the relevance of each answer to the question posed or to a phrase from the question.

The answers also indicate the group to which the question belongs: a correct answer that is known to belong to a particular answer group indicates that the question belongs to that group, while an incorrect answer indicates that the question does not belong to the answer group.

The unlabeled-data database 115 is labeled by performers, in particular specialists (assessors, for example lawyer-assessors) 122, using the user question labeling means 124 and the methodology disclosed above (and further in the description), which makes it possible at least to speed up data labeling for the purpose of automating subsequent user consultation.

The methodology is implemented in the assessor block, which builds a database of tasks for specialists and of the requirements for completing the tasks, including instructions (121). For the tasks so formed, the block presents the tasks to assessors, collects the results, and processes them.

To perform the data labeling, in particular to simplify it:

- a category tree is created (where the categories are, for example: Civil Code, Part 4 of the Civil Code, patents, utility model patents, civil law, consumer protection, return of goods, replacement of goods, etc.). For example, a category may be “Family law”, with the subcategories “Division of spousal property”, “Marriage”, “Determination of the child's place of residence”, “Relocation of the child”, “Guardianship”;

- performers (assessors) work with the user question labeling means 124, i.e., in a particular case, the performers are connected to the assessor functionality, in particular the labeling means 124 are used;

- labeling tasks are formed: in particular, data files are uploaded, labeling rules are selected (from the existing ones), and performers are assigned to specific tasks. The labeling rules are a list of the categories available for labeling; for each category an instruction is developed that states all the conditions (rules) for assigning a question to that category, together with a list of exceptions to those rules. For example, the data in a task can be labeled with the following categories: “Defective goods”, “Defective technically complex goods”, “Imposed service”, “Refusal of service”, etc. For the category “Defective goods”, an instruction is created containing the conditions for assigning a question to this category, for example: defective goods are goods under Part 1 of Article 18 of the Law of the Russian Federation “On Protection of Consumer Rights”, excluding technically complex goods; the user (client) asks within what period the money will be refunded after rejecting defective goods; the client asks what to do if defective goods are under repair for a long time; etc. The number of conditions varies with the category and the complexity of the questions in it;

- other conditions for completing the task are defined, for example how many times each object is shown to different performers for labeling.
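The category tree and the labeling task described in the list above can be pictured as two small data structures. The following Python sketch is only an illustration under stated assumptions: the class names `Category` and `LabelingTask`, the field names, the file name, and the round-robin assignment of objects to assessors are all hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node of the category tree, e.g. "Family law" with subcategories."""
    name: str
    children: list["Category"] = field(default_factory=list)

    def paths(self, prefix=()):
        """Yield the path from the root to every node, depth first."""
        here = prefix + (self.name,)
        yield here
        for child in self.children:
            yield from child.paths(here)


@dataclass
class LabelingTask:
    """A labeling task: data, rules (categories plus instructions), performers."""
    data_file: str
    categories: list[str]
    instructions: dict[str, str]  # category -> conditions and exceptions
    assessors: list[str]
    overlap: int = 3  # how many different assessors see each object (assumed default)

    def assignments(self, object_ids):
        """Show every object to `overlap` distinct assessors, round-robin."""
        n = len(self.assessors)
        if not 1 <= self.overlap <= n:
            raise ValueError("overlap must be between 1 and the number of assessors")
        return {
            obj: [self.assessors[(i + j) % n] for j in range(self.overlap)]
            for i, obj in enumerate(object_ids)
        }


family_law = Category("Family law", [
    Category("Division of spousal property"),
    Category("Marriage"),
    Category("Guardianship"),
])
task = LabelingTask(
    data_file="questions.csv",  # hypothetical file name
    categories=["Defective goods", "Imposed service"],
    instructions={"Defective goods": "goods under Part 1 of Article 18, "
                                     "excluding technically complex goods"},
    assessors=["assessor_1", "assessor_2", "assessor_3"],
    overlap=2,
)
plan = task.assignments(["q1", "q2", "q3"])
```

Showing each object to more than one assessor is what later makes it possible to compare the assessors' answers with one another.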

Next, through a personal account (of the performer) protected by logins and passwords, the performers gain access to the tasks, are shown the objects to be labeled, and label the data, for example by selecting categories from the category tree.

The labeled data are stored in the labeled-data database 128, in particular a local database implemented, for example, with MySQL. The database records the labels of the categories selected by the assessors, as well as information that makes it possible to account for the individual performance level of each assessor, which affects the labeling results (using the algorithm for cross-validating the labeling of lawyer-assessors, which automatically calculates the probability of each assessor answer and discards erroneous answers, as described in the present invention). Category labels can be category names, category identifiers, etc., which are saved to the database of labeled questions 128.

At the next stage, the labeling produced by different assessors is cross-validated by the label cross-validation means 126.

When the method is implemented, the probability of a correct decision is determined for each action. Incorrect answers identified by the proposed method, as well as answers that differ from known-correct answers, are excluded from further consideration but are still counted when determining the probability of receiving an incorrect answer from the corresponding specialist.
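One possible reading of this filtering step is sketched below in Python. The sketch assumes that each assessor already has a prior reliability (a probability of answering correctly), that the accepted category for a question is either a known-correct answer or the category with the largest reliability-weighted vote, and that every answer differing from the accepted category is dropped but still counted against its author. The function name `validate_labels` and the weighting scheme are assumptions made for illustration; the source does not specify the exact algorithm.

```python
from collections import defaultdict


def validate_labels(answers, gold, reliability):
    """Filter assessor answers and update per-assessor error rates.

    answers: list of (question_id, assessor_id, category) tuples.
    gold: dict question_id -> known-correct category (may be partial).
    reliability: dict assessor_id -> prior probability of a correct answer.
    Returns (accepted, kept, error_rate).
    """
    # Reliability-weighted vote for every question.
    votes = defaultdict(lambda: defaultdict(float))
    for question, assessor, category in answers:
        votes[question][category] += reliability[assessor]

    # A known-correct answer overrides the vote.
    accepted = {}
    for question, by_category in votes.items():
        accepted[question] = gold.get(question) or max(by_category, key=by_category.get)

    # Drop answers that differ from the accepted category,
    # but keep counting them as errors of their author.
    kept = []
    seen = defaultdict(int)
    errors = defaultdict(int)
    for question, assessor, category in answers:
        seen[assessor] += 1
        if category == accepted[question]:
            kept.append((question, assessor, category))
        else:
            errors[assessor] += 1
    error_rate = {assessor: errors[assessor] / seen[assessor] for assessor in seen}
    return accepted, kept, error_rate


answers = [
    ("q1", "a", "refund"), ("q1", "b", "refund"), ("q1", "c", "repair"),
    ("q2", "a", "repair"), ("q2", "b", "exchange"), ("q2", "c", "exchange"),
]
accepted, kept, error_rate = validate_labels(
    answers,
    gold={"q2": "repair"},                       # known-correct answer for q2
    reliability={"a": 0.9, "b": 0.8, "c": 0.6},  # hypothetical priors
)
```

The resulting error rates could then be fed back as updated reliabilities for the next round of labeling.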

It should also be noted that the labeling uses preselected user questions from the database 113 of questions asked on legal consultation services; that is, the database 115 stores information contained in user questions asked (and stored) on various resources, in messaging systems (text, instant, etc.), and so on. For example, a portion of all the questions asked by users can be selected for labeling: out of 3,000,000 questions asked, 2,000,000 may be selected, each of which may relate to several branches of law.

Next, a part of the questions selected for labeling, for example 300,000 questions, is labeled by the specialists 122 using the user question labeling means 124; the other questions can be labeled automatically by converting them into a format suitable for automated analysis and forming vectors for each question topic.

Data labeling, in particular the classification of questions into categories, including by vector proximity and on the basis of specified rules, involves the joint use of expert grouping of questions and the results of cluster analysis methods. To classify questions into categories, the system is trained to classify questions into the (selected) categories; in particular, algorithms can be trained to do so. Words that occur in similar contexts tend to have similar meanings, in accordance with the distributional hypothesis in linguistics. A word can be represented as a vector whose elements correspond to the number of its occurrences in a given context. The proximity of vectors determines semantic similarity. The expert grouping of questions is based on the system's analysis of a representative sample from the unlabeled-information database 115. This representative sample is a sample of finite size that has all the properties of the original population that are significant for the purposes of the study.
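The idea that words occurring in similar contexts receive close vectors can be shown with a small Python sketch. It assumes a context is a window of neighboring tokens and measures proximity with cosine similarity; the window size, the similarity measure, and the function names are illustrative choices, since the source does not fix them.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy corpus of already tokenized questions (hypothetical data).
sentences = [
    ["return", "defective", "shoes"],
    ["return", "defective", "boots"],
    ["divide", "marital", "property"],
]
vectors = context_vectors(sentences)
similar = cosine(vectors["shoes"], vectors["boots"])       # shared contexts
unrelated = cosine(vectors["shoes"], vectors["property"])  # no shared contexts
```

In this toy corpus “shoes” and “boots” share all of their contexts and come out as close, while “shoes” and “property” share none.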

The specialists (legal experts) 122 group the questions on the basis of the legislative framework and the legal cases that users encounter. Then a clustering algorithm based on the k-means method forms clusters (machine clusters) containing the question texts; the number of clusters is determined in advance from the expert grouping. Thus, the results of the assessors' labeling of questions and question topics set the number of sets, or clusters, to which the labeled questions belong. In the simplest case, for a (first) branch of law, for example “consumer protection”, the number of such clusters is 44; it is one of the parameters passed to the clustering algorithm (k-means), so that a cluster is formed for each article (for each norm of the corresponding branch of law). In addition, the correspondence of the labeled questions to one of the clusters is taken as a basis. The number is passed to the k-means algorithm as the parameter k, the number of clusters. Setting the number of clusters equal to the number of norms of the corresponding branch of law is the most convenient option, because changes in legislation in most cases lead only to refinements of the generated answers and do not require the questions to be re-clustered.
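How the expert grouping feeds the clustering can be sketched in Python as follows. The sketch assumes that k is simply the number of distinct categories the experts used, and that each machine cluster is afterwards named by the category most common among the expert-labeled questions that fell into it. Both helper functions and the majority rule are illustrative assumptions rather than the method defined in the source.

```python
from collections import Counter, defaultdict


def number_of_clusters(expert_labels):
    """k for k-means: the number of distinct categories used by the experts."""
    return len(set(expert_labels.values()))


def name_clusters(cluster_of, expert_labels):
    """Name each machine cluster after its most frequent expert category.

    cluster_of: dict question_id -> cluster index (output of clustering).
    expert_labels: dict question_id -> category, for the labeled subset only.
    Clusters that contain no expert-labeled question are left unnamed.
    """
    votes = defaultdict(Counter)
    for question, category in expert_labels.items():
        if question in cluster_of:
            votes[cluster_of[question]][category] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Hypothetical data: three expert-labeled questions, four clustered questions.
expert_labels = {"q1": "refund", "q2": "refund", "q3": "repair"}
cluster_of = {"q1": 0, "q2": 0, "q3": 1, "q4": 1}
k = number_of_clusters(expert_labels)
names = name_clusters(cluster_of, expert_labels)
```

For the consumer protection example in the text, the expert grouping would yield k = 44.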

The machine clusters are extracted from the same representative sample from the unlabeled-information database 115 as is used for the expert grouping.

The sample is preprocessed; in particular, the text is processed (including normalized) for clustering (including automated clustering) using natural language processing (NLP) technologies, which include the following operations:

1. Cleaning the questions of irrelevant characters, for example punctuation marks, HTML markup elements, etc.; in particular, the irrelevant characters are removed;

2. Tokenization of the texts: splitting sentences into individual symbols (tokens);

3. Concatenation of words with certain prepositions and particles in order to preserve the correct meaning (presence/absence) and the emotional coloring of the texts. For example, when processing the text “We do NOT agree with the results of the examination”, if the particle “NOT” is not concatenated with the word “agree”, removing irrelevant words yields the text “We agree with the results of the examination”, whose meaning is the opposite of the original;

4. Removal of irrelevant words that carry no semantic load, in particular words that can occur in any context: pronouns, interrogative words, and words that occur in more than 80% of the questions (in particular, observations) in the analyzed sample;

5. Lemmatization of tokens: reducing tokens to their (base) dictionary form; where no dictionary form is found, stemming is performed, i.e. the stem of the given source word is found.
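The five operations above can be chained into a small pipeline. The Python sketch below is a toy illustration on English text: the negation list, the stop-word list, the lemma dictionary, and the suffix-stripping stemmer are all stand-ins chosen for the example, and a real system for Russian would need a proper morphological analyzer. Only the 80% document-frequency threshold is taken from the text.

```python
import re
from collections import Counter

NEGATIONS = {"not", "no"}                                        # assumed particle list
STOPWORDS = {"we", "i", "you", "do", "what", "when", "how",
             "the", "with", "of"}                                # assumed stop list
LEMMAS = {"results": "result", "shoes": "shoe"}                  # toy lemma dictionary


def clean(text):
    """Step 1: drop HTML tags and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def tokenize(text):
    """Step 2: split into lowercase tokens."""
    return text.lower().split()


def join_negations(tokens):
    """Step 3: glue a negation particle to the following word."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def stem(token):
    """Fallback for step 5: naive suffix stripping (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Step 5: dictionary form if known, otherwise the stem."""
    return LEMMAS.get(token) or stem(token)


def preprocess(corpus, max_doc_share=0.8):
    """Run steps 1-5 over a list of raw question texts."""
    docs = [join_negations(tokenize(clean(text))) for text in corpus]
    # Step 4: tokens that occur in more than 80% of the questions are dropped.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    too_common = {t for t, n in doc_freq.items() if n / len(docs) > max_doc_share}
    return [
        [lemmatize(t) for t in doc if t not in STOPWORDS and t not in too_common]
        for doc in docs
    ]


docs = preprocess([
    "<p>We do NOT agree with the results of the examination!</p>",
    "When will the seller return the money?",
])
```

Because the negation is joined before stop words are removed, the first question keeps the token `not_agree` instead of collapsing to `agree`.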

In a particular embodiment, for each of the topics of the user questions (that is, for the constituent phrases of questions that differ in semantic content), the system performs (separate) normalization, that is, conversion (of the phrase) into a format suitable for automated analysis.

In a particular implementation, natural language processing technologies are used that analyze phrases in order to determine which topic a phrase belongs to and to extract from the phrases the key elements that support the subsequent synthesis of answer phrases, in terms of generating grammatically correct text.

Question texts are converted into binary vectors, i.e. vectorized, using a dictionary of unique terms (a dictionary of unique words and phrases); in particular, the chosen data representation is the vectorization of question texts into binary vectors (question vectors, observation vectors). The dictionary of unique terms is built by selecting, in particular forming, terms from the labeled-data database 128 (the source database). The question vectors are passed to a clustering algorithm (k-means), whose output is a list of machine clusters. The dictionary of unique terms thus consists of the dictionary forms used at the token lemmatization stage together with the forms obtained by stemming. For each phrase, the presence of unique terms in the phrase is determined and a table row is formed in which the number of cells equals the number of dictionary terms, each cell corresponds to one term of the dictionary, and the value entered in a cell indicates the presence or absence in the phrase of the term corresponding to the cell's position or number. The set of values entered in the cells is called a vector and is used to analyze the corresponding phrase. Then, for the normalized vectors, the k-means algorithm sorts the labeled questions into clusters, each question being assigned to the one cluster located at the smallest distance from the question vector. In the first step of the k-means algorithm, the data are arbitrarily partitioned into clusters, and for each cluster a center of mass is computed from the values of the data vectors in the cluster. In subsequent steps, the nearest center of mass is found for each vector and the vectors are redistributed among the clusters; these steps are repeated iteratively until a step produces no redistribution of vectors relative to the previous one.
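The vectorization and clustering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the squared Euclidean distance, the seeded shuffle used for the arbitrary initial partition, and the iteration cap are all choices made here.

```python
import random


def build_vocabulary(docs):
    """Dictionary of unique terms, in a fixed (sorted) order."""
    return sorted({term for doc in docs for term in doc})


def binary_vector(doc, vocab):
    """One table row: 1 if the term at that position occurs in the phrase, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Center of mass of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def k_means(vectors, k, seed=0, max_iter=100):
    if k > len(vectors):
        raise ValueError("k must not exceed the number of vectors")
    rng = random.Random(seed)
    # First step: arbitrary partition; round-robin over a shuffled order
    # guarantees that every cluster starts non-empty.
    order = list(range(len(vectors)))
    rng.shuffle(order)
    labels = [0] * len(vectors)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k
    centers = []
    for _ in range(max_iter):
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            # Keep the previous center if a cluster has become empty.
            new_centers.append(centroid(members) if members else centers[c])
        centers = new_centers
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:  # no redistribution: stop
            break
        labels = new_labels
    return labels, centers
```

The returned centers can then be reused to assign a new, unlabeled question vector to its nearest cluster. As with any k-means run, the result depends on the initial partition, so several seeds may need to be compared.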

The centers of mass, or clusters, are subsequently used to determine which topic unlabeled questions belong to.

With a typical initial mismatch of about 40% between machine and expert clusters, new groups are formed for the mismatched clusters on the basis of constructed semantic parsers; the parsers (242, FIG. 2) search for specific character strings predefined in dictionaries (252, FIG. 2). Character strings are words or phrases united by a common property in the source text of the question of a user 111 who asks a question, for example, on the website 153. For example, the common property of a group of questions may be "low-quality technically complex goods". Using the problem class identifier (292, FIG. 2) of the multilayer neural network, classes of legal problems can be created that, on the one hand, the said algorithm can identify with high quality and, on the other hand, share a common legislative basis, which in particular determines membership of a cluster. In a particular case, 33 such classes were obtained for the consumer protection branch of law.

To build the automated question-answering or legal consultation system, the system is trained; in particular, during training it is checked that the normative acts are up to date. For example, the normative acts contained in the knowledge base 147 are compared with the normative acts contained in a reference (regularly updated) current database (for example, a database on a remote server, on the Internet, etc.), and the data in the database 147 are updated according to the results of the comparison. A multilayer neural network, whose architecture was designed with regard to the specifics of the processed data, is used to train the system. In developing the network architecture, Monte Carlo methods were also used and numerous Monte Carlo experiments were carried out, which make it possible to select the best network parameters for training.

Subsequently, the normalization algorithms that use neural networks are applied to normalize the topics of other branches of law, which significantly saves the time and resources needed to prepare the system for use. The correctness of the vectors formed for other branches of law is checked by specialists 122 in an accelerated mode, mainly by determining the region of the corresponding cluster.

At the final stage of machine learning, a model is trained that subsequently makes it possible to predict the category of a question asked by a user.

When processing text for clustering, the following NLP operations are also used:

- cleaning the questions of irrelevant characters, for example punctuation marks, HTML elements, etc.;

- tokenization of texts: splitting sentences into separate symbols (tokens);

- concatenation of words with certain prepositions and particles to preserve the emotional coloring of the texts;

- removal of irrelevant words that carry no semantic load: pronouns, interrogative words, and words found in more than 80% of the observations in the analyzed sample;

- lemmatization of tokens: reducing them to their dictionary form; where no dictionary form is found, stemming is performed.
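The cleaning, tokenization, and lemmatization operations listed above might look as follows in Python. This is only a sketch: the tiny lemma dictionary and the suffix list are hypothetical stand-ins for a real morphological dictionary and stemmer, the example uses English tokens for readability, and particle concatenation and stop-word removal are left out for brevity.

```python
import html
import re

# Hypothetical lemma dictionary and suffix list, for illustration only.
LEMMAS = {"goods": "good", "was": "be", "bought": "buy", "returned": "return"}
SUFFIXES = ("ing", "ed", "es", "s")


def clean(text):
    """Strip HTML elements and punctuation, then lowercase."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    return re.sub(r"[^\w\s]", " ", text).lower()


def tokenize(text):
    """Split cleaned text into tokens on whitespace."""
    return text.split()


def stem(token):
    """Crude suffix stripping, used only when no dictionary form is known."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Dictionary form if available, otherwise fall back to stemming."""
    return LEMMAS.get(token) or stem(token)


def preprocess(text):
    return [lemmatize(t) for t in tokenize(clean(text))]
```

The fallback order in `lemmatize` mirrors the text: the dictionary form is preferred, and stemming is applied only when the lookup fails.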

In a particular case, decision trees are built by gradient boosting; they determine the format of the answer to a question according to the topic vectors, while the content of the answers is determined by the cluster to which the topic belongs. Among other things, the stylistic features of the question are taken into account.

Question texts are converted into binary vectors (vectorized). The conversion uses a dictionary of unique terms extracted from a sample of questions from the source database. The question vectors are passed to the k-means algorithm, which groups the questions into clusters.

As stated above, for the consumer protection topic the proposed method formed 33 classes from 44 clusters.

When questions are labeled by specialists 122, an input quality of work is determined for each specialist, expressed as the proportion of correctly labeled questions.

Table 1 shows an example of questions labeled by specialists.

Table 1.

Input quality of work | Specialist   | Question 1 | Question 2 | Question 3
0.65                  | Specialist 1 | A          | B          | B
0.91                  | Specialist 2 | A          | C          | A
0.8                   | Specialist 3 | B          | D          | B

For example, three specialists labeled Question 1. The probability of the correct answer for Question 1 is calculated by the following formula:

$P(\mathrm{A}) = \dfrac{0.65 + 0.91}{0.65 + 0.91 + 0.8} \approx 0.661$

Thus, the most likely answer to Question 1 is A, with a probability of 66%. This approach makes it possible to detect specialists' errors and, in particular, to reduce the cost of labeling by 20% while the quality of labeling increased by 40%, which improves the performance of the classification algorithms.
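One reading that reproduces the 0.661 figure from the values in Table 1 is a quality-weighted vote: the input qualities of the specialists who chose a label are summed and divided by the sum of all input qualities. The Python sketch below assumes that reading; the function name and data layout are illustrative.

```python
from collections import defaultdict


def label_probabilities(votes):
    """votes: list of (input_quality, label) pairs for a single question."""
    total = sum(quality for quality, _ in votes)
    weight = defaultdict(float)
    for quality, label in votes:
        weight[label] += quality
    return {label: w / total for label, w in weight.items()}


# Question 1 from Table 1: Specialists 1 and 2 chose A, Specialist 3 chose B.
question_1 = [(0.65, "A"), (0.91, "A"), (0.8, "B")]
probs = label_probabilities(question_1)
best = max(probs, key=probs.get)
```

Here `best` is the label with the highest weighted share, and `probs[best]` plays the role of the probability quoted in the text.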

The input quality of work of each specialist changes with every labeled question, i.e. it may increase or decrease depending on whether or not the specialist chose the most likely answer.

This stage makes it possible to build the labeled-data database 128 with problem class labels for several branches of law.

An important step is the preprocessing of text data using the text processing methods 133 described in the present invention, which include lemmatization, stemming, stop-word removal, lowercasing, and tokenization, and the model architecture 135, which includes a multilayer neural network. In addition to noise removal and vectorization, the proposed solution uses other original text processing methods.

The question preprocessing and classification module converts text into a representation that machine learning algorithms can work with. The algorithms take as input tensors of various ranks with numerical values, so the text is converted into vectors of numerical values in which each individual value is the index of a word in the dictionary.
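A minimal Python sketch of this index encoding follows. The reserved padding and unknown-word indices and the fixed output length are assumptions added here so that the vectors have a uniform shape; the description only states that each value is a word's index in the dictionary.

```python
def build_index(docs):
    """Map each distinct token to an integer index in order of first appearance."""
    # Indices 0 and 1 are reserved for padding and unknown words (assumption).
    index = {"<pad>": 0, "<unk>": 1}
    for doc in docs:
        for token in doc:
            index.setdefault(token, len(index))
    return index


def encode(tokens, index, length):
    """Convert tokens to a fixed-length list of dictionary indices."""
    ids = [index.get(t, index["<unk>"]) for t in tokens][:length]
    return ids + [index["<pad>"]] * (length - len(ids))
```

For instance, with a dictionary built from `[["return", "phone"], ["phone", "defect"]]`, the tokens `["phone", "warranty", "defect"]` encode to known indices for `phone` and `defect` and the unknown-word index for `warranty`.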

In a particular case, the question preprocessing and classification module, or at least one part of it, in particular the part that processes the source text, is implemented as a class of a programming language whose methods convert the source text into numerical vectors. The class methods use NLP algorithms (tokenization, lemmatization, stemming), as described in more detail above. The source text of the question (232), converted into a vector of numbers by the question preprocessing and classification module, is fed to the first level of the trained model, which determines whether the incoming question belongs to the target branch of law. This is a binary model that returns one of two outcomes, "0" or "1". If the model returns 1, the question is passed to the next model, which identifies the legal problem, i.e. performs multiclass classification.
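The two-level arrangement can be sketched as a small cascade in Python. The two models are passed in as plain callables; their names and signatures are hypothetical and stand in for whatever trained models the system actually uses.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[int]


def classify(
    vector: Vector,
    in_target_branch: Callable[[Vector], int],
    problem_class: Callable[[Vector], str],
) -> Optional[str]:
    """Run the binary gate first; call the multiclass model only if it returns 1."""
    if in_target_branch(vector) != 1:
        return None  # question is outside the target branch of law
    return problem_class(vector)


# Stand-in models for demonstration only.
def toy_gate(vector: Vector) -> int:
    return 1 if sum(vector) > 0 else 0


def toy_problem_class(vector: Vector) -> str:
    return "defective_goods" if vector[0] else "other"
```

Keeping the gate separate means the multiclass model is never asked to label a question that does not belong to the branch of law it was trained on.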

The output of the multiclass classification model is a list of class labels sorted in descending order of probability. The first element of this list is the most probable according to the model. During training, the places where the model makes mistakes are tracked, i.e. which categories of legal problems it finds difficult to differentiate correctly. To avoid such errors, an algorithm is used that identifies pairs of categories that look similar to the model and, in that case, makes it possible to ask the client what exactly the problem relates to. After the client's clarification, the robot, which is implemented in software and is part of the dialog manager 137, selects the final class label, which is used to select the topic and form the answer. This approach makes it possible, in particular, to raise the accuracy of the model by 15%.
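The description does not say how similar category pairs are identified. One plausible sketch, assumed here, counts the misclassifications seen on held-out data and flags a pair once it has been confused at least a given number of times; a clarifying question is then triggered when the two top-ranked labels form a flagged pair. All names and the `min_errors` threshold are illustrative.

```python
def rank_labels(probabilities):
    """Class labels sorted in descending order of probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


def confusable_pairs(true_labels, predicted_labels, min_errors=2):
    """Unordered label pairs that the model mixed up at least `min_errors` times."""
    counts = {}
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth != predicted:
            pair = frozenset((truth, predicted))
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_errors}


def needs_clarification(ranked, pairs):
    """True if the two most probable labels are known to be confused."""
    return len(ranked) >= 2 and frozenset(ranked[:2]) in pairs
```

When `needs_clarification` returns true, the dialogue can ask the client to choose between the two candidate categories before a final label is fixed.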

The system uses the dialog manager 137, in which the robot, communicating with the client and asking context-dependent questions, uses the implemented system architecture with context-dependent links to produce an answer customized for the specific user. The dialog manager 137 has an architecture with context-dependent links and models of semantic processing (semantic parsing) of text, and the dialog manager 137 is part of the module that forms answers from the knowledge database into a lexical representation corresponding to the lexical structure of the question. In a particular case, the said robot is an application capable of interacting with the modules of the system, including by imitating human actions, for example interacting with the system interfaces and with database data and performing certain operations in accordance with specified data processing algorithms.

The class label determined by the second-level model is passed as a parameter to the dialog manager 137. The dialog module of the dialog manager 137 that corresponds to the class is connected. The dialog manager 137 implements a tree structure along which the dialogue between the robot and the user proceeds. The question at each subsequent step depends on the user's answer to the previous question. The semantic analyzer extracts the necessary entities from the text of the client's question, and the answer generation module uses them as answers to at least some of the questions. This reduces the time the client spends interacting with the robot and speeds up answering the question by about 30%.
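A tree-structured dialogue with entity pre-filling might be sketched as follows. The `Node` layout, the slot names, and the `outcome` field are hypothetical; the sketch only shows how entities already extracted from the user's question let the robot skip the corresponding steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    slot: str = ""                      # fact requested at this step
    prompt: str = ""                    # question put to the user
    children: Dict[str, "Node"] = field(default_factory=dict)
    outcome: Optional[str] = None       # set on leaves: key of an answer template


def run_dialogue(
    root: Node,
    known: Dict[str, str],
    ask: Callable[[str], str],
) -> Tuple[List[Tuple[str, str]], Optional[Node]]:
    """Walk the tree; ask the user only for slots not already extracted."""
    path, node = [], root
    while node is not None and node.children:
        answer = known.get(node.slot)
        if answer is None:
            answer = ask(node.prompt)
        path.append((node.slot, answer))
        node = node.children.get(answer)  # next question depends on this answer
    return path, node
```

`known` holds the entities produced by the semantic analyzer, and `ask` is whatever function puts a prompt to the user and returns the reply. The walk ends at a leaf, or at `None` if the reply matches no branch.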

The answers of the said robot, presented to users by the answer generation module, are formed by the system dynamically: after the dialogue with the client (user), the answer generation module accesses the data of the knowledge base 147, which is kept up to date and corresponds to the legislation in force. Thus, when laws change, the knowledge base 147 is updated and, consequently, the robot's answers change, i.e. the answer generation module forms answers taking into account the updated (and processed) information contained in the knowledge database 147.

FIG. 3 shows an exemplary embodiment of the present invention.

In step 310, the labeled-data database (database of arbitrary questions) 128 is formed using data from the database of unlabeled information (data) 115, as described in the present invention. The database of unlabeled information 115 contains user questions and consultants' recommendations in natural language. Data may be added to the database of arbitrary questions from existing databases that store questions asked by users on various Internet resources and in various applications (computer applications, smartphone applications, etc.), together with the answers of specialists (consultants) to those questions. In addition, user questions asked through the dialog manager 137, which the described system answers using the answer generation module, may be added to the labeled-data database 128 after processing by the system.

The labeled-data database 128 is formed, including the labeling of the unlabeled-data database 115, by the system and by specialists using the question preprocessing module, which uses at least the user question labeling tools 124.

In step 315, questions of one area of law are selected, for which the number of answer directions (categories) is determined, as described in the present invention. Specialists are presented with possible answers to questions relating to a phrase or topic in order to assess the relevance of the answer to the question posed or to a phrase from the question, and the answers also indicate the group to which the question belongs.

Specialists, using the data processing tools provided by the system, together with the system itself, extract separate question topics from the user questions and group the topics into several independent groups by similarity; the data for the question topics are sorted to eliminate errors made by specialists. The user question labeling tools 124 extract, and allow specialists to extract, separate question topics from user questions and to group the topics into several independent groups by similarity.

The labeling (cross-)validation tools 126 make it possible to select, for each separate question topic, the most probable group among those proposed.

In step 320, the texts of the user questions are vectorized, as described in the present invention.

The question classification module converts the topics contained in user questions into a formalized form; in particular, it vectorizes the texts of user questions. The texts of user questions are converted into binary vectors using a dictionary of unique words and phrases (terms). The vectorization of texts (user questions) is carried out by representing unique words and phrases as a table that contains the unique terms (words and phrases) and indicates the presence or absence of each in the text of each question, the set of table values forming a vector. For each phrase, the presence of unique terms in the question phrase is determined and a table row is formed in which the number of cells equals the number of dictionary terms, so that each cell corresponds to one dictionary term and the value entered in the cell indicates the presence or absence in the phrase of the term corresponding to the cell's position or number. The set of values entered in the cells is a vector and is used to analyze the corresponding phrase, including for classifying questions, as described in the present invention.

When the question texts are processed, in a particular case both when the database of arbitrary questions is formed and when questions asked by users are processed during the user's dialogue with the dialogue manager via a computing device, irrelevant characters are removed from the question text, including at least punctuation marks, HTML markup elements, etc. The texts are also tokenized, with parts of the text, for example sentences, divided into separate parts, in particular symbols, which serve as tokens, and words are concatenated with prepositions and particles to preserve the emotional coloring of the texts. Irrelevant words that carry no semantic load are also removed, including pronouns, interrogative words, and other words as described in the present invention. The tokens are also lemmatized, i.e. reduced to their dictionary form, and, if the dictionary form is not found, stemming is performed by finding the stem of the given source word.
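The processing steps listed above can be illustrated with a small sketch. Everything named here is an assumption made for the example: the toy word lists `STOP_WORDS`, `PARTICLES`, `LEMMAS` and `SUFFIXES`, the underscore used to join a particle to the next word, and the crude suffix stripping that stands in for a real stemmer. A working system would rely on full linguistic dictionaries for the target language.

```python
import re

# Hypothetical toy resources, for illustration only.
STOP_WORDS = {"i", "my", "what", "how", "can", "the", "a", "to", "do"}
PARTICLES = {"not", "no"}
LEMMAS = {"refuses": "refuse", "paid": "pay", "wages": "wage"}
SUFFIXES = ("ing", "ed", "s")


def clean(text):
    """Drop HTML tags and punctuation, then lowercase the rest."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.lower()


def join_particles(tokens):
    """Glue a particle to the following word so that a negation stays one token."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in PARTICLES and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def normalize(token):
    """Use the dictionary form if known; otherwise fall back to suffix stripping."""
    if token in LEMMAS:
        return LEMMAS[token]
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Clean, tokenize, join particles, drop stop words, then lemmatize or stem."""
    tokens = clean(text).split()
    tokens = join_particles(tokens)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [normalize(t) for t in tokens]


result = preprocess("<p>My employer refuses to pay wages, what can I do?</p>")
```

The order of the calls in `preprocess` follows the order in which the text lists the operations; whether the actual system applies them in exactly this order is not stated.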

In step 325, the questions are classified into categories by the proximity of their vectors, based on specified rules. The means for grouping labeled data, which in particular are part of the question classification module, perform, and allow specialists to perform, classification of questions by topic according to formal signs of similarity. At a minimum, classification is carried out by grouping the questions using the norms and rules of the legal documents in force from the knowledge database, together with cluster analysis, during which clusters are formed whose number equals the number of groups formed by experts for each branch of law. That number, together with the question vectors and the dictionary of unique words, is processed by cluster analysis, so that classes are formed for the branches of law. For clustering, the system uses the k-means algorithm, which sorts the questions labeled by specialists into clusters, each question being assigned to the one cluster located at the smallest distance from the question vector.
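A minimal k-means sketch of the clustering step is given below. The text states only that k-means is used, that the number of clusters equals the number of expert-formed groups, and that each question goes to the nearest cluster. Seeding the centroids from chosen example vectors, the squared Euclidean distance, and all names (`k_means`, `nearest`, `initial_centroids`) are assumptions made for this illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid at the smallest distance from the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def k_means(vectors, initial_centroids, max_iter=100):
    """Plain k-means; k is the number of initial centroids supplied."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean(members)
    return labels, centroids


# Hypothetical binary question vectors over a four-term dictionary.
vectors = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
# Two expert-formed groups are assumed, so two centroids are seeded.
labels, centroids = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
```

Once the centroids are fixed, a new question vector can be assigned with `nearest(question_vector, centroids)`.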

Also, in particular in an intermediate step, when a user's question enters the system, questions are converted into a format suitable for automated analysis. During this processing of the questions (the texts of the users' questions), irrelevant characters are removed from the question text, the texts are tokenized, words are concatenated with prepositions and particles to preserve the emotional coloring of the texts, irrelevant words that carry no semantic load are removed, and the tokens are lemmatized or stemmed. In a particular implementation, for each of the topics, that is, for the constituent parts of users' questions (for example, phrases, word combinations, etc. that have different semantic content), the phrase may be converted into a format suitable for automated data processing, including analysis. In a particular case, the conversion of questions into a format suitable for automated analysis is carried out by automated processing of the questions with the formation of a vector representation for each question.

In step 335, the user's question is lexically parsed and the lexical structure of the user's question is determined. Lexical parsing is performed using at least one semantic parser. The parsers search the questions for character strings predefined in dictionaries, where the character strings are words or word combinations united by a common property in the source text of the user's question. In this way, classes of legal problems are created that are identified by the algorithm and share a common legislative basis, which determines whether the question belongs to a cluster.
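The dictionary lookup performed by such a parser might look like the sketch below. The class names, the dictionary contents and the function `parse` are hypothetical. Plain substring matching is used only to keep the example short, so a string could also match inside a longer word.

```python
# Hypothetical dictionaries: each class of legal problem lists the
# character strings (words or word combinations) that signal it.
PROBLEM_DICTIONARIES = {
    "unpaid_wages": ["salary", "wages", "did not pay"],
    "deposit_return": ["deposit", "security deposit"],
}


def parse(question):
    """Return, per problem class, the dictionary strings found in the question."""
    text = question.lower()
    found = {}
    for problem_class, strings in PROBLEM_DICTIONARIES.items():
        hits = [s for s in strings if s in text]  # naive substring search
        if hits:
            found[problem_class] = hits
    return found


structure = parse("My employer did not pay my wages for March")
```

The returned mapping is one possible stand-in for the "lexical structure" of the question: it records which predefined strings were found and which class of legal problem they point to.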

In step 340, the provision of the regulatory legal act contained in the knowledge database that is relevant to the user's question is determined from the vectors of the regulatory legal acts. For example, provisions of regulatory legal acts from the knowledge database may be compared with the user's question; in particular, the (pre-)vectorized provisions of regulatory legal acts are compared with the vectorized user questions. Questions may also be grouped by the proximity of their vectors and by answer direction.
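One way to compare a vectorized question with pre-vectorized provisions is sketched below. The text does not name a similarity measure, so cosine similarity is an assumption here, as are the provision identifiers, the vectors and the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_relevant(question_vector, provision_vectors):
    """Return the identifier of the provision whose vector is closest to the question."""
    return max(provision_vectors,
               key=lambda pid: cosine(question_vector, provision_vectors[pid]))


# Hypothetical identifiers and vectors over a shared four-term dictionary.
provision_vectors = {
    "provision_A": [1, 1, 0, 0],
    "provision_B": [0, 0, 1, 1],
}
best = most_relevant([1, 0, 0, 0], provision_vectors)
```

Because the provisions are vectorized in advance, only the question vector has to be computed when a user's question arrives.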

In step 345, the lexical structure of the answer to the user is formed, corresponding to the lexical structure of the user's question and to the provision of the regulatory legal act relevant to the user's answer. The system processes and takes into account the grammatical features of the question, the subject area, and the object of discussion, in particular of the question.

The dialogue manager 137, using automated means of communicating with the user, in particular a software robot, conducts a text dialogue with the user, asking context-dependent questions by means of the implemented system architecture with context-dependent links, and forms an answer tailored to the user's question. The answers are formed by the system dynamically. In particular, during the dialogue with the user (client), the answer generation module requests the information needed to process the user's question and to form the answer at least from the updatable knowledge base (database) 147 and from the database of labeled data 128, in which the information is corrected automatically by the system and by specialists, including on the basis of user questions and system answers.

In step 350, the lexical structure of the answer is presented to the user in the form of a message, for example a text message. Such a message can be presented to the user who asked the question on a website, in an application, etc., displayed on the screen of a computing device.

Also, in a particular case, while user questions are processed, including when the answer to a user's question is formed, it is determined whether the grouping of questions by vector proximity corresponds to the grouping of questions by answer direction. If the two groupings do not correspond, the parameters for converting user questions into a format suitable for automated analysis are changed. If the two groupings correspond, those conversion parameters are used to convert both user questions and the provisions of regulatory legal acts into a format suitable for automated analysis, forming the vectors of the regulatory legal acts.
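The consistency check between the two groupings can be sketched as follows. The text does not say how correspondence is measured or which parameters are changed. This example assumes the strictest reading, namely that both groupings must split the questions into exactly the same groups, and uses a hypothetical parameter `min_word_length` and a stub `fake_cluster_with` in place of real vectorization and clustering.

```python
def groupings_match(cluster_labels, direction_labels):
    """True if both groupings split the questions into the same partition,
    regardless of how the groups are named."""
    forward, backward = {}, {}
    for c, d in zip(cluster_labels, direction_labels):
        if forward.setdefault(c, d) != d or backward.setdefault(d, c) != c:
            return False
    return True


def choose_parameters(candidates, cluster_with, direction_labels):
    """Try candidate conversion parameters until the groupings correspond."""
    for params in candidates:
        if groupings_match(cluster_with(params), direction_labels):
            return params
    return None  # no candidate produced a corresponding grouping


# Hypothetical stand-in for "vectorize with these parameters, then cluster".
def fake_cluster_with(params):
    return [0, 1, 0, 1] if params["min_word_length"] < 3 else [1, 1, 0, 0]


# Answer directions assigned to four questions (illustrative labels).
directions = ["wages", "wages", "deposit", "deposit"]
chosen = choose_parameters(
    [{"min_word_length": 2}, {"min_word_length": 3}],
    fake_cluster_with,
    directions,
)
```

The parameters returned by `choose_parameters` would then be reused to vectorize both the user questions and the provisions of the regulatory legal acts, as the paragraph above describes. A softer agreement measure with a threshold would be an equally plausible reading.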

FIG. 4 shows a simplified example of a hardware implementation of the proposed invention. A computer network may be a geographically distributed collection of nodes interconnected by communication links and segments for transporting data between end nodes, such as personal computers and workstations, or peripheral devices, such as printers or scanners. Many types of networks are available, ranging from local area networks (LANs) to wide area networks (WANs). As shown in FIG. 4, an example computer network 900 may comprise a plurality of network devices, such as routers, switches, computers, and the like, interconnected by communication links. For example, the plurality of network devices may interconnect one or more user devices 910 (or “client devices” 910) that may be used by a user, such as computers, smartphones, tablets, etc. The communication links interconnecting the various network devices may be wired links or shared media (e.g., wireless links), where certain devices may communicate with other devices based on distance, signal strength, current operational status, location, and the like. The communication links may interconnect the various network devices in any feasible configuration. One or more servers 920 (e.g., host servers, web servers, databases, etc.) may communicate with the network 900 and thus with the plurality of client devices 910. Those skilled in the art will understand that any number and arrangement of nodes, devices, links, etc. may be used in the computer network, and that the illustration shown in FIG. 4 is a simplified example of a hardware implementation of the system.

FIG. 5 shows an example of a computing system suitable for implementing elements of the proposed invention. As shown in FIG. 5, the computing system comprises a computing device 1000, which can be used as a user device or a server and in whose memory 1040 an operating system 1042, computer programs 1044, and data structures 1045 are stored. The memory 1040 of the device 1000 is connected by a bus 1050 to a power supply 1060, at least one processor 1020, and at least one network interface 1010, which transmits data to the network 900 (FIG. 4) and receives data from the network 900 (FIG. 4).

In conclusion, it should be noted that the information given in the description consists of examples that do not limit the scope of the present invention as defined by the claims. A person skilled in the art will understand that other embodiments of the present invention may exist that are consistent with the spirit and scope of the present invention.

Claims

1. A method for controlling an automated legal advice system comprising a knowledge database that contains a set of norms and rules of the legal documents in force, a question preprocessing and classification module that produces a vector representation of the essence of the questions, and a module for generating answers from the knowledge database in a lexical representation corresponding to the lexical structure of the question, wherein

a database of arbitrary questions is formed,

questions of one area of law are selected, for which the number of answer directions is determined in the form of question categories,

the question texts are vectorized into binary question vectors by representing the unique words in the form of a table that contains the unique words and indicates the presence or absence of each unique word in the text of each question, the set of table values forming a vector,

the questions are classified into categories by the proximity of their vectors by grouping the questions using the norms and rules of the legal documents in force stored in the knowledge database, together with cluster analysis, during which clusters are formed whose number equals the number of groups formed by experts for each branch of law, that number, together with the question vectors and the dictionary of unique words and word combinations, being processed by cluster analysis, so that classes are formed for the branches of law,

and when a user question enters the system, the user question is lexically parsed and the lexical structure of the user question is determined,

on the basis of the classes formed for the branches of law, the provision of the regulatory legal act contained in the knowledge database that is relevant to the user's question is determined from the vectors of the regulatory legal acts,

the lexical structure of the answer to the user is formed, corresponding to the lexical structure of the user's question and to the provision of the regulatory legal act relevant to the user's answer,

and the lexical structure of the answer is presented to the user in the form of a text message.

2. The method according to claim 1, in which users' questions are converted into a format suitable for automated analysis by means of automated processing of questions with the formation of a vector representation for each question.

3. The method according to claim 2, in which it is determined whether the grouping of questions by vector proximity corresponds to the grouping of questions by answer direction; if the two groupings do not correspond, the parameters for converting user questions into a format suitable for automated analysis are changed; and if the two groupings correspond, those conversion parameters are used to convert the user questions and the provisions of regulatory legal acts into a format suitable for automated analysis, forming the vectors of the regulatory legal acts.

4. The method according to claim 2, in which the user questions converted into a format suitable for automated analysis are represented as query arguments, and the descriptions of the answers are represented as function values.

5. The method according to claim 1, in which, in addition:

- group questions according to the proximity of vectors,

- group questions according to directions of answers.

6. The method according to claim 1, in which when processing the texts of the questions:

- remove irrelevant characters from the question text, including at least punctuation marks, HTML markup elements;

- perform tokenization of texts, dividing sentences into separate characters that are tokens;

- perform the concatenation of words with prepositions and particles to preserve the emotional coloring of the texts;

- remove irrelevant words that do not carry a semantic load, including pronouns, and interrogative words, and words that occur in more than 80% of the questions;

- perform lemmatization of tokens, bringing the tokens to a dictionary form;

- perform stemming by finding the word base for a given source word, if the dictionary form is not found.

7. The method according to claim 6, in which the dictionary of unique words consists of dictionary forms obtained in the process of lemmatization of tokens and / or stemming.

8. The method according to claim 1, in which, for each phrase, the presence of unique words and word combinations in the question phrase is determined and a table row is formed in which the number of cells equals the number of words and word combinations in the dictionary, so that each cell corresponds to one of the dictionary words or word combinations, identified by the position or number of the cell, and the value entered in the cell indicates the presence or absence of the corresponding word or word combination in the phrase, where the set of values entered in the table cells is a vector and is used to analyze the corresponding phrase.

9. The method according to claim 1, in which the clustering uses the k-means algorithm, which sorts the questions labeled by experts into clusters, each question being assigned to the one cluster located at the smallest distance from the question vector.

10. The method according to claim 1, in which lexical parsing is carried out using at least one semantic parser, the parsers searching for strings of characters predefined in dictionaries, where strings of characters are words or phrases united by a common property in the source text of the user's question, so that they create classes of legal problems that are identified by the algorithm and have a common legislative justification that determines whether the question belongs to the cluster.