RU2769427C1

RU2769427C1 - Method for automated analysis of text and selection of relevant recommendations to improve readability thereof

Info

Publication number: RU2769427C1
Application number: RU2021109235A
Authority: RU
Inventors: Анатолий Владимирович Буров; Максим Олегович Ильяхов
Original assignee: Анатолий Владимирович Буров; Максим Олегович Ильяхов
Priority date: 2021-04-05
Filing date: 2021-04-05
Publication date: 2022-03-31

Abstract

FIELD: physics.

SUBSTANCE: invention relates to a method for automated analysis of text and selection of relevant recommendations to improve its readability. Method comprises obtaining source text; detecting stop words in a source text in accordance with given rules for determining stop words through morphological, lexical, semantic and syntactic analysis of the source text based on application of fuzzy search, search by initial forms of words, search by grammatical features and search by punctuation; evaluating the readability of the source text based on calculating the ratio of the number of all words in the source text and the number of stop words detected in the source text; displaying relevant recommendations for improving the readability of the text, including at least an indication of the rule for determining stop words, according to which words in the source text were determined as stop words.

EFFECT: high accuracy of generating recommendations to improve readability of text.

5 cl, 7 tbl

Description

Technical field

The invention relates to digital data processing by means of electrical devices, in particular to the processing of natural-language data, namely semantic, morphological and lexical analysis of natural language.

Glossary

To ensure that the disclosure of the invention is sufficient and to enable an information search on the claimed technical solution, the terms used in the description of the claimed invention are listed below.

Exact search is a method of information retrieval in which information is matched exactly against a given search pattern.

Fuzzy search is a search for information in which information is matched against a given search pattern or a value close to it.
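The difference between the two kinds of search defined above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the similarity measure (Python's standard `difflib.SequenceMatcher`) and the 0.8 threshold are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def exact_search(words, pattern):
    """Return positions of words that match the pattern exactly."""
    return [i for i, word in enumerate(words) if word == pattern]


def fuzzy_search(words, pattern, threshold=0.8):
    """Return positions of words whose similarity to the pattern
    reaches the (assumed) threshold; exact matches are included."""
    hits = []
    for i, word in enumerate(words):
        similarity = SequenceMatcher(None, word.lower(), pattern.lower()).ratio()
        if similarity >= threshold:
            hits.append(i)
    return hits


words = ["the", "qualty", "of", "quality", "service"]
exact = exact_search(words, "quality")  # only the correctly spelled word
fuzzy = fuzzy_search(words, "quality")  # also the misspelled "qualty"
```

Here the exact search finds only the word at position 3, while the fuzzy search also finds the misspelled word at position 1.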

A linguistic token is an object that describes an element of a natural-language sentence in terms of its morphological and other linguistic characteristics, including, but not limited to, part of speech, case, gender, number, person and tense, and that also contains the initial form of the word.

Regular expressions are a formal language for searching for and manipulating substrings in text, based on the use of metacharacters. The search uses a pattern string that consists of characters and metacharacters and defines the search rule.
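As a brief illustration of a pattern string built from characters and metacharacters, the following sketch uses Python's standard `re` module. The sample sentence and the choice of the intensifier "very" are assumptions made for the example and do not come from the patent.

```python
import re

text = "This is a very good and very unique offer."

# Pattern string: the literal characters "very" combined with metacharacters
# (\b = word boundary, \s+ = one or more whitespace characters,
#  \w+ = one or more word characters).
pattern = re.compile(r"\bvery\s+\w+")

# Searching: find every substring that satisfies the rule.
matches = pattern.findall(text)

# Manipulating: remove the intensifier together with the whitespace after it.
cleaned = re.sub(r"\bvery\s+", "", text)
```

The search returns the two fragments "very good" and "very unique", and the substitution yields "This is a good and unique offer."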

Superregular expressions (linguistic regular expressions) are a formal language for describing chains of linguistic tokens, together with a mechanism for searching for such chains. The mechanism relies on a language for describing the characteristics of individual linguistic tokens and a language for specifying chains of linguistic tokens.
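The idea of matching chains of linguistic tokens, rather than raw characters, can be sketched as follows. The patent text quoted here does not give the syntax of superregular expressions, so everything in this example is hypothetical: the `Token` class, the representation of a chain as a list of feature dictionaries, and the hand-annotated sample sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical linguistic token: surface text, initial form, features."""
    text: str
    lemma: str                     # initial form of the word
    pos: str                       # part of speech
    features: dict = field(default_factory=dict)  # case, number, tense, ...


def token_matches(token, spec):
    """Check one token against a dictionary of required characteristics."""
    for key, expected in spec.items():
        if key in ("text", "lemma", "pos"):
            actual = getattr(token, key)
        else:
            actual = token.features.get(key)
        if actual != expected:
            return False
    return True


def find_chains(tokens, chain):
    """Return (start, end) index pairs where consecutive tokens fit the chain."""
    n = len(chain)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(t, s) for t, s in zip(tokens[i:i + n], chain))
    ]


# Hand-annotated tokens for "The committee made a decision".
tokens = [
    Token("The", "the", "DET"),
    Token("committee", "committee", "NOUN", {"number": "sing"}),
    Token("made", "make", "VERB", {"tense": "past"}),
    Token("a", "a", "DET"),
    Token("decision", "decision", "NOUN", {"number": "sing"}),
]

# Chain: any form of "make" + a determiner + the noun "decision".
chain = [{"lemma": "make"}, {"pos": "DET"}, {"lemma": "decision", "pos": "NOUN"}]
found = find_chains(tokens, chain)
```

Because the chain refers to initial forms and grammatical features, it matches "made a decision" (tokens 2 to 4) even though the surface form "made" differs from "make".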

API: an application programming interface, that is, a set of classes, procedures, functions, structures or constants through which one computer system can interact with another, together with the way of using these elements through some protocol.

Affixes are morphemes that attach to the root of a word and serve to form words.

A word form is a chain of phonemes that has the features of a word and is formed from a single lexeme.

Stop words are text fragments that, with high probability, carry no meaning for the reader with respect to the purpose of the particular text, or that should be reformulated or supplemented following analysis of the source text.

Stop words generally include words and phrases from the following notional categories: advertising, newspaper and everyday stock phrases, clichés, set expressions, officialese (bureaucratic language), imprecise wording, biased assessments, qualitative adjectives, phrases with verbal nouns, excessive generalizations, pleonasms, vague wording, redundant indications of time, misused borrowings, obscene expressions and euphemisms.

Stop words also include a number of unjustified syntactic constructions that make the text harder to understand, in particular repetitions, the passive voice, parenthetical constructions, modality and other constructions whose removal does not affect the meaning of the sentence but makes the text easier to read. Any word or phrase that can be removed from the text without loss of meaning can be treated as a stop word.

Stop words also include generalizing and evaluative concepts that need to be spelled out in more detail so that the author's thought becomes clearer to the reader.

An editor is a person who writes, checks and corrects (edits) text. While editing, the editor works through the stop words: removes them, spells out the concepts hidden behind them, or deliberately leaves them where their use is justified and removal would cause a loss of meaning. After such processing the text becomes clearer, shorter, more informative and easier to read; that is, its readability increases. As a result, the reader spends less time reading the text and grasps the meaning of what is written more quickly.

Readability of a text is a measure of how accessible a written text is to understanding, determined by analyzing a number of factors, including syntactic complexity, vocabulary, how clearly the topic is expressed, the coherence of topics, and so on (https://psychology_dictionary.academic.ru/9253). English-language literature uses the term "readability" for this concept. To improve the readability of a text, editors use editing techniques that help clear the text of stop words and fill it with useful information.

A formant is a morpheme that attaches to a root and serves to form words.

Flexion is a complex of grammatical categories expressed through inflection; the set of morphemes that carry out inflection.

Inflection is the changing of words according to their grammatical forms.

State of the art

In 1920, the book "The Elements of Style" by Professor William Strunk Jr. was published in the USA. Strunk set out rules on how to write clearly and avoid common mistakes:

"Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that every word tell."

In 1972, the book "The Word Living and Dead" by the editor and translator Nora Gal was published in the USSR. Gal likewise advises avoiding officialese, using verbs instead of verbal nouns, replacing an official tone with a plain one, using Russian words instead of borrowed ones, writing to the point and getting to the essence quickly.

Thus, over the past century, editing practice gradually developed techniques that help clear a text of superfluous words and fill it with useful information.

With the development of computer technology, systems and methods for automated text analysis and processing began to appear, designed to increase the readability of the processed text.

The claimed technical solution performs semantic, lexical, morphological and syntactic analysis of text. Here, semantic analysis means analysis aimed at determining the meaning of a particular part of the text; lexical analysis means analysis of lexemes in order to find words; morphological analysis means analysis of parts of speech; and syntactic analysis means analysis of punctuation marks.

The patent search revealed documents that define the general state of the art and are not considered particularly relevant to the claimed invention, namely:

"Method for automated analysis of text documents" (patent for invention No. 474870 RU, Application: 2011146888/08, 18.11.2011, Patent holder: Natalya Kasperskaya Innovation Center LLC (RU));

"Automatic extraction of named entities from text" (patent for invention No. 2665239 RU, Application: 2014101126 dated 15.01.2014, Patent holder: ABBYY Production LLC (RU));

"Extraction of entities from natural-language texts" (patent for invention No. 2626555 RU, Application: 2015151699, 02.12.2015, Patent holder: ABBYY Production LLC (RU));

"Aspect-level sentiment analysis using machine learning methods" (patent for invention No. RU 2 657 173, Application: 2016131180, 28.07.2016, Patent holder: ABBYY Production LLC (RU));

"Method for sentiment analysis of text data" (patent for invention No. 2571373 RU, Application: 2014112242/08, 31.03.2014, Patent holder: ABBYY InfoPoisk LLC (RU));

"Suggesting relevant terms during text entry" (patent for invention No. 2589727 RU, Convention priority: 01.11.2010 US 61/408,699, Patent holder: Koninklijke Philips Electronics N.V. (NL));

"Readability evaluation method, readability evaluation device and readability evaluation program" (JP2012230652, ISUZU MOTORS LTD, Office: Japan, Application number 2011100191, Filing date 27.04.2011, Publication number 2012230652, Publication date 22.11.2012, Patent grant number 5733617, Patent grant date 24.04.2015);

"Text normalization method, device and equipment and storage medium" (CN110765733, IFLYTEK CO., LTD., Office: China, Application number 201911017291.4, Filing date 24.10.2019, Publication number 110765733, Publication date 07.02.2020);

"System and method for enhancing comprehension and readability of text" (GB2514725, QUILLSOFT LTD, Office: United Kingdom, Application number 201416621, Filing date 22.02.2013, Publication number 2514725, Publication date 05.11.2014).

Technical solutions containing individual features used in the claimed invention, or their equivalents, were also found and examined.

"Method for identifying insignificant lexical units in a text message, and computer" (patent for invention No. 2580424 RU, Application: 2014147903/08, 28.11.2014, Patent holder: Yandex LLC). This method relates to systems for processing an incoming e-mail message intended for a user. The technical result is the ability to identify insignificant lexical units in the text of an e-mail message. This result is achieved by parsing the e-mail message to determine a lexical unit as a candidate insignificant lexical unit, and by performing a first and a second check of the candidate by comparing it with insignificant lexical units from a first and a second database of lexical units. The first database is built by parsing previous e-mail messages intended for the user, and the second by parsing previous e-mail messages intended for a group of users from a plurality of users. In response to a positive result of either the first or the second check, the candidate is determined to be an insignificant lexical unit.

The above method differs from the claimed invention and has a different purpose, but to some extent it makes it possible to identify insignificant lexical units in text, as the claimed invention does. Unlike the claimed technical solution, this method uses only semantic analysis of the text and does not use lexical or morphological analysis. This method is also intended to achieve a different technical result.

"Method for displaying degree of difficulty of readability of text in word processor, involves comparing numerical readability subscripts in overview of text, so that readability assessments of passages are compared with each other" (DE102010027146, Ramps Ullrich, Office: Germany, Application number 102010027146, Filing date 09.07.2010, Publication number 102010027146, Publication date 12.01.2012) involves an application determining how difficult a text in a word processor is to read, based on quantitative linguistic units such as the letters, syllables and words in narrow passages of the text. Numerical readability indices are presented simultaneously and separately for individual passages of the analyzed text as similar self-contained language units. The indices are shown in an overview of the text, so that the readability scores of individual passages can be compared with each other.

The above method, like the claimed invention, provides an automatic assessment of the readability of a text. However, unlike the claimed invention, it assesses readability only in terms of the semantic complexity of the text and is not intended for subsequent editing of the text.

The computer system presented in "Detecting document text that is hard to read" (US08990224, Google Inc., Office: United States of America, Application number 13674320, Filing date 12.11.2012, Publication number 08990224, Publication date 24.03.2015) is configured to: identify parts of text extracted from a corresponding group of documents; process a particular part of the text with a set of filters, where the particular part of the text may correspond to a particular document and each filter may generate a respective score based on processing that part; calculate a readability score from the respective scores generated by the filters; determine that the readability score satisfies a threshold score; and generate or select a new part of text for the particular document based on that determination. Unlike the claimed invention, this technical solution is intended solely for analyzing text headings and helping the user correct them so that they match the content more closely and are easier to take in.

The apparatus and method of "Apparatus and method for improving line-to-line word positioning of text for easier reading" (US6766495, International Business Machines Corporation, Office: United States of America, Application number 09406188, Filing date 27.09.1999, Publication number 6766495, Publication date 20.07.2004) improve the readability of text in a computer system by changing the position of one or more words in order to eliminate potential readability problems that can be identified by examining the text. When a potential problem is identified, the line-to-line word positioning of the text can be adjusted by compressing one or more lines and/or expanding one or more lines so as to move one or more words to another line. For example, if two adjacent lines begin with the same word, the first line can be compressed so that the first word of the second line moves to the end of the first line. Alternatively, the first line can be expanded so that its last word becomes the first word of the second line. Selectively changing the position of words in this way can significantly improve the readability of the text.

The above technical solution, like the claimed invention, makes it possible to analyze text for repetitions. However, unlike the claimed invention, its functionality is limited to detecting repetitions, and it does not analyze the text for other opportunities to improve readability.

The natural language processing system "Readability awareness in natural language processing systems" (US20170193093, International Business Machines Corporation, Office: United States of America, Application number 15162641, Filing date 24.05.2016, Publication number 20170193093, Publication date 06.07.2017, Patent grant number 09875300, Patent grant date 23.01.2018) and the technology that develops it, "Readability awareness in natural language processing systems" (US20190179840, International Business Machines Corporation, Office: United States of America, Application number 16274663, Filing date 13.02.2019, Publication number 20190179840, Publication date 13.06.2019, Patent grant number 10534803, Patent grant date 14.01.2020), are intended to determine the readability level of a text based on a readability level indicator. Unlike the claimed invention, these systems calculate the readability level from the presence of grammatical and spelling errors and jargon terms in the text and do not take other morphological, lexical or syntactic criteria into account.

A number of other technical solutions aimed at analyzing natural language text can also be noted: "Multi-stage recognition of named entities in natural language texts based on morphological and semantic features" (invention patent No. 2619193 RU, application 2016124139, 2016-06-17, patent holder: Limited Liability Company "Abi InfoPoisk" (RU)), "Identification of collocations in natural language texts" (invention patent No. 2618374 RU, application 2015147536, 2015-11-05, patent holder: Limited Liability Company "Abi InfoPoisk" (RU)), "Text summarization method and the device and machine-readable storage medium used to implement it" (invention patent No. RU 2 635 213, application 2016138082, 2016-09-26, patent holder: Samsung Electronics Co., Ltd. (KR)), and "Detection of linguistic ambiguity in text" (invention patent No. RU 2 643 438, application 2013157757, 2013-12-25, patent holder: Limited Liability Company "Abi Production" (RU)).

The above technical solutions, like the claimed invention, make it possible to analyze text in one way or another, but the principle and order of the analysis differ significantly from those implemented in the present invention.

The technical solution disclosed in publication US20160306787, "Computer processes for analyzing and suggesting improvements for text readability" (rights holder: Wordrake Holdings, LLC, United States patent office, application No. 15191418, filed 2016-06-23, publication No. 20160306787, published 2016-10-20, patent No. 09953026), can be regarded as the prototype of the claimed invention.

The technical solution presented in US20160306787 describes a computer process for analyzing and improving the readability of documents. Readability is improved by using rules and associated logic to automatically detect various types of writing problems and to make and/or suggest changes that fix them. Many of the rules aim to make the analyzed sentences more concise, for example by removing unnecessary words, rearranging words and phrases, and making various other kinds of edits. Proposed changes can be conveyed, for example, through a word processing platform by changing the appearance of the text to show how it will look with the edits applied (or both with and without them).

The main disadvantage of the technical solution presented in US20160306787 is that the system is intended for analyzing English and cannot be used effectively for Slavic languages, including Russian. This follows from the differences between the languages themselves and their word-formation rules, which in turn call for different approaches to analyzing texts in these languages.

English belongs to the analytic languages with an agglutinative structure, in which grammatical meanings are expressed mainly by function words. Formants in English generally do not form indivisible structures and do not change under the influence of other formants. Natural English text can therefore be processed with a fixed dictionary of specific word forms, as provided in the above technical solution US20160306787.

Russian belongs to the synthetic languages with an inflectional structure, dominated by word change through inflections, that is, formants that combine several meanings at once. Systems for processing natural Russian text must therefore not only use a fixed dictionary but also take into account the morphology of words and their possible endings. Since the technical solution US20160306787 relies on a fixed dictionary, it cannot process Russian text adequately: the system is based solely on exact search, does not cover all the word forms produced by inflection and, as a result, does not react to those word forms.

In the claimed invention this technical problem is solved by fuzzy search, in which words are specified with possible variants (for example, "моё|его|её|...", which corresponds to possessive pronouns in different genders, persons and cases) or with fuzzy endings (for example, "\bярки\w{1,2} +впечатлени\w{1,3}\b", which matches the phrases "яркие впечатления" or "ярких впечатлений", i.e. "vivid impressions" in different cases). The claimed invention also uses search by initial word forms (for example, "<%на%><><%заря|восход|исход%>", which matches "на заре" ("at dawn"), "на исходе" ("at the close") and others) and search by grammatical features (for example, "<деепр>" matches adverbial participles, and "(<A|S, им><>){5,10}" specifies a chain of 5 to 10 adjectives or nouns in the nominative case). These technical features make full analysis of Russian text possible: the text analysis system takes into account not only the specific word forms recorded in a dictionary but all of their possible variations.
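As a minimal sketch of the fuzzy-ending search described above, the quoted pattern can be run with Python's standard `re` module. The helper name `find_fuzzy` and the sample sentence are illustrative assumptions, not part of the claimed method.

```python
import re

# Pattern quoted in the description: a fixed stem plus a bounded number
# of "fuzzy" ending characters, so several case forms are covered at once.
VIVID_IMPRESSIONS = re.compile(r"\bярки\w{1,2} +впечатлени\w{1,3}\b")


def find_fuzzy(text):
    """Return every fragment of `text` matched by the fuzzy-ending pattern."""
    return [m.group(0) for m in VIVID_IMPRESSIONS.finditer(text)]


# Hypothetical sample sentence with two different case forms.
sample = "нас ждут яркие впечатления, а ярких впечатлений много не бывает"
print(find_fuzzy(sample))
```

A fixed dictionary would need one entry per case form, while the bounded `\w{1,2}` and `\w{1,3}` suffixes cover them with a single rule.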

The claimed invention also differs technically from the prototype in a number of other ways. Unlike the prototype, the claimed invention can automatically evaluate text quality, can optimize text processing so that quality is re-evaluated automatically after the text is edited, and can provide an API for connecting third-party systems (for example, content management systems (CMS) and other information systems). The claimed technical solution therefore differs significantly from the one presented in the prototype.

The technical problem addressed by the present invention is to create a method that allows an editor to edit text faster and more effectively in order to improve its readability.

The technical result of the present invention is the ability to select relevant recommendations for improving text readability in an automated way.

This technical result is achieved through automated morphological, lexical and syntactic analysis of the source text, performed with regard to the features of Slavic languages as synthetic languages with an inflectional structure. The analysis is aimed at identifying stop words in the analyzed text, evaluating the readability of the text, selecting and displaying relevant recommendations for improving readability, re-analyzing sentences edited by the user, and recalculating the readability score after editing.

Disclosure of the invention

A method for automated analysis of text and selection of relevant recommendations to improve its readability, performed on a computer, comprising:

- obtaining the source text;

- identifying stop words in the source text according to specified stop-word rules by morphological, lexical, semantic and syntactic analysis of the source text, based on fuzzy search, search by initial word forms, search by grammatical features and search by punctuation;

- evaluating the readability of the source text by calculating the ratio between the number of all words in the source text and the number of stop words identified in it;

- displaying relevant recommendations for improving the readability of the text, including at least an indication of the stop-word rule under which words in the source text were identified as stop words.
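The evaluation step can be illustrated with a short sketch. The description states only that the score is based on the ratio between the total word count and the stop-word count; the exact formula is not given, so the share computed below, the function name `stop_word_share` and the sample text are assumptions made for illustration.

```python
import re

WORD = re.compile(r"\w+")


def stop_word_share(text, stop_rules):
    """Share of words in `text` covered by stop-word rules (0.0 to 1.0).

    Assumption: the rules do not overlap; overlapping rules would count
    the same word more than once in this simplified sketch.
    """
    total = len(WORD.findall(text))
    if total == 0:
        return 0.0
    stop = 0
    for rule in stop_rules:
        for match in re.finditer(rule, text):
            # A rule may match a phrase, so count the words inside the match.
            stop += len(WORD.findall(match.group(0)))
    return stop / total


# Hypothetical usage with two of the regular-expression rules from the text.
rules = [r"\bнередко\b", r"\bнемно(го|жко)\b"]
print(stop_word_share("это нередко бывает немного сложно", rules))
```

A lower share means fewer flagged words per unit of text; how that share is mapped to the score shown to the user is left open here.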

In the claimed method, the source text can be obtained by the user typing it in, pasting it, or loading it from a file. The source text can also be received from other computer systems through an API.

Such APIs may have restrictions that depend on the access parameters set for a particular user. Depending on those parameters, the following may be set for the user: the permitted size of the source text, and/or the set of rules available for analyzing the source text, and/or the content of the readability recommendations displayed to the user, and/or the number of requests allowed in a given period of time.
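One way to picture these per-user restrictions is a small record of access parameters that is checked before a request is analyzed. This is a sketch under stated assumptions: the class `AccessParams`, its field names and the function `check_request` are hypothetical and do not come from the description.

```python
from dataclasses import dataclass


@dataclass
class AccessParams:
    """Hypothetical per-user limits mirroring the restrictions listed above."""
    max_text_chars: int        # permitted size of the source text
    allowed_rules: frozenset   # rule set available to this user
    max_requests: int          # requests allowed in the current period
    used_requests: int = 0


def check_request(params, text, requested_rules):
    """Return (True, usable_rules) or (False, reason) for one API request."""
    if params.used_requests >= params.max_requests:
        return False, "request quota exceeded"
    if len(text) > params.max_text_chars:
        return False, "text too long"
    params.used_requests += 1
    # Only rules included in the user's access level are applied.
    return True, sorted(set(requested_rules) & params.allowed_rules)


params = AccessParams(max_text_chars=10,
                      allowed_rules=frozenset({"often", "topical"}),
                      max_requests=1)
print(check_request(params, "короткий", ["often", "a-little"]))
```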

Stop words in the source text are identified according to specified stop-word rules by morphological, lexical, semantic and syntactic analysis of the source text, based on fuzzy search, search by initial word forms, search by grammatical features and search by punctuation.

Stop-word rules are specified as regular expressions and superregular expressions.

To specify a stop-word rule as a regular expression, a pattern is written in the regular expression language (the dialect provided by the Python standard library is used); the pattern defines one or more character strings.

Examples:

Pattern: \bнередко\b
Explanation: "\b" denotes a word boundary. The pattern specifies the string "нередко" framed by word boundaries, i.e. positions where letter and non-letter characters meet, or sentence boundaries.
Matching strings: нередко ("often")

Pattern: \bразного +рода\b
Explanation: " +" (space, plus) specifies a sequence of one or more spaces.
Matching strings: разного рода ("of various kinds"), with one or more spaces between the two words

Pattern: \bнемно(го|жко)\b
Explanation: "(го|жко)" matches the string "го" or "жко".
Matching strings: немного, немножко ("a little")

Pattern: \bактуальн\w{1,4}\b
Explanation: "\w" matches any letter character. {1,4} specifies that the character is repeated 1 to 4 times.
Matching strings: актуальный, актуального, актуальному (case forms of "relevant")
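The four example patterns can be exercised together as a small rule list. The sketch below uses only Python's standard `re` module; the rule names and the sample sentence are illustrative assumptions. Returning the rule name with each hit corresponds to indicating the rule under which a fragment was flagged.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str      # hypothetical label shown to the user with a recommendation
    pattern: str   # regular expression taken from the examples above


RULES = [
    Rule("often", r"\bнередко\b"),
    Rule("various-kinds", r"\bразного +рода\b"),
    Rule("a-little", r"\bнемно(го|жко)\b"),
    Rule("topical", r"\bактуальн\w{1,4}\b"),
]


def find_stop_words(text):
    """Return (offset, fragment, rule name) for every rule match, in text order."""
    hits = []
    for rule in RULES:
        for match in re.finditer(rule.pattern, text):
            hits.append((match.start(), match.group(0), rule.name))
    return sorted(hits)


sample = "это нередко актуальная тема разного   рода, немножко"
for offset, fragment, rule_name in find_stop_words(sample):
    print(offset, repr(fragment), rule_name)
```

The patterns are case-sensitive as written; whether the real rules ignore case is not stated in the description.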

To specify a stop-word rule as a superregular expression, a pattern is written in the superregular expression language created as part of this invention. Such patterns define sets of chains of linguistic tokens obtained from the text being checked.

The language has four components: linguistic tokens, a language for defining grammeme-set patterns, a language for defining linguistic token patterns, and a language for defining patterns of token-pattern chains.

1. Linguistic tokens

When the text to be checked is prepared, it is split into words and interword character strings, and each word is assigned an initial form and a set of grammatical features (grammemes). If a word is ambiguous, it can be assigned several combinations of initial form and grammemes.

Examples:

Source word: другу
Initial form: друг
Grammemes: S, муж, од, дат, ед

Source word: ого
Initial form: ого
Grammemes: INTJ

Source word: анализ
Initial form: анализ
Grammemes: S, муж, неод, вин, ед

Source word: анализ
Initial form: анализ
Grammemes: S, муж, неод, им, ед
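A minimal sketch of this token structure is given below. The morphological analyser itself is not described here, so a tiny hand-written dictionary (`TOY_DICT`) stands in for it; that dictionary and the class and function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Parse:
    lemma: str             # initial form
    grammemes: frozenset   # grammatical features


@dataclass(frozen=True)
class Token:
    text: str
    parses: tuple          # several entries when the word is ambiguous


# Hypothetical stand-in for a real morphological analyser, filled with
# the three example words from the table above.
TOY_DICT = {
    "другу": [("друг", "S,муж,од,дат,ед")],
    "ого": [("ого", "INTJ")],
    "анализ": [("анализ", "S,муж,неод,вин,ед"), ("анализ", "S,муж,неод,им,ед")],
}


def tokenize(text):
    """Split text into words and interword strings and attach known parses."""
    tokens = []
    for piece in re.findall(r"\w+|\W+", text):
        entries = TOY_DICT.get(piece.lower(), [])
        parses = tuple(Parse(lemma, frozenset(g.split(","))) for lemma, g in entries)
        # Interword strings (and words missing from the toy dictionary)
        # simply carry no parses in this sketch.
        tokens.append(Token(piece, parses))
    return tokens


for token in tokenize("Ого, анализ другу"):
    print(repr(token.text), [sorted(p.grammemes) for p in token.parses])
```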

2. Language for defining grammeme sets

The grammeme-set definition language is used to write patterns against which the grammeme sets obtained after splitting a sentence into linguistic tokens are checked. The language makes it possible to find tokens with particular grammatical characteristics.

Grammar of the language in EBNF:

pattern = and_expr

and_expr = or_expr | or_expr "," and_expr

or_expr = not_expr | not_expr "|" or_expr

not_expr = "~" simple_expr | simple_expr

simple_expr = string

string = letter | letter string

letter = "A" | ... | "Z" | "a" | ... | "z" | "А" | ... | "я" | "-" | "_"

According to the grammar, the operations rank as follows, from lowest to highest precedence: AND (","), OR ("|"), NOT ("~").

Examples:

Pattern: S
Explanation: The set must contain the grammeme "S" (noun).
Matching sets: S, муж, од, дат, ед ("другу")

Pattern: S,сред
Explanation: The set must contain the grammeme "S" and the grammeme "сред".
Matching sets: S,сред,неод,вин,ед ("слово")

Pattern: S|A
Explanation: The set must contain the grammeme "S" or the grammeme "A".
Matching sets: S,муж,неод,вин,ед ("пример"); A,вин,ед,полн,муж,неод ("примерный")

Pattern: ~V
Explanation: The set must not contain the grammeme "V".
Matching sets: INTJ ("ого")

Pattern: S|A,род,~ед
Explanation: The set must contain the grammeme "S" or "A" and the grammeme "род", but must not contain the grammeme "ед".
Matching sets: S,сред,неод,род,мн ("слов")
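The examples above can be checked with a short matcher. This is a sketch rather than the actual implementation: it assumes that "," means AND, "|" means OR and "~" means NOT, with NOT binding tightest and AND loosest, as the example "S|A,род,~ед" suggests. The function names are hypothetical.

```python
def _atom_matches(atom, grammemes):
    """Check one grammeme name, optionally negated with '~'."""
    atom = atom.strip()
    if atom.startswith("~"):
        return atom[1:].strip() not in grammemes
    return atom in grammemes


def matches_grammemes(pattern, grammemes):
    """True if the grammeme set satisfies every comma-separated condition."""
    for conjunct in pattern.split(","):          # AND, lowest precedence
        alternatives = conjunct.split("|")       # OR
        if not any(_atom_matches(a, grammemes) for a in alternatives):
            return False
    return True


# Grammeme sets taken from the examples above.
slov = {"S", "сред", "неод", "род", "мн"}        # "слов"
drugu = {"S", "муж", "од", "дат", "ед"}          # "другу"
print(matches_grammemes("S|A,род,~ед", slov))
print(matches_grammemes("S|A,род,~ед", drugu))
```

Because the grammar has no parentheses, splitting first on "," and then on "|" is enough to respect the precedence order.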

3. Language for defining linguistic tokens

The linguistic token definition language is used to write patterns against which the linguistic tokens obtained from linguistic analysis of the checked text are matched.

Each pattern consists of three parts:

1. A subpattern that the token's grammeme set must match.

2. A subpattern that the token's initial form must match.

3. A subpattern that the specific character string from the checked text must match.

Each part is optional.

A linguistic token matches a pattern in the linguistic token definition language if and only if it matches all of the subpatterns that are specified.

A pattern in the token language is written according to the following grammar:

pattern = subpatterns | "!" subpatterns

subpatterns = grammar_pattern | grammar_pattern "%" subpatterns_2
| "%" subpatterns_2

subpatterns_2 = lexeme_pattern | lexeme_pattern "%" form_pattern
| "%" form_pattern

lexeme_pattern = regex

form_pattern = regex

Here grammar_pattern is a pattern in the grammeme-set definition language, and regex is the regular expression language used in Python.

In case of ambiguity, when a word form has several grammeme sets and/or several initial forms, whether the linguistic token matches depends on the "!" sign at the start of the pattern. If "!" is present, the token matches only if every variant of its grammatical parse matches the pattern. If "!" is absent, it is enough for one variant to match.

Examples:

Pattern: A|S,им
Explanation: A subpattern for grammemes is given; there are no subpatterns for the initial form or the specific form.
Matching text fragments: слово

Pattern: %заря|восход|исход%
Explanation: A subpattern for the initial form is given; there are no subpatterns for grammemes or the specific form.
Matching text fragments: зари, восходе, исходом

Pattern: %%!
Explanation: A subpattern is given for the specific form of an interword character string.
Matching text fragments: !

Pattern: %%[^,]+
Explanation: A subpattern is given for a character string that does not contain a comma.
Matching text fragments: пример
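The three-part token pattern and the "!" rule for ambiguous words can be sketched as follows. Several points are assumptions made for illustration: the function names, the use of `re.fullmatch` for the initial-form and specific-form subpatterns, the treatment of a trailing "%" as an empty subpattern, and the representation of parses as `(initial form, grammeme set)` pairs.

```python
import re


def grammemes_match(pattern, grammemes):
    """',' is AND, '|' is OR, '~' is NOT (compact grammeme-set matcher)."""
    for conjunct in pattern.split(","):
        ok = False
        for atom in conjunct.split("|"):
            atom = atom.strip()
            if atom.startswith("~"):
                ok = ok or atom[1:] not in grammemes
            else:
                ok = ok or atom in grammemes
        if not ok:
            return False
    return True


def token_matches(pattern, text, parses):
    """Match one token against '[!]grammemes%initial_form%specific_form'."""
    strict = pattern.startswith("!")
    if strict:
        pattern = pattern[1:]
    gram, lexeme, form = (pattern.split("%", 2) + ["", ""])[:3]
    if form and re.fullmatch(form, text) is None:
        return False
    if not gram and not lexeme:
        return True  # only the specific form was constrained
    results = [
        (not gram or grammemes_match(gram, grammemes))
        and (not lexeme or re.fullmatch(lexeme, lemma) is not None)
        for lemma, grammemes in parses
    ]
    # With "!" every parse variant must match; otherwise one is enough.
    return bool(results) and (all(results) if strict else any(results))


# "слово" is ambiguous between nominative and accusative.
slovo = [("слово", {"S", "сред", "неод", "им", "ед"}),
         ("слово", {"S", "сред", "неод", "вин", "ед"})]
print(token_matches("A|S,им", "слово", slovo))
print(token_matches("!A|S,им", "слово", slovo))
```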

4. Language for defining token chains (superregular expressions)

The language for defining patterns of token-pattern chains is used to write patterns against which a chain of one or more linguistic tokens, obtained by analyzing the checked text, is matched.

The language is defined by the following grammar:

superegex = part
          | part superegex
          | superegex "|" superegex

part = predicate
     | repeatable
     | repeatable "+"
     | repeatable "*"
     | repeatable "?"
     | repeatable "{" number comma number "}"
     | repeatable "{" comma number "}"
     | repeatable "{" number comma "}"
     | repeatable "{" number "+" "}"
     | repeatable "{" number "}"
     | repeatable "{" "}"
     | repeatable "{" comma "}"
     | "^"
     | "$"

repeatable = "(" superegex ")" | predicate

number = digit | digit number

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

comma = "," | ";"

predicate = "<" token_pattern ">"

Here token_pattern is the linguistic token pattern described above.

Patterns make it possible to specify sequences of linguistic tokens, indicating whether a token is optional, how many times tokens repeat, which token variants are allowed, and where a token is located (at the beginning or end of a sentence).

Syntax: Meaning

<...>: One token.

<...> <> <...> <> <...>: A sequence of five tokens. <...> can specify particular characteristics of a word, while <> between words matches any token and can therefore match punctuation marks or spaces between words.

Repetitions

<...>{n, m}: Repeat the token from n to m times.

<...>{n,} or <...>{n+}: Repeat the token n or more times.

<...>{,n}: Repeat the token up to n times.

<...>{n}: Repeat the token exactly n times.

<...>? or <...>{0,1}: The token is optional.

<...>+ or <...>{1, }: Repeat the token one or more times.

<...>* or <...>{0, }: Repeat the token zero or more times.

Grouping

( <...> <...> <...> ): Patterns can be grouped with parentheses.

Alternatives

<...> <...> | <...> <...>: If the character "|" appears outside parentheses between different patterns, a token chain matches the pattern if it matches either the left or the right side.

<...> (<...> | <...>) <...>: If "|" appears inside parentheses, it specifies alternatives for that fragment of the pattern.

Beginning and end

^ ... : Matches the beginning of the text (sentence).

... $ : Matches the end of the text (sentence).
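The repetition forms in the table can be read as pairs of lower and upper bounds. The following is a minimal sketch of that reading, not the patent's implementation: the function name `repetition_bounds` and its error handling are assumptions made for illustration, and only the forms listed in the table are accepted.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of the source: converts a repetition suffix
# into (minimum, maximum) bounds; a maximum of None means "no upper limit".
_BRACES = re.compile(r"^\{\s*(\d*)\s*([,;+]?)\s*(\d*)\s*\}$")
_SHORTHAND = {"?": (0, 1), "+": (1, None), "*": (0, None)}


def repetition_bounds(suffix: str) -> Tuple[int, Optional[int]]:
    suffix = suffix.strip()
    if suffix in _SHORTHAND:
        return _SHORTHAND[suffix]
    match = _BRACES.match(suffix)
    if match is None:
        raise ValueError(f"not a repetition suffix: {suffix!r}")
    low, separator, high = match.groups()
    if separator == "":
        # {n}: exactly n repetitions
        if not low or high:
            raise ValueError(f"unsupported form: {suffix!r}")
        return int(low), int(low)
    if separator == "+":
        # {n+}: n or more repetitions
        if not low or high:
            raise ValueError(f"unsupported form: {suffix!r}")
        return int(low), None
    # "," or ";" (the grammar allows either character as the comma)
    if not low and not high:
        raise ValueError(f"form not described in the table: {suffix!r}")
    return (int(low) if low else 0, int(high) if high else None)
```

For instance, `repetition_bounds("{5,}")` and `repetition_bounds("{5+}")` both give a lower bound of 5 with no upper bound, which mirrors the two equivalent spellings in the table.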

Examples

Pattern: Explanation

<CONJ>: Matches a chain consisting of a single conjunction.

^<CONJ>: Matches a chain consisting of a single conjunction at the beginning of a sentence.

^<V, инф>: Matches a chain consisting of a verb in the infinitive (инф) at the beginning of a sentence.

(<S><>){5,}: Matches a chain of five or more consecutive nouns separated by arbitrary inter-word characters.

<%наличие|отсутствие%><><род>: Matches a chain consisting of a word whose initial form is «наличие» ("presence") or «отсутствие» ("absence"), followed by an arbitrary string of characters, followed by a word in the genitive case (род).

<V%с?делать%><>+<твор>: Matches a verb whose initial form is «сделать» or «делать» ("to do", "to make"), followed by a chain of one or more arbitrary tokens, followed by a word in the instrumental case (твор).

The rules are grouped into the following categories:

Greeting-card cliché

Personal pronoun

Possessive pronoun

Intensifier

Generalization

Uncertainty

Subjective assessment

Officialese

Modern journalistic cliché

Information-style cliché

Everyday cliché

Newspaper cliché

Pleonasm

Time filler word

Corporate cliché

Advertising cliché

Say it in Russian

Political correctness or euphemism

Compound predicate

Possibly a poor subject

Parenthetical construction

Complex syntax

Superfluous subordination

Phrase with a verbal noun

Phrase with a modal verb

Indefinite

Prepositions

Weak verb

Chain of words in the same case

Possibly a syntax problem

Somewhat hard to read

Suspected complex subordinate construction

Suspected parcellation

Heavy parenthetical construction

Stylistic features

Featurism

Secondary syntax

Obscene expression

Euphemism

The source text is split into sentences, and each sentence is searched for stop words. The search combines fuzzy search, search by initial forms of words, search by grammatical features and search by punctuation.

In fuzzy search, stop words are detected with their possible variants (word forms) taken into account. For example, if the stop-word rules define the inflected pronouns «моё|его|её|...» ("my|his|her|...") as stop words, covering possessive pronouns in different genders, persons and cases, fuzzy search finds these pronouns regardless of their specific word form. Rules can also describe words with fuzzy endings: for example, «\bярки\w{1,2} +впечатлени\w{1,3}\b» matches the phrases «яркие впечатления» and «ярких впечатлений» ("vivid impressions" in the nominative and genitive).
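The fuzzy-ending rule quoted above is an ordinary regular expression, so it can be tried directly. The sketch below uses Python's standard `re` module; the function name `find_fuzzy` and the case-insensitive flag are assumptions added for illustration.

```python
import re
from typing import List

# The pattern is the one quoted in the text; re.IGNORECASE is an added
# assumption so that a phrase at the start of a sentence is also found.
FUZZY_RULE = re.compile(r"\bярки\w{1,2} +впечатлени\w{1,3}\b", re.IGNORECASE)


def find_fuzzy(text: str) -> List[str]:
    """Return every fragment of the text that matches the fuzzy rule."""
    return [match.group(0) for match in FUZZY_RULE.finditer(text)]
```

A single rule therefore covers several inflected forms of the same phrase without listing each form separately.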

Fuzzy search makes the claimed method applicable to the automated selection of relevant readability recommendations for texts in synthetic languages with inflectional morphology, including Slavic languages such as Russian.

In search by initial forms, stop words are detected by their initial (dictionary) forms. For example, if the stop-word rules define the newspaper clichés «на заре», «на восходе» and «на исходе» ("at dawn", "at sunrise", "at the close") as stop words, they can be found with the rule «<%на%><><%заря|восход|исход%>». This rule identifies the lexemes «на заре», «на восходе» and «на исходе» in the source text as stop words regardless of the specific word form, which depends on the case of the noun.
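Search by initial forms can be sketched as a comparison of lemmas rather than surface forms. The example below is an illustration under stated assumptions: the `LEMMAS` table is a hypothetical stand-in for a real morphological analyzer, and the any-token `<>` between the two words is simplified to "the next word".

```python
import re
from typing import Dict, List

# Hypothetical toy lemma table; a real system would obtain initial forms
# from a morphological analyzer instead.
LEMMAS: Dict[str, str] = {
    "заре": "заря",
    "зарю": "заря",
    "восходе": "восход",
    "исходе": "исход",
}
TARGET_LEMMAS = {"заря", "восход", "исход"}


def lemma(word: str) -> str:
    """Return the initial form of a word, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def find_by_initial_forms(text: str) -> List[str]:
    """Find «на» followed by a word whose initial form is in TARGET_LEMMAS."""
    words = re.findall(r"\w+", text)
    hits = []
    for first, second in zip(words, words[1:]):
        if lemma(first) == "на" and lemma(second) in TARGET_LEMMAS:
            hits.append(f"{first} {second}")
    return hits
```

Because the comparison is made on lemmas, adding another case form of «заря» only requires a dictionary entry (or a better analyzer), not a new rule.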

In search by grammatical features, stop words in the source text are detected by their grammatical features. For example, if the stop-word rules define a sequence of three consecutive nouns as a stop word, the search establishes that «<сущ>» corresponds to nouns, and the rule «(<A|S, им><>){3,…}» specifies a sequence of three nouns and adjectives in the nominative case (им), which makes it possible to find a sequence of three nouns in the nominative case in the source text.
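A search of this kind can be sketched over tokens that already carry part-of-speech and case tags. The `Token` type and the function name below are hypothetical, the tags are assumed to come from an external morphological analyzer, and the rule is read as "three or more" tokens, which is an assumption about the elided upper bound.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    pos: str   # "S" noun, "A" adjective, "V" verb (tags as used in the rule)
    case: str  # "им" nominative, "род" genitive; "" when not applicable


def nominative_runs(tokens: List[Token], min_len: int = 3) -> List[List[str]]:
    """Return runs of at least min_len consecutive A/S tokens in the nominative."""
    runs: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token.pos in ("A", "S") and token.case == "им":
            current.append(token.text)
            continue
        if len(current) >= min_len:
            runs.append(current)
        current = []
    if len(current) >= min_len:
        runs.append(current)
    return runs
```

Only the grammatical tags are inspected, so the same function flags any such chain regardless of which words it contains.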

In search by punctuation, constructions that impair readability are detected by analyzing punctuation marks. For example, to highlight fragments of interrogative sentences that end in the words «что?», «чем?» or «чём» (case forms of "what"), the following rule can be specified:

<% что %> <% \? %>

All of the search methods above can be combined. For example, a single rule can require that:

1. The word is a noun or an adjective.

2. The word is not in the nominative case.

3. The initial form of the word begins with «ден».

4. The specific word form ends in «м».

This rule can be written as:

<сущ|прил, ~им % ден\w+ % \w+м>

Here сущ denotes a noun, прил an adjective, and ~им "not nominative". The rule finds the word «днём» ("by day") in the text, whose initial form is «день» ("day").
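The four conditions of the combined rule can be sketched as one predicate over a tagged token. The `Token` type and the function name are hypothetical, and the lemma, part of speech and case are assumed to be supplied by a morphological analyzer.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str   # the specific word form found in the text
    lemma: str  # its initial form
    pos: str    # "сущ" noun, "прил" adjective, and so on
    case: str   # "им" nominative, "твор" instrumental, and so on


def matches_combined_rule(token: Token) -> bool:
    """Check the rule <сущ|прил, ~им % ден\\w+ % \\w+м> against one token."""
    return (
        token.pos in ("сущ", "прил")                           # 1. noun or adjective
        and token.case != "им"                                 # 2. not nominative
        and re.fullmatch(r"ден\w+", token.lemma) is not None   # 3. lemma starts with «ден»
        and re.fullmatch(r"\w+м", token.text) is not None      # 4. form ends in «м»
    )
```

All four checks must hold at once, which is what distinguishes a combined rule from running the four searches independently.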

The claimed method also allows rules to be specified for several consecutive words. For example, a rule can look for the word «днём» only if it is preceded by the word «но» ("but") and is not followed by the word «это» ("this"). Such a rule can be written as:

< % но % > <> <сущ|прил, ~им % ден\w+ % \w+м> <> < % (?! это) % >
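The context conditions of such a rule can be sketched over a plain list of words. This is a simplified illustration: the target is matched by its surface form «днём» instead of the full combined predicate, the any-token `<>` separators are reduced to adjacency in the list, and a word at the very end of the list is treated as "not followed by «это»", which is an assumption.

```python
from typing import List


def find_dnem_in_context(words: List[str]) -> List[int]:
    """Return indexes of «днём» preceded by «но» and not followed by «это»."""
    hits = []
    for i, word in enumerate(words):
        if word.lower() != "днём":
            continue
        preceded_by_no = i > 0 and words[i - 1].lower() == "но"
        followed_by_eto = i + 1 < len(words) and words[i + 1].lower() == "это"
        if preceded_by_no and not followed_by_eto:
            hits.append(i)
    return hits
```

The negative condition plays the same role as the `(?! это)` lookahead in the rule: it rejects a match based on what follows without consuming it.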

The morphological and semantic analysis of the source text thus yields the stop words it contains.

Next, the readability of the text is scored and the score is displayed to the user. To compute the score, the number of words in the source text and the number of detected stop words are determined. The readability score is calculated from the ratio of the number of all words in the source text to the number of words identified as stop words.
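The basic ratio can be sketched in a few lines. The function name and the handling of a text without stop words (returning infinity) are assumptions; the description does not say how that case is treated.

```python
import math


def readability_ratio(total_words: int, stop_words: int) -> float:
    """Ratio of all words in the text to the words identified as stop words."""
    if total_words < 0 or stop_words < 0:
        raise ValueError("counts must be non-negative")
    if stop_words == 0:
        # Assumption: a text without stop words gets an unbounded ratio.
        return math.inf
    return total_words / stop_words
```

A larger ratio means fewer stop words per word of text, so under this reading a higher value corresponds to a more readable text.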

If the claimed method is used for subsequent automated publication of text through computer systems, such systems may block publication of a text whose readability score is below the established minimum score for publication.

Next, relevant recommendations for improving the readability of the text are displayed.

Recommendations are displayed for each detected stop word and depend on the rule by which the word in the source text was identified as a stop word. A recommendation may contain a natural-language description of the rule by which the word was identified as a stop word, the rule category of the detected stop word, "before and after" examples of editing stop words of that category, hyperlinks to a publication with more detailed recommendations on editing stop words of that category, and other recommendations that help the editor rework the stop word to make the text more readable and informative.

The claimed method may additionally include displaying the readability score of the text to the computer user based on the analysis of the source text.

The user relies on the readability score to decide whether the source text needs further editing or already scores high enough to be perceived by the reader as easy to read.

When scoring the readability of the source text, the claimed method may additionally take into account the categories of the rules by which the stop words were detected.

In that case, the number of words in the source text and the sum of the weights of the stop-word rules are determined. Each rule is assigned a conventional weight expressed as a number. The readability score is calculated from the categories and number of stop words found in the source text, as well as from the ratio of the number of words in the source text to the number of words identified as stop words.

For example, the readability of a text can be scored on a ten-point scale using the formula:

[formula not reproduced in the source text] - the number of points on a scale from 0 to 1 (the value may fall outside this range).

[formula not reproduced in the source text] - the final number of points on a scale from 0 to 10

where weights is the sum of the weights of the stop-word rules, words is the number of words in the source text, penalties is the sum of the scoring penalties for the stop words found, and score is the final readability score of the text on a ten-point scale.

For example, the following paragraph is analyzed:

«Кому не знакома боль в спине? Это, безусловно, одна из главных проблем современного человека. Доктора со всего мира ломают голову над вопросом, как избавить нас от этой напасти» ("Who is not familiar with back pain? It is certainly one of the main problems of modern man. Doctors from all over the world are racking their brains over how to rid us of this scourge.")

In this text, the claimed method detects the following stop words:

| Fragment | Category | Weight | Penalty |
|---|---|---|---|
| Кому не знакома боль в спине? ("Who is not familiar with back pain?") | syntax rules | 50 | 0 |
| безусловно ("certainly") | intensifier | 100 | 0 |
| одна из главных ("one of the main") | indefinite | 100 | 0 |
| современного человек ("modern man") | advertising cliché | 100 | 0 |
| ломают голову ("racking their brains") | newspaper cliché | 100 | 0 |
| напасти ("scourge") | newspaper cliché | 100 | 0 |
| нас ("us") | personal pronoun | 0 | 0 |
| от ("from") | prepositions | 30 | 0 |
| со всего мира ("from all over the world") | generalization | 100 | 0 |
| Total | | 680 | 0 |

The 28 words of the source text contain 9 stop words (words, phrases, constructions) from different categories.

Thus, taking the rule weights into account, the score on the ten-point scale is 4.3, which means that the readability and informativeness of the text are low.
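The totals that feed the scoring step of the worked example can be checked with a short script. The sketch below only aggregates the rows listed for the example paragraph; the variable names are illustrative, and the final scoring formula itself is not reproduced because it is not preserved in the text.

```python
from typing import List, Tuple

# (fragment, category, weight, penalty) rows of the worked example.
ROWS: List[Tuple[str, str, int, int]] = [
    ("Кому не знакома боль в спине?", "syntax rules", 50, 0),
    ("безусловно", "intensifier", 100, 0),
    ("одна из главных", "indefinite", 100, 0),
    ("современного человек", "advertising cliché", 100, 0),
    ("ломают голову", "newspaper cliché", 100, 0),
    ("напасти", "newspaper cliché", 100, 0),
    ("нас", "personal pronoun", 0, 0),
    ("от", "prepositions", 30, 0),
    ("со всего мира", "generalization", 100, 0),
]

stop_word_count = len(ROWS)                      # number of detected fragments
weights = sum(weight for _, _, weight, _ in ROWS)        # sum of rule weights
penalties = sum(penalty for _, _, _, penalty in ROWS)    # sum of penalties
```

The aggregated values match the totals stated for the example: 9 fragments, a weight sum of 680 and a penalty sum of 0.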

The claimed method may additionally include visually highlighting, in the source text, the stop words identified by the analysis. Highlighting lets the user see quickly which words of the source text were identified as stop words. For example, stop words can be underlined and/or shown in a color that differs from the color of the other words of the source text.
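Highlighting can be sketched as wrapping each detected span in markers. The function below is an illustration only: the name `highlight`, the span format (start and end character offsets) and the `[[` and `]]` markers are assumptions standing in for whatever underlining or coloring the user interface applies.

```python
from typing import List, Tuple


def highlight(text: str, spans: List[Tuple[int, int]],
              open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap each (start, end) span of the text in the given markers."""
    parts = []
    position = 0
    for start, end in sorted(spans):
        if start < position:
            continue  # skip a span that overlaps one already highlighted
        parts.append(text[position:start])
        parts.append(open_mark + text[start:end] + close_mark)
        position = end
    parts.append(text[position:])
    return "".join(parts)
```

For example, with `text = "This is certainly a problem"` and a span covering the word "certainly", the function returns the text with that word wrapped in the markers and everything else unchanged.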

The claimed method may additionally allow the computer user to edit the source text, after which the edited text is analyzed and recommendations for improving its readability are selected.

In that case, the user edits the source text following the readability recommendations, and the text is then analyzed again. To reduce the load on the executing computer, the repeated analysis may cover only the edited part rather than the whole text. Sentences that have not been analyzed before are selected, in particular changed and added sentences. The sentences edited by the user are then analyzed in the same way as the source text.
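Re-analyzing only changed or added sentences can be sketched with a cache keyed by sentence text. The class name, the naive sentence splitter and the `analyze_sentence` callback are assumptions made for illustration; the description does not specify how sentences are compared or stored.

```python
import re
from typing import Any, Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


class IncrementalAnalyzer:
    """Analyzes each distinct sentence once and reuses cached results."""

    def __init__(self, analyze_sentence: Callable[[str], Any]) -> None:
        self._analyze_sentence = analyze_sentence
        self._cache: Dict[str, Any] = {}
        self.analyzed_count = 0  # how many sentences were actually analyzed

    def analyze(self, text: str) -> List[Any]:
        results = []
        for sentence in split_sentences(text):
            if sentence not in self._cache:
                # New or edited sentence: run the full analysis once.
                self._cache[sentence] = self._analyze_sentence(sentence)
                self.analyzed_count += 1
            results.append(self._cache[sentence])
        return results
```

After the first pass over a text, a second call with one edited sentence triggers exactly one additional analysis, while unchanged sentences are served from the cache.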

The readability of the text is then scored with the edited sentences taken into account, and the score of the edited text is displayed to the user. Based on this score, the user may decide to continue or to finish editing the text.

Example embodiment of the invention

The claimed method can be implemented in an automated, computer-implemented decision support system for an editor (hereinafter, the DSS).

In that case, the system operates as follows.

The editor is shown a user interface on the computer monitor: the system's web interface, or its interface within third-party applications or plug-ins with which the system is integrated. The user interface contains two fields. The first field is for entering and displaying the source text. The second field displays the DSS recommendations for the stop word that the editor has selected in the source text.

At the first stage of the method, the editor enters the source text into the corresponding field of the DSS. The DSS can also receive the text directly from other computer systems through an API. As a result of this stage, the DSS obtains the source text for further analysis.

At the second stage of the method, the internal controller of the DSS performs morphological and semantic analysis of the text, interacting with the rule base and the cache of compiled rules held in the system or in external data storage. The analysis uses a linguistic analyzer module, a regular expression search module, a linguistic regular expression search module and a module of optimized search functions. Other modules aimed at the user's tasks can also be connected to the DSS.

Where necessary, the DSS highlights the stop words in the source text with underlining and color. The highlighting style may depend on the category of the rule by which the stop word was identified.

At the fourth stage of the method, the DSS runs the scoring function and calculates the readability score of the text on a ten-point scale. The score may be shown to the editor.

At the fifth stage of the method, the DSS displays recommendations for improving the readability of the text to the editor. The recommendations can be presented as follows: the editor moves the mouse cursor or the caret to a stop word highlighted by the DSS, and the DSS then displays the category of that stop word and advice on how to eliminate it. It may also display "before and after" examples of eliminating similar stop words and/or a hyperlink to a publication with more detailed recommendations on reworking stop words of that category. Such a publication may be an article, an audio or video file, or an interactive training simulator.

The editor then edits the text. The DSS may cache the text and analyze only edited or new sentences, without re-analyzing sentences the user has not changed. The DSS then recalculates the readability score on the ten-point scale and displays the score of the edited text.

Claims

1. A computer-implemented method for automated analysis of text and selection of relevant recommendations for improving its readability, comprising:

obtaining a source text;

detecting stop words in the source text in accordance with specified stop-word rules through morphological, lexical, semantic and syntactic analysis of the source text based on fuzzy search, search by initial forms of words, search by grammatical features and search by punctuation;

evaluating the readability of the source text based on calculating the ratio between the number of all words in the source text and the number of stop words detected in the source text;

displaying relevant recommendations for improving the readability of the text, including at least an indication of the stop-word rule according to which words in the source text were identified as stop words.

2. The method according to claim 1, further comprising displaying a readability score of the text to a computer user based on the analysis of the source text.

3. The method according to claim 1, wherein the evaluation of the readability of the source text, based on calculating the ratio between the number of all words in the source text and the number of stop words detected in the source text, takes into account the categories of the rules according to which the stop words were detected in the source text.

4. The method according to claim 1, further comprising visually highlighting, in the source text, the stop words identified by the analysis of the source text.

5. The method according to claim 1, further providing for editing of the source text by the computer user, followed by automated analysis of the sentences edited by the computer user, evaluation of the readability of the edited text and selection of relevant recommendations for improving the readability of the edited text.