EA038241B1

EA038241B1 - Method and system for searching for relevant news

Info

Publication number: EA038241B1
Application number: EA201990538A
Authority: EA
Inventors: Федор Борисович ФЕДОРОВ; Александра Евгеньевна ЛИПАЧЕВА; Владимир Алексеевич КУЗНЕЦОВ; Роман Владиславович ЧЕРКАСОВ
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2019-03-14
Filing date: 2019-03-19
Publication date: 2021-07-29
Also published as: RU2698916C1; WO2020185110A1; EA201990538A1

Abstract

The invention relates to the field of information technology, and more particularly to search engines designed to identify relevant information from different types of data sources. The technical result is that of enabling the formation of a related set of information from news sources with grouping according to companies that are the subject of news or given types of events. In a first preferred embodiment of the claimed invention, a computer-implemented method for searching for relevant news is proposed in which a set of news is received on a control server from at least one news aggregator server; an analysis of the received set of news is performed on the control server, which includes lemmatizing the texts of each piece of news from said set of news, processing the resulting lemmas of the news texts with the aid of a machine learning model containing a predetermined set of company data and a list of events, wherein for each event in the machine learning model a given set of lemmas is predetermined, identifying news containing lemmas that characterize given events, and forming links between the identified events and at least one company; whereupon, a list of relevant news is generated on the basis of the analysis performed.

Description

Technical field

This technical solution relates generally to the field of information technology, and in particular to search engines designed to identify relevant information from heterogeneous data sources.

State of the art

Data mining is currently an important component of various areas of business, especially analytics and forecasting. Data on topics of interest often come from publicly available resources on the Internet, such as news resources (websites, messenger channels, etc.).

The main problem in data analysis is aggregating an array of news sources, in particular linking actual events to companies for the purposes of subsequent search. As a rule, there are currently no effective tools for filtering collected news content in order to create aggregated arrays of information linked to the objects of news, for example companies. Various data collection algorithms are known from the prior art, for example the solution described in application WO 1999005614 (Louis Gay et al., published on 4 February 1999), which makes it possible to aggregate data from multiple sources and track the retrospective relevance of the collected data. Patent RU 2382401 (patent holder Microsoft Corporation, published on 20 February 2010) discloses an approach for analyzing and comparing collections of documents, according to which documents can be organized into groups by content or source and analyzed for intergroup and intragroup differences and commonalities. For example, comparing two groups of documents on the same topic obtained from two different sources, such as news coverage of an incident in different parts of the world, can reveal interesting differences of opinion and of overall interpretation. By moving from static collections to sets of articles generated over time, the evolution of the content can be examined. For example, a stream of news articles on a common story can be examined over time in order to single out the genuinely informative fresh news and filter out the many articles that largely convey the same thing.

A common disadvantage of the existing approaches is the lack of a way to identify relevant news by linking it to the object of the news, for example a company, and to the corresponding event associated with it, which prevents efficient collection of relevant information from multiple data sources.

Disclosure of the invention

The technical problem, or technical task, addressed by the claimed approach is to provide a process for searching for and forming a set of news linked to a given set of company names, as objects of news, and to events about which information appears in open data sources. The technical result achieved in solving this problem is the formation of a related set of information from news sources, grouped by the companies that are the objects of news and by specified event types.

An additional technical result is an increase in the accuracy of identifying information about companies for a given event type in publicly available information sources.

This technical result is achieved by a computer-implemented method for searching for relevant news, in which a set of news is received on a control server from at least one news aggregator server;

the received set of news is analyzed on the control server, the analysis including lemmatizing the text of each news item in said set;

processing the resulting lemmas of the news texts with a machine learning model that contains a predetermined set of company data and a list of events, wherein a given set of lemmas is established for each event in the machine learning model;

identifying news items containing lemmas that identify the given events, and forming links between the identified events and at least one company;

and a list of relevant news is generated on the basis of the analysis performed.

In one particular embodiment of the method, duplicate news items are filtered out when the set of news is received.

In another particular embodiment of the method, the filtering is performed by calculating the Jaccard measure between news signatures.
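By way of illustration only, the duplicate filtering can be sketched as follows. The detailed description later in this document names MinHash signatures and an example threshold of 0.7; the 3-word shingles, the signature length of 64 and the salted BLAKE2b hash family are assumptions made for this sketch and are not specified in the source.

```python
import hashlib
import re

NUM_HASHES = 64   # signature length (assumed, not from the source)
THRESHOLD = 0.7   # example threshold given in the description


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=NUM_HASHES):
    """For each salted hash function keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The share of matching positions estimates the Jaccard measure."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(news, threshold=THRESHOLD):
    """Drop the shorter item of each pair whose similarity exceeds the threshold."""
    signatures = [minhash_signature(shingles(text)) for text in news]
    dropped = set()
    for i in range(len(news)):
        for j in range(i + 1, len(news)):
            if i in dropped or j in dropped:
                continue
            if estimated_jaccard(signatures[i], signatures[j]) > threshold:
                dropped.add(i if len(news[i]) < len(news[j]) else j)
    return [text for index, text in enumerate(news) if index not in dropped]
```

Comparing every pair is quadratic in the number of news items; it is used here only to keep the sketch short.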

In another particular embodiment of the method, the events assigned to a news item about a company are stored in a database.

In another particular embodiment of the method, the news aggregator updates the list of news by means of information channels.

In another particular embodiment of the method, the information channels are websites on the Internet and/or messenger channels.

In another particular embodiment of the method, the analysis of the news determines whether an event belongs to the main company or to a subsidiary.

In another particular embodiment of the method, the affiliation of a company mentioned in a news item is determined using a decision tree algorithm.

In another particular embodiment of the method, the news texts are cleaned of punctuation marks, stop words and named entities during lemmatization.

In another particular embodiment of the method, a statistical measure is calculated for each lemma of the news text during lemmatization.

In another particular embodiment of the method, the machine learning algorithm is a logistic regression that classifies whether a news item belongs to an event on the basis of the statistical measure of its lemmas.

In another particular embodiment of the method, frequent phrases 2 to 10 lemmas long are identified for each news text.

In another particular embodiment of the method, the machine learning algorithm is gradient boosting trained to classify an event on the basis of the number of sentences containing lemmas that identify the event from the search query.

In another particular embodiment of the method, after an event from a news item has been assigned to a company, the lemmas and/or the sentences containing lemmas that identify said event are extracted.

In another preferred embodiment of the claimed solution, a system for searching for relevant news is provided, comprising at least one processor and at least one memory containing machine-readable instructions which, when executed by the at least one processor, perform the above method.

Brief description of the drawings

The features and advantages of the present technical solution will become apparent from the detailed description below and the accompanying drawings.

FIG. 1 illustrates the interaction of the elements of the claimed solution.

FIG. 2 illustrates the overall flow of the method.

FIG. 3 illustrates the processing of text data.

FIG. 4 shows an example of the graphical user interface for interacting with the relevant news selection service.

FIG. 5 illustrates a general view of a computing device.

Implementation of the invention

FIG. 1 shows the overall computing architecture (100) of the solution. The main functionality for collecting and processing information is performed on the control server (110), which receives information over a data transmission channel from the news aggregator server (120), which in turn is connected via the Internet (150) to a plurality of news resources (130). The server (110) interacts with users (10) to display the collected news information and provides additional functionality, which is disclosed later in the application.

The data transmission channel between the control server (110) and the news aggregator server (120) may be the Internet or an intranet. The news aggregator server (120) may itself consist of several devices belonging to different network environments, for example a set of servers, routers, clusters, etc. The data transmission channel can be organized using various known data transmission protocols, both wired and wireless, for example TCP/IP, 802.11, Ethernet, FTP, etc., supporting various kinds of network interaction, in particular LAN, WAN, PAN, WLAN, etc. The control server (110) performs the main processing of the information received from the news aggregator server (120), and stores and prepares data for display to users (10). The display can be produced with a dedicated graphical user interface. Users (10) can interact with the control server (110) through a web portal or another type of software application that provides access to the aggregated news information. Access can be provided, for example, through an API. Users (10) can interact using various electronic devices, for example a computer, laptop, smartphone, tablet, game console, smart wearable electronic device, thin client, or augmented, mixed or virtual reality device, etc.

The news aggregator server (120) is connected via the Internet (150) to various information resources (130), or information channels, that provide news information. Such resources (130) can be, for example, websites, messenger channels (Telegram™, WhatsApp™, Viber™, etc.) and social networks (Facebook™, VKontakte™, etc.). The received information can be stored on the server (110) in JSON format in a data store, for example a database. The source of the news information and the date it was posted on the corresponding resource (130) can be taken into account as well.
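As a minimal sketch of such storage, a news item can be serialized to JSON together with its source and posting date. The field names `text`, `source` and `published` and the example URL are hypothetical; the source does not define a record schema.

```python
import json
from datetime import date


def make_record(text, source, published):
    """Build a JSON-serializable news record (field names are hypothetical)."""
    return {"text": text, "source": source, "published": published.isoformat()}


def dumps_record(record):
    # ensure_ascii=False keeps Cyrillic news text readable in the stored JSON
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


record = make_record(
    "Example news text",
    "https://news.example/item/1",  # placeholder source URL
    date(2019, 3, 14),
)
stored = dumps_record(record)
restored = json.loads(stored)
```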

FIG. 2 shows the overall flow of the claimed method (200) of searching for relevant news information. Information from news sources, collected and stored on the news aggregator server (120), is transmitted (201) to the control server (110). Information from the news aggregator server (120) can be transmitted in online or offline mode. In online mode, data from the Internet (150) are transmitted as soon as they appear on a web resource to which the news aggregator server (120) is connected. In offline mode, news is stored on the news aggregator server (120), for example in a database, and is transmitted to the control server (110) at set times (for example every hour, once a day, etc.) or at the request of the control server (110).

Data from the news aggregator server (120) can be transmitted in various formats, for example xml, html, txt, etc. The data format may also vary depending on the mode in which information is transmitted to the control server (110). In addition to the text of the news item itself, the data contain information about the companies mentioned in the text.

The control server (110) holds a list containing the names of companies (2021) and events (2022) against which the incoming news information from the news aggregator server (120) is analyzed. These data are stored in the database of the control server (110). Events can be, for example, the seizure or freezing of a company's accounts, the bankruptcy of a company, lawsuits against a company, a collapse or rise in its share price, etc. The list of events (2022) and companies (2021) can be updated or changed over time. Relevant information is searched for in the data received from the news aggregator server (120) by processing (202) the received data array with a machine learning model that is trained to search the body of text for company names (2021) and corresponding events (2022) and to return a judgment on the relevance of the information concerned. Data processing on the server (110) is performed when a new data array is received from the news aggregator server (120) or according to a predetermined scenario. The scenario may be an automatic script that activates the machine learning model for data processing (202) at a set time.

During the processing step (202), the information store of the control server (110) is accessed; it contains the data from news sources (130) received from the news aggregator server (120). The information stored on the control server (110) is processed (202) with the machine learning model to identify relevant data and to link data from the news (203) to the corresponding event types.
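The linking of news to event types and companies can be pictured with the simplified sketch below. It replaces the trained machine learning model with a plain set overlap between the lemmas of a news item and a per-event lemma set, which is an assumption made purely for illustration; the event names, the lemma sets, the `min_hits` parameter and the record layout are all hypothetical.

```python
# Hypothetical per-event lemma sets; in the described method these are
# established for each event in the machine learning model.
EVENT_LEMMAS = {
    "bankruptcy": {"bankruptcy", "insolvent", "creditor"},
    "account_freeze": {"freeze", "arrest", "account"},
}


def detect_events(lemmas, event_lemmas=EVENT_LEMMAS, min_hits=2):
    """Return events whose lemma set shares at least min_hits lemmas with the text."""
    found = set(lemmas)
    return sorted(
        event for event, vocabulary in event_lemmas.items()
        if len(found & vocabulary) >= min_hits
    )


def link_events(news_items, companies):
    """Return (company, event, news id) triples for tracked companies."""
    links = []
    for item in news_items:
        events = detect_events(item["lemmas"])
        for company in item["companies"]:
            if company in companies:
                links.extend((company, event, item["id"]) for event in events)
    return links
```

For example, a news item whose lemmas include "freeze" and "account" and which mentions a tracked company yields one link to the hypothetical `account_freeze` event.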

FIG. 3 shows the process (300) of processing the news data received from the news aggregator server (120), which is carried out during steps (202)-(203). In the first step (301), the news text data received from the news aggregator server (120) are lemmatized, the text of each news item being split into lemmas. The news text and metadata are extracted from the files received. During lemmatization (301), the body of the news item is split into words at all punctuation separators and the words are then reduced to their normal form, for example using the pymorphy2 library. The text is then transformed; in particular, it is cleaned of punctuation marks, stop words (prepositions, conjunctions, pronouns) and named entities. A named entity here is any word that begins with a capital letter and is not the first word of a sentence. N-gram extraction (https://ru.wikipedia.org/wiki/N-грамма) can also be performed, in which the most frequent phrases 2 to 10 lemmas long are identified in the text. The list of the most frequent phrases was obtained by automatic analysis of a large text corpus and contains more than 9 million entries.
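The text preparation described above can be sketched roughly as follows. The sketch is self-contained and therefore uses lowercasing as a stand-in for real lemmatization; a morphological analyzer such as pymorphy2 could be supplied through the `normalize` parameter. The stop-word list, the sentence splitter and all function names are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be language-specific.
STOP_WORDS = {"the", "a", "of", "in", "and", "on", "it", "its", "for"}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def clean_lemmas(text, normalize=str.lower):
    """Tokenize, drop named entities and stop words, normalize the rest."""
    lemmas = []
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence)
        for position, word in enumerate(words):
            # Named entity heuristic from the description: a capitalized
            # word that is not the first word of its sentence.
            if position > 0 and word[0].isupper():
                continue
            lemma = normalize(word)
            if lemma not in STOP_WORDS:
                lemmas.append(lemma)
    return lemmas


def ngrams(lemmas, min_n=2, max_n=10):
    """Count all phrases of min_n to max_n consecutive lemmas."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(lemmas) - n + 1):
            counts[" ".join(lemmas[i:i + n])] += 1
    return counts
```

The capital-letter check is applied before normalization because lowercasing would erase the information it relies on.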

Incoming news also undergoes deduplication, during which repeated news items are filtered out. During deduplication, a MinHash signature (see https://en.wikipedia.org/wiki/MinHash) is calculated for each news item, after which the similarity of the signatures is calculated for each pair of news items using the Jaccard measure (also called the Jaccard coefficient). If the similarity of a pair of news items exceeds a given threshold, for example 0.7, the shorter of the two texts is considered a duplicate and is not processed further. In the next step (302), after the news texts have been lemmatized, the normalized text is processed. The news text is searched for the names of companies that have no homonyms (for example, Sberbank™). All capitalized phrases and phrases in quotation marks are found, and the lemmas of the phrases found are then looked up in the list of companies stored in the database of the server (110). The company names found are classified as a main company or an additional company (i.e. one that is only indirectly mentioned in the news text). A company is considered main if it is the subject of the news item, and additional if its name is merely mentioned in the body of the news item. The classification is performed with a machine learning model, in particular a decision-making algorithm, for example decision trees. The list of decision tree features is as follows.


1) The number of the sentence in which the company is first mentioned (0 if it is the headline).

2) The number of the sentence in which the company is first mentioned, normalized by the number of sentences.

3) The length of the text in characters; the classification threshold is 0.5.
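The two preparatory steps described above, near-duplicate filtering and extraction of the primary/secondary features, can be illustrated with a minimal sketch. It is not the patented implementation. The shingle size, the number of hash functions, the MD5-based hashing, the substring matching of company names and all function names are assumptions made for illustration. Only the 0.7 threshold, the "drop the shorter text" rule and the three features come from the description.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; the description does not specify it


def shingles(text, k=3):
    """Set of k-word shingles of a text (assumed tokenization and shingle size)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode("utf-8")
        signature.append(min(
            int.from_bytes(hashlib.md5(prefix + s.encode("utf-8")).digest()[:8], "big")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates the Jaccard measure."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(news, threshold=0.7):
    """Drop the shorter text of every pair whose similarity exceeds the threshold."""
    signatures = [minhash_signature(text) for text in news]
    dropped = set()
    for i in range(len(news)):
        for j in range(i + 1, len(news)):
            if i in dropped or j in dropped:
                continue
            if estimated_jaccard(signatures[i], signatures[j]) > threshold:
                dropped.add(i if len(news[i]) < len(news[j]) else j)
    return [text for idx, text in enumerate(news) if idx not in dropped]


def role_features(sentences, company):
    """Three features for the primary/secondary classifier.

    sentences[0] is assumed to be the headline, so a first mention there gives 0.
    Substring matching stands in for the lemma lookup described in the text.
    """
    first = next(
        (i for i, s in enumerate(sentences) if company.lower() in s.lower()), None
    )
    if first is None:
        return None
    return {
        "first_mention": first,
        "first_mention_normalized": first / len(sentences),
        "text_length": len(" ".join(sentences)),
    }
```

In this sketch the feature dictionary would be passed to a trained decision tree, whose output would be compared with the 0.5 classification threshold. Training the tree is outside the scope of the example.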

To determine the relevance of a given event to the companies mentioned in the body of the news, the lemmas obtained from the news body are processed at step (303) using machine learning models.

As one example of a machine learning model, logistic regression may be applied to the TF-IDF statistical measure computed for the lemmas of the text (see https://ru.wikipedia.org/wiki/TFIDF). The statistical measure is computed for each lemma in the text, after which a pre-trained logistic regression makes a judgment on the resulting features. In addition to processing with the machine learning model, a list of given lemmas is compiled; for example, the list may contain 3040 lemmas that have the greatest weight in the logistic regression. The list is built for each event after the logistic regression has been trained. The model outputs a weight for each lemma, and lemmas are selected for the list on the basis of these weights.

If the probability value produced by the model (303) is above a predetermined threshold and the news text contains at least one lemma from said list, then at least one event is assigned to the news item (304) for the company name identified in the text.
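The assignment rule above combines two conditions: a model probability above a threshold, and the presence of at least one high-weight lemma. A minimal pure-Python sketch of that rule follows. The IDF values, weights, bias and threshold are hypothetical stand-ins for a model trained elsewhere, and the function names are illustrative rather than taken from the source.

```python
import math
from collections import Counter


def tf_idf(lemmas, idf):
    """TF-IDF per lemma: frequency within this text times a precomputed IDF."""
    counts = Counter(lemmas)
    total = len(lemmas)
    return {lemma: (n / total) * idf.get(lemma, 0.0) for lemma, n in counts.items()}


def event_probability(features, weights, bias):
    """Apply a pre-trained logistic regression: sigmoid of the weighted sum."""
    z = bias + sum(weights.get(lemma, 0.0) * value for lemma, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def top_lemmas(weights, n):
    """The n lemmas with the greatest weight in the trained model."""
    return set(sorted(weights, key=weights.get, reverse=True)[:n])


def assign_event(lemmas, idf, weights, bias, threshold, key_lemmas):
    """Assign the event only if the probability exceeds the threshold
    and the text contains at least one lemma from the per-event list."""
    if not lemmas:
        return False
    probability = event_probability(tf_idf(lemmas, idf), weights, bias)
    return probability > threshold and any(lemma in key_lemmas for lemma in lemmas)


# Hypothetical parameters for a single "bankruptcy" event (illustration only).
IDF = {"bankruptcy": 2.0, "court": 1.5, "file": 1.0, "company": 0.1}
WEIGHTS = {"bankruptcy": 4.0, "court": 2.0, "company": -0.5}
BIAS = -1.0
KEY_LEMMAS = top_lemmas(WEIGHTS, 2)

print(assign_event(["company", "file", "bankruptcy", "court"],
                   IDF, WEIGHTS, BIAS, 0.5, KEY_LEMMAS))
```

Requiring a key lemma in addition to the probability threshold acts as a simple guard: a text that scores highly only through many weakly weighted lemmas is not assigned the event.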

Additionally, a set of lemmas may be specified for each event, for example 10-15 lemmas that best correspond to the event. These are selected from the previously determined list of lemmas, and if the event was assigned to the news item at step (304), all lemmas from said set that are found in the text are highlighted in the text.

A second example of applying a machine learning model is a classification algorithm based on gradient boosting, for example LightGBM (https://lightgbm.readthedocs.io). For each news text, the number of sentences containing pairs of lemmas characteristic of the event is counted. The pairs of characteristic lemmas are selected for each event during training of the classifier. The characteristic lemmas (and their number) are selected automatically during training.

A judgment is made on the features obtained in this way (pairs of lemmas) using said machine learning model (303). If the resulting probability value is above a preselected threshold, the event is assigned to the news item (304) for one or more of the companies mentioned in the news. Additionally, sentences containing pairs of lemmas characteristic of the event may be highlighted in each text. If no relevant events are identified for the specified company names during processing of the news data, such information is disregarded (305).
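The pair-based features used in the second example can be sketched as follows. Each sentence is represented as a list of lemmas, and the feature for a lemma pair is the number of sentences containing both lemmas. The `predict_probability` argument is a hypothetical stand-in for a trained gradient boosting classifier such as LightGBM. The lemma pairs shown are invented for illustration; in the description they are selected automatically during training.

```python
def pair_sentence_counts(sentences, pairs):
    """For each lemma pair, count the sentences that contain both lemmas."""
    lemma_sets = [set(sentence) for sentence in sentences]
    return [sum(1 for s in lemma_sets if a in s and b in s) for a, b in pairs]


def sentences_with_pairs(sentences, pairs):
    """Indices of sentences containing at least one characteristic pair,
    i.e. the sentences that could be highlighted in the text."""
    return [
        i for i, sentence in enumerate(sentences)
        if any(a in sentence and b in sentence for a, b in pairs)
    ]


def event_assigned(sentences, pairs, predict_probability, threshold):
    """Assign the event if the classifier probability exceeds the threshold.

    predict_probability is an assumed callable mapping a feature vector
    to a probability; a trained classifier would be plugged in here.
    """
    return predict_probability(pair_sentence_counts(sentences, pairs)) > threshold


# Illustrative input: lemmatized sentences and hypothetical characteristic pairs.
SENTENCES = [
    ["court", "declare", "company", "bankrupt"],
    ["company", "appoint", "director"],
    ["bankrupt", "company", "sell", "asset"],
    ["analyst", "comment"],
]
PAIRS = [("company", "bankrupt"), ("appoint", "director")]

print(pair_sentence_counts(SENTENCES, PAIRS))
print(sentences_with_pairs(SENTENCES, PAIRS))
```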

FIG. 4 shows an example of a graphical user interface (400) for interacting with the service for selecting relevant news information. The interface (400) provides functionality for displaying and managing the content of the data provided. A search query is formed using the company name input panel (401). The main field (404), which displays current or retrieved information, presents a list of the companies for which relevant information from the database of the server (110) is being identified.

Companies in the field (404) may be displayed in various orders, for example alphabetically, by number of news items, etc. The information may be filtered by a time range set in the date input field (402).

The interface (400) also contains a control panel (403) for configuring search query parameters. The control panel (403) can be used to configure the detection of particular types of events, link companies, configure service parameters, etc. The field (405) displays a list of the news sources identified in accordance with the given events for the companies.

Users (10) may also set up a notification function for selected company names. Notifications of newly arrived news may be delivered by e-mail, push notifications, SMS notifications, etc. When setting up the notification function, the user (10) can configure the required parameters, for example the company name and the type of events associated with companies.

The information generated from the processed news may also be displayed using a filter configured according to the role of the user (10) interacting with the interface (400). Depending on the parameters of the account of the user (10), only news containing the type of events associated with the user's role may be displayed.
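The role-based filter described above can be pictured with a short sketch. The role names, event types, the `NewsItem` fields and the mapping from roles to event types are all hypothetical. The description states only that the news shown depends on the event types associated with the user's role.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    title: str
    company: str
    event_type: str


# Hypothetical mapping from user role to the event types shown to that role.
ROLE_EVENT_TYPES = {
    "risk_manager": {"bankruptcy", "lawsuit"},
    "account_manager": {"merger", "management_change"},
}


def visible_news(items, role):
    """Keep only news whose event type is associated with the user's role."""
    allowed = ROLE_EVENT_TYPES.get(role, set())
    return [item for item in items if item.event_type in allowed]


ITEMS = [
    NewsItem("Court opens bankruptcy case", "Company A", "bankruptcy"),
    NewsItem("New chief executive appointed", "Company B", "management_change"),
]

print([item.title for item in visible_news(ITEMS, "risk_manager")])
```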

FIG. 5 shows an example of a general view of a device (500) that implements the presented solution. A wide range of computing devices can be implemented on the basis of the device (500), for example the control server (110), the news aggregator server (120), user devices (10), etc. In general, the device (500) contains, connected by a common data bus, one or more processors (501), memory such as RAM (502) and ROM (503), input/output interfaces (504), input/output devices (505) and a networking device (506).

The processor (501) (or several processors, a multi-core processor, etc.) may be selected from the range of devices in wide use at present, for example from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, etc. The processor, or one of the processors used in the device (500), should also be understood to include a graphics processor, for example an NVIDIA or Graphcore GPU, which is likewise suitable for fully or partially performing the method (200) and may also be used for training and applying machine learning models in various information systems.

The RAM (502) is random access memory intended for storing machine-readable instructions that are executed by the processor (501) to perform the necessary logical data processing operations. The RAM (502) typically contains executable instructions of the operating system and of the corresponding software components (applications, software modules, etc.). The available memory of a graphics card or graphics processor may also serve as the RAM (502).

The ROM (503) is one or more means of permanent data storage, for example a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

Various types of I/O interfaces (504) are used to organize the operation of the components of the device (500) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they may be, without limitation, PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc. To enable user interaction with the computing system (500), various information I/O means (505) are used, for example a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, indicator lights, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

The networking means (506) provides data transmission over an internal or external computer network, for example an intranet, the Internet, a LAN, etc. One or more of the means (506) may be, without limitation, an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc. Satellite navigation means, for example GPS, GLONASS, BeiDou or Galileo, may additionally be used as part of the device (500).

The specific choice of elements of the device (500) for implementing various hardware and software architectures may vary, provided that the functionality required of the given type of device is preserved. The materials presented disclose preferred examples of implementing the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to those skilled in the relevant art.

Claims


1. A computer-implemented method for searching for relevant news, comprising the steps of:

receiving, on a control server, a set of news from at least one news aggregator server;

performing, on the control server, an analysis of the received set of news, which includes lemmatizing the text of each news item from said set of news;

processing the obtained lemmas of the news texts using a machine learning model that contains a predetermined set of company data and a list of events, wherein a given set of lemmas is predetermined for each event in the machine learning model;

identifying news containing lemmas that characterize given events, and forming links between the identified events and at least one company;

generating a list of relevant news based on the analysis of the set of news.

2. The method according to claim 1, characterized in that when a set of news is received, duplicate news is filtered.

3. A method according to claim 2, characterized in that the filtering is performed by calculating the Jaccard measure between news signatures.

4. A method according to claim 1, characterized in that the events assigned to company news are stored in a database.

5. The method according to claim 1, characterized in that the news aggregator updates the news list using information channels.

6. The method according to claim 5, characterized in that the information channels are websites on the Internet and / or messenger channels.

7. The method according to claim 1, characterized in that during the analysis of news, it is determined whether the event relates to a primary or a secondary company.

8. The method according to claim 7, characterized in that the affiliation of the company mentioned in the news is determined using a decision tree algorithm.

9. The method according to claim 1, characterized in that during the lemmatization of news texts, the texts are cleared of punctuation marks, stop words and named entities.

10. The method according to claim 1, characterized in that during the lemmatization for each lemma of the text of the news, a statistical measure is calculated.

11. The method according to claim 10, characterized in that the machine learning algorithm is a logistic regression that classifies whether the news belongs to the event based on an analysis of the statistical measure of the lemmas.

12. The method according to claim 1, characterized in that for each news text, frequent phrases with a length of 2 to 10 lemmas are determined.

13. The method of claim 1, wherein the machine learning algorithm is a gradient boosting trained to classify an event based on the number of sentences containing lemmas identifying the event from a search query.

14. The method according to claim 1, characterized in that after an event is assigned to the company news, lemmas and/or sentences containing lemmas identifying said event are extracted from the news.

15. A device for searching relevant news, comprising at least one processor and at least one memory containing machine-readable instructions, which, when executed by at least one processor, execute the method according to any one of claims 1-14.

16. A system for searching for relevant news, comprising:

at least one control server;

at least one news aggregator server configured to receive news data from at least one news source,

wherein the control server is configured to perform:

receiving news data from at least one news aggregator server;

analysis of the received data, during which the text of each news item from said set of news is lemmatized;

identifying news containing lemmas that characterize given events, and forming links between the identified events and at least one company;

generating a list of relevant news based on the analysis of the set of news.