WO2023211304A1 - System and method for collecting and processing news on the Internet

System and method for collecting and processing news on the Internet

Info

Publication number
WO2023211304A1
Authority
WO
WIPO (PCT)
Prior art keywords
news
processing
text
algorithm
database
Prior art date
Application number
PCT/RU2022/000146
Other languages
English (en)
Russian (ru)
Inventor
Михаил Юрьевич ШЕВЦОВ
Андрей Михайлович КОЗЛОВ
Александр Дмитриевич ИВАНОВ
Павел Сергеевич ЗУБИЦКИЙ
Илья Александрович МАЛЫШЕВ
Original Assignee
Публичное Акционерное Общество "Сбербанк России"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Публичное Акционерное Общество "Сбербанк России"
Priority claimed from RU2022111786A (patent RU2795678C1)
Publication of WO2023211304A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data

Definitions

  • The claimed solution relates to the field of computer technology, in particular to an automated system for collecting data on the Internet.
  • Application US 20070198459 A1 discloses a system for online analysis of information sources, containing a module for collecting information from the network and an analytics module that analyzes retrospective changes in data within the analyzed news topic.
  • The claimed system solves the technical problem of increasing the accuracy of the collected information by checking the collected information for semantically coherent text characterizing the news source.
  • The technical result is an increase in the accuracy of news data collection through analysis of website news feeds for the presence of semantically coherent text in news sources.
  • The claimed technical result is achieved through the implementation of a system for collecting and processing news on the Internet, containing: an analyzer module configured to search for domain names on the Internet containing news sources, analyze the HTML code of web pages of the corresponding domain names to identify news feeds, determine the type of each news feed and the algorithm for processing the corresponding feed to extract links to text information from a news source, and transfer the identified links to news feeds, their type and processing algorithm to the database; a scraping module configured to process data stored in the database, in which the saved links to news feeds are processed using the web resource markup analysis algorithm defined by the analyzer module, the link to the web resource is followed, links are checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; a parsing module configured to receive HTML code from the scraping module and extract text information from the HTML code using at least two algorithms for collecting text data, each of which selects an HTML node with the largest ratio of characters characterizing the coherent text of a
  • The presence of links, their number and signs of keyword matches corresponding to the news source are determined.
  • The scraping module is configured to analyze feeds of the following types:
  • The claimed technical result is also achieved by implementing a method for collecting and processing news on the Internet, performed using a processor and containing the stages of: searching for domain names on the Internet containing news sources; analyzing the HTML code of web pages of the corresponding domain names to identify news feeds; determining the type of each news feed and the algorithm for processing the corresponding feed to extract links to text information of the news source; transmitting the identified links to news feeds, their type and processing algorithm to the database; processing the data stored in the database, during which the saved links to news feeds are processed using an algorithm for analyzing the markup of a web resource, the link to the web resource is followed, the link is checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; extracting, from the received HTML code, text information using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to their total number; processing the extraction results of each algorithm by a machine learning model
  • FIG. 1 shows a conceptual diagram of the claimed solution.
  • FIG. 2 shows an example of HTML code extracted from a resource by the analyzer module.
  • FIG. 3 shows an example of extracting links from HTML code.
  • FIG. 4 shows an example of recording a link to a news source in the database.
  • FIG. 5 shows an example of an XPath expression.
  • FIG. 6 shows an example of HTML feed processing.
  • FIG. 7 shows an example of extracted text from a news source.
  • FIG. 8 shows a general diagram of a computing device.
  • FIG. 1 shows a general diagram of the claimed system (130), which collects information from websites containing news sources (110).
  • The system (130) can be implemented on a single computing device, for example a server, or as a software and hardware complex in which each element is located on a separate computer and is connected to the other elements via an information network to provide a single functionality.
  • The system (130) contains a set of modules that implement the specified functionality.
  • The modules can be implemented as software and hardware solutions (for example, a system on a chip, microcontrollers, etc.) or as software modules operating within a single program that implements the operation algorithm of the system (130) on a computing device.
  • The system (130) collects information from the Internet through an analyzer module (131) that connects to websites with news sources (110) through an information network (120).
  • The analyzer module (131) searches for domain names on the Internet containing news sources (110). After connecting to the sources, the module (131) analyzes the HTML code of the web pages of the corresponding domain names to identify news feeds. News sources are analyzed by examining the main page of a web resource, as well as all pages of the first nesting level. FIG. 2 shows an example of extracting HTML code from the source (110) at the domain https://press.sber.ru.
  • The module (131) uses two algorithms, rssfinder and htmlfinder, which analyze web pages and identify links to RSS feeds or HTML feeds.
  • An example of defining links to news feeds is shown in Fig. 3.
  • The module (131) determines the type of each news feed and the algorithm for processing the corresponding feed to extract a link to the text information of the news source.
  • The rssfinder algorithm runs first, because RSS feeds are easier to process; if rssfinder finds nothing, the htmlfinder algorithm is run. A link may be incorrect or the source may be unavailable (no response from the server), so the feed type is determined during identification and depends on which algorithm returned values and on whether the source server responded at all.
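The RSS-first fallback described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source names rssfinder and htmlfinder but does not disclose their internals, so the functions `find_rss`, `find_html_feeds` and `detect_feeds`, the MIME type set and the keyword tuple are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: feeds are advertised through <link type="..."> autodiscovery tags.
FEED_MIME_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/feed+json",
}


class LinkCollector(HTMLParser):
    """Collects feed autodiscovery links and ordinary anchor links."""

    def __init__(self):
        super().__init__()
        self.feed_links = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "link" and (attrs.get("type") or "").lower() in FEED_MIME_TYPES:
            self.feed_links.append(href)
        elif tag == "a":
            self.anchors.append(href)


def find_rss(html):
    """Hypothetical stand-in for rssfinder: return advertised feed links."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.feed_links


def find_html_feeds(html, keywords=("news", "articles")):
    """Hypothetical stand-in for htmlfinder: anchors whose URL contains a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.anchors if any(k in h.lower() for k in keywords)]


def detect_feeds(html):
    """Try the RSS search first; fall back to the HTML search if it finds nothing."""
    rss = find_rss(html)
    if rss:
        return "rss", rss
    links = find_html_feeds(html)
    if links:
        return "html", links
    return None, []
```

The returned feed type is simply the name of whichever search produced results, which mirrors the statement that the type depends on which algorithm returned values.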
  • The presence of links, their number and signs of matches are also determined using keywords corresponding to the news source, for example "rss", "feed", "news", "articles", or using exclusions (".png", ".pdf", and patterns such as '.*login.*', '.*/([ l -]*[-_][ l -]*)+ $', etc.).
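A minimal sketch of this keyword and exclusion check is given below. The keyword list, the two file suffixes and the '.*login.*' pattern come from the text; the second exclusion pattern is garbled in the source and is deliberately left out. The function names and the rule "keep a link only if it has at least one keyword hit" are assumptions made for illustration.

```python
import re

INCLUDE_KEYWORDS = ("rss", "feed", "news", "articles")
EXCLUDE_SUFFIXES = (".png", ".pdf")
EXCLUDE_PATTERNS = (re.compile(r".*login.*"),)


def keyword_hits(url):
    """Count how many of the include keywords occur in the URL."""
    lowered = url.lower()
    return sum(1 for keyword in INCLUDE_KEYWORDS if keyword in lowered)


def is_excluded(url):
    """True if the URL ends with an excluded suffix or matches an excluded pattern."""
    lowered = url.lower()
    if lowered.endswith(EXCLUDE_SUFFIXES):
        return True
    return any(pattern.fullmatch(lowered) for pattern in EXCLUDE_PATTERNS)


def candidate_links(urls):
    """Return (url, hit_count) pairs for links that look like news feeds."""
    result = []
    for url in urls:
        if is_excluded(url):
            continue
        hits = keyword_hits(url)
        if hits > 0:  # assumption: at least one keyword match is required
            result.append((url, hits))
    return result
```

For example, `candidate_links(["https://example.com/news/feed", "https://example.com/login?next=/news"])` keeps only the first URL, with a hit count of 2.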
  • Identified links to news feeds, as well as their type (HTML or RSS) and the applicable processing algorithm for subsequent extraction of links to news feeds are transferred to the database (132).
  • The information stored in the database (132) is further processed by the scraping (133) and parsing (134) modules.
  • The scraping module (133) processes the stored links to news feeds using the web resource markup analysis algorithm defined by the analyzer module (131): it follows the link to the source web resource (110), checks the link for duplication against the information stored in the database (132), and obtains the HTML code for subsequent processing of text data by the parsing module (134).
  • An example of extracting links from HTML code is shown in Fig. 3.
  • The scraping module (133) runs continuously and iteratively processes the table of feed links from the database (132). In asynchronous mode, the module (133) runs three loops that support three types of feeds: RSS, covering the RSS, Atom and JSON standards (type 1); HTML, covering regular HTML pages (type 2); and HTML pages processed using XPath expressions (type 3), for which the path to news links is configured manually. An example of an XPath expression is shown in Fig. 5.
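The three concurrent loops can be sketched with asyncio as shown below. This is an assumption-laden illustration: the type labels "rss", "html" and "xpath", the `(feed_type, url)` table layout and the placeholder `process_feed` coroutine are hypothetical, and real network access and link extraction are omitted.

```python
import asyncio

FEED_TYPES = ("rss", "html", "xpath")


async def process_feed(feed_type, url):
    """Placeholder for fetching one feed and extracting its news links."""
    await asyncio.sleep(0)  # yields control, standing in for network I/O
    return f"{feed_type}:{url}"


async def feed_loop(feed_type, urls, results):
    """One loop: processes only the feeds that match its own type."""
    for url in urls:
        results.append(await process_feed(feed_type, url))


async def run_once(feed_table):
    """Run one pass over the feed table with one loop per feed type."""
    results = []
    by_type = {t: [u for ft, u in feed_table if ft == t] for t in FEED_TYPES}
    await asyncio.gather(
        *(feed_loop(t, by_type[t], results) for t in FEED_TYPES)
    )
    return results
```

A continuously running module would call `run_once` repeatedly, re-reading the feed table from the database before each pass.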
  • Each of the loops processes the subset of links corresponding to its algorithm: it accesses the link to the source (110) and analyzes the resulting HTML code to extract links to news data.
  • An example of intermediate processing for an HTML feed is shown in Fig. 6. All received links to news are checked for duplication by querying the database (132); if a link is already contained in the database (132), it is excluded from processing, otherwise it is recorded in the database (132) and passed on for further processing.
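The duplicate check can be sketched with SQLite as follows. The source does not specify the database engine or schema, so the `news_links` table, its single `url` column and the helper names are assumptions; the sketch relies on a primary key so that the "already stored" test and the "record it" step happen in one statement.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the hypothetical news_links table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS news_links (url TEXT PRIMARY KEY)")
    return conn


def filter_new_links(conn, urls):
    """Store unseen links and return them; links already stored are skipped."""
    fresh = []
    for url in urls:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO news_links (url) VALUES (?)", (url,)
        )
        if cursor.rowcount == 1:  # a row was inserted, so the link is new
            fresh.append(url)
    conn.commit()
    return fresh
```

Only the links returned by `filter_new_links` would be passed on to the parsing stage.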
  • The parsing module (134) processes the HTML code received from the scraping module (133). During operation of the module (134), text information is extracted from the HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to their total number.
  • An HTML node is understood as a hierarchical node of HTML markup, for example, <head>, <body>, etc.
  • One of the algorithms used is based on measuring the number of non-whitespace characters in the source HTML node. Another algorithm analyzes HTML nodes based on the amount of useful text, and extracts text from the nodes that have gained more weight. Testing these algorithms on the same data set showed that they yield different sets of high-quality texts.
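One way such a ratio-based selection could work is sketched below. It is not the patented algorithm: here "coherent text" is approximated as non-whitespace characters outside links, the total is all non-whitespace characters in the node, and the `min_chars` threshold, the tie-break on text length and the class and function names are assumptions. The sketch also assumes well-nested HTML.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIPPED_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.plain = 0   # non-whitespace characters outside links
        self.total = 0   # all non-whitespace characters
        self.parts = []  # text fragments outside links


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [Node("#root")]
        self.closed = []
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(Node(tag))
        self.link_depth += tag == "a"
        self.skip_depth += tag in SKIPPED_TAGS

    def handle_endtag(self, tag):
        # Stray or mismatched end tags are ignored (well-nested input assumed).
        if tag in VOID_TAGS or len(self.stack) == 1 or self.stack[-1].tag != tag:
            return
        node = self.stack.pop()
        self.link_depth -= tag == "a"
        self.skip_depth -= tag in SKIPPED_TAGS
        parent = self.stack[-1]
        parent.plain += node.plain
        parent.total += node.total
        parent.parts += node.parts
        self.closed.append(node)

    def handle_data(self, data):
        count = len("".join(data.split()))
        if self.skip_depth or not count:
            return
        node = self.stack[-1]
        node.total += count
        if not self.link_depth:
            node.plain += count
            node.parts.append(data.strip())


def extract_text(html, min_chars=20):
    """Return the text of the node with the highest plain/total character ratio."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [n for n in parser.closed if n.plain >= min_chars]
    if not candidates:
        return ""
    best = max(candidates, key=lambda n: (n.plain / n.total, n.plain))
    return " ".join(best.parts)
```

Breaking ties by text length makes the sketch prefer the article container over its individual paragraphs when both consist entirely of plain text.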
  • The algorithms run in parallel, and their results are evaluated and compared by a machine learning model, for example a neural network trained on reference news texts taken from news sources.
  • The machine learning model used within the parsing module (134) analyzes the presence of characteristics inherent in sources that are not news sources. Such characteristics are, as a rule, stop words and special characters (for example, telephone numbers, sequences of numbers, etc.).
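The kind of input such a model might receive can be illustrated with a small feature extractor. This is only a sketch of hand-built features, not the machine learning model itself: the regular expressions, the feature names and the thresholds inside them (for instance, treating five or more consecutive digits as a number sequence) are assumptions, and stop-word features are omitted because the source does not list the stop words.

```python
import re

# Assumption: a phone number is a digit, 8+ digits/spaces/brackets/hyphens, a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s()\-]{8,}\d")
# Assumption: a "sequence of numbers" is a run of five or more digits.
DIGIT_RUN_RE = re.compile(r"\d{5,}")


def non_news_features(text):
    """Compute simple indicators of text that is unlikely to be a news article."""
    chars = [c for c in text if not c.isspace()]
    count = len(chars) or 1  # avoid division by zero on empty input
    return {
        "phone_numbers": len(PHONE_RE.findall(text)),
        "digit_runs": len(DIGIT_RUN_RE.findall(text)),
        "digit_ratio": sum(c.isdigit() for c in chars) / count,
        "special_ratio": sum(not c.isalnum() for c in chars) / count,
    }
```

A classifier trained on reference news texts could take such a feature dictionary for each candidate extraction and score how likely it is to come from a non-news source.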
  • From the results of the above algorithms, the most semantically coherent text, which clearly characterizes the news source, is identified.
  • The resulting text is stored in the database (132) for subsequent provision to the user or transmission to an automated system for selecting news by keywords.
  • An example of the extracted text is shown in Fig. 7.
  • The claimed system (130) can be implemented on a single computing device (200), for example, a server.
  • FIG. 8 shows a general view of such a computing device (200).
  • A computing device (200) contains the following components connected by a common information exchange bus: one or more processors (201), memory devices such as RAM (202) and ROM (203), I/O interfaces (204), input/output devices (205), and a device for network communication (206).
  • The processor (201) may be selected from a variety of devices commonly used today, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc.
  • A graphics processor, for example from Nvidia, AMD, Graphcore, etc., can also be used as the processor (201).
  • The RAM (202) is random access memory designed to store machine-readable instructions executed by the processor (201) to perform the necessary logical data processing operations.
  • The RAM (202) typically contains executable operating system instructions and associated software components (applications, program modules, etc.).
  • The ROM (203) is one or more permanent storage devices, such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
  • Various types of I/O interfaces (204) are used to organize the operation of the components of the device (200) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they can be, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.
  • Various I/O means (205) are used, for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.
  • The network communication means (206) allows the device (200) to transmit data via an internal or external computer network, for example, an intranet, the Internet, a LAN, etc.
  • One or more means (206) may be used, including but not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.
  • Satellite navigation tools can also be used as part of the device (200), for example, GPS, GLONASS, BeiDou, Galileo.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of computer technology. The system (130) searches for domain names on the Internet (120) that contain news sources (110). The HTML code of web pages is analyzed by an analyzer module (131) to identify news feeds. The type of each news feed and a processing algorithm for extracting links to text information of the news source are determined. The identified links to news feeds, together with their type and processing algorithm, are transferred to a database (132). The saved links to news feeds are processed in a scraping module (133) using a web resource markup analysis algorithm: the link to the web resource is followed, the link is checked for duplication against the information stored in the database, and the HTML code is obtained. Based on the obtained HTML code, a parsing module (134) extracts text information using text data collection algorithms, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to their total number. The extraction results of each algorithm are processed by a machine learning model to analyze sources that are not news sources. The aim of the invention is to increase the accuracy of collecting and processing text information from a web page.
PCT/RU2022/000146 2022-04-29 2022-04-29 System and method for collecting and processing news on the Internet WO2023211304A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2022111786 2022-04-29
RU2022111786A RU2795678C1 (ru) 2022-04-29 System and method for collecting and processing news on the Internet

Publications (1)

Publication Number Publication Date
WO2023211304A1 (fr) 2023-11-02

Family

ID=88519360

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2022/000146 WO2023211304A1 (fr) 2022-04-29 2022-04-29 System and method for collecting and processing news on the Internet

Country Status (1)

Country Link
WO (1) WO2023211304A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114324A1 (en) * 2003-09-14 2005-05-26 Yaron Mayer System and method for improved searching on the internet or similar networks and especially improved MetaNews and/or improved automatically generated newspapers
US20070198459A1 (en) * 2006-02-14 2007-08-23 Boone Gary N System and method for online information analysis
RU2405197C2 (ru) * 2004-02-12 2010-11-27 Майкрософт Корпорейшн Веб-кролинг на основе теории статистических решений и прогнозирование изменения веб-страницы
US20150106157A1 (en) * 2013-10-15 2015-04-16 Adobe Systems Incorporated Text extraction module for contextual analysis engine
US20190213488A1 (en) * 2016-09-02 2019-07-11 Hithink Financial Services Inc. Systems and methods for semantic analysis based on knowledge graph

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22940413

Country of ref document: EP

Kind code of ref document: A1