WO2023211304A1

WO2023211304A1 - System and method for collecting and processing news from the internet

Info

Publication number: WO2023211304A1
Application number: PCT/RU2022/000146
Authority: WO
Inventors: Михаил Юрьевич ШЕВЦОВ; Андрей Михайлович КОЗЛОВ; Александр Дмитриевич ИВАНОВ; Павел Сергеевич ЗУБИЦКИЙ; Илья Александрович МАЛЫШЕВ
Original assignee: Public Joint-Stock Company "Sberbank of Russia" (Публичное Акционерное Общество "Сбербанк России")
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2023-11-02

Abstract

The invention relates to the field of computer technologies. A system (130) searches the Internet (120) for domain names containing news sources (110). An analysis module (131) analyzes the HTML code of the corresponding websites to detect news feeds. The news feed type and a processing algorithm are determined in order to extract links to the textual information of a news source. The identified links to news feeds, the news feed type and the processing algorithm are transmitted to a database (132). A scraping module (133) processes the saved links to news feeds using an algorithm for analyzing the markup of a web resource: the web resource is accessed via a link, the link is checked for duplication against the information saved in the database, and the HTML code is obtained. A parsing module (134) extracts textual information from the obtained HTML code using text data collection algorithms, each of which selects the HTML node with the greatest ratio of characters characterizing coherent text of a news source to the total number of characters. The extraction results of each algorithm are processed by a machine learning model that analyzes characteristics typical of sources that are not news sources. The invention is directed toward more accurate collection and processing of textual information from websites.

Description

SYSTEM AND METHOD FOR COLLECTING AND PROCESSING NEWS ON THE INTERNET

TECHNICAL FIELD

[0001] The claimed solution relates to the field of computer technology, in particular, to an automated system for collecting data on the Internet.

BACKGROUND OF THE ART

[0002] Automated news gathering on the Internet is widely used today. Various methods of parsing information from news sources are often used, making it possible to download data from web resources for their subsequent processing.

[0003] Application US 20070198459 A1 (Current Assignee Accenture Global Services Ltd, 08/23/2007) discloses a system for online analysis of information sources, containing a module for collecting information from the network and an analytics module that analyzes retrospective changes in data within the analyzed news topic.

[0004] The disadvantage of this solution is the lack of a mechanism for checking the semantic coherence of the text presented in a particular news source. Without such a mechanism it is not possible to check the quality of information posted on the network, or whether it actually corresponds to a news source rather than to another type of data, for example an advertisement. Solutions of this kind therefore collect data without first analyzing it in relation to the news source, which reduces the relevance and quality of the collected information.

SUMMARY OF THE INVENTION

[0005] The claimed system solves the technical problem of increasing the accuracy of the collected information by checking it for semantically coherent text characterizing a news source.

[0006] The technical result is to increase the accuracy of news data collection by analyzing website news feeds for the presence of semantically coherent text in news sources.

[0007] The claimed technical result is achieved through the implementation of a system for collecting and processing news on the Internet, containing: an analyzer module configured to: search the Internet for domain names containing news sources; analyze the HTML code of web pages of the corresponding domain names to identify news feeds; determine the type of each news feed and the algorithm for processing the corresponding feed to extract links to the text information of a news source; and transfer the identified links to news feeds, their type and the processing algorithm to a database; a scraping module configured to process data stored in the database, in the course of which the saved links to news feeds are processed using the web resource markup analysis algorithm defined by the analyzer module, whereby the link to the web resource is followed, the links are checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; and a parsing module configured to: receive HTML code from the scraping module; extract text information from the HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of a news source to the total number of characters; process the extraction results of each algorithm with a machine learning model, wherein the model is configured to analyze the presence of characteristics inherent in sources that are not news sources, the characteristics being at least stop words and special characters; detect semantically coherent text characterizing a news source; and save the extracted text to the database.

[0008] In one particular implementation example, the HTML code is analyzed for the main page of the web resource and for all pages of the 1st nesting level.

[0009] In another particular implementation example, the presence of links, their number and signs of matches for keywords corresponding to the news source are determined.

[0010] In another particular implementation example, the scraping module is configured to analyze feeds of the following types:

- RSS - RSS, Atom, JSON standards;

- HTML pages;

- HTML pages processed using XPATH expressions.

[0011] The claimed technical result is also achieved by implementing a method for collecting and processing news on the Internet, performed using a processor and containing the stages of: searching the Internet for domain names containing news sources; analyzing the HTML code of web pages of the corresponding domain names to identify news feeds; determining the type of each news feed and the algorithm for processing the corresponding feed to extract links to the text information of the news source; transmitting the identified links to news feeds, their type and the processing algorithm to the database; processing the data stored in the database, during which the saved links to news feeds are processed using an algorithm for analyzing the markup of a web resource, whereby the link to the web resource is followed, the link is checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; extracting text information from the obtained HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to the total number of characters; processing the extraction results of each algorithm with a machine learning model, wherein the model is configured to analyze the presence of characteristics inherent in sources that are not news sources, the characteristics being at least stop words and special characters; detecting semantically coherent text characterizing a news source; and saving the extracted text to the database.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 shows a conceptual diagram of the claimed solution.

[0013] FIG. 2 shows an example of HTML code extracted from a resource by the analyzer module.

[0014] FIG. 3 shows an example of extracting links from HTML code.

[0015] FIG. 4 shows an example of recording a link to a news source in the database.

[0016] FIG. 5 shows an example of an XPATH expression.

[0017] FIG. 6 shows an example of HTML feed processing.

[0018] FIG. 7 shows an example of extracted text from a news source.

[0019] FIG. 8 shows a general diagram of a computing device.

IMPLEMENTATION OF THE INVENTION

[0020] FIG. 1 shows a general diagram of the claimed system (130), which collects information from websites containing news sources (110). The system (130) can be implemented on a single computing device, for example a server, or as a software and hardware complex in which each element runs on a separate computer and interacts with the other elements over an information network.

[0021] The system (130) contains a set of modules that implement the specified functionality. The modules can be implemented structurally as software and hardware solutions (for example, a system on a chip, microcontrollers, etc.) or as software modules operating within a single software application that implements the operation algorithm of the system (130) on a computing device.

[0022] The system (130) collects information from the Internet through an analyzer module (131) that connects to websites with news sources (110) through an information network (120).

[0023] The analyzer module (131) searches the Internet for domain names containing news sources (110). After connecting to the sources, the module (131) analyzes the HTML code of the web pages of the corresponding domain names to identify news feeds. News sources are analyzed by examining the main page of a web resource as well as all pages of the 1st nesting level. FIG. 2 shows an example of extracting HTML code from the source (110) at the domain https://press.sber.ru.
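The collection of first-level pages described above can be sketched as follows. This is a minimal illustration only: it assumes that "pages of the 1st nesting level" means same-domain pages linked directly from the main page, and the names `LinkCollector` and `first_level_pages` are hypothetical and do not come from the application.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_level_pages(base_url, html):
    """Return same-domain URLs linked from the main page, without repeats.

    Assumption: the 1st nesting level is taken to be the set of pages
    reachable by one click from the main page.
    """
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    pages = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).netloc == host and url not in pages:
            pages.append(url)
    return pages
```

The HTML of each returned page would then be analyzed in the same way as the main page.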

[0024] The module (131) uses two algorithms, rssfinder and htmlfinder, which analyze web pages and identify links to RSS feeds and HTML feeds respectively. An example of identifying links to news feeds is shown in Fig. 3. After identifying one or more news feeds, the module (131) determines the type of each feed and the algorithm for processing it to extract links to the text information of the news source. The rssfinder algorithm runs first because RSS feeds are easier to process; if rssfinder finds nothing, the htmlfinder algorithm is run. A link may be incorrect or the source may be unavailable (no response from the server); the feed type is therefore determined during identification and depends on which algorithm returned values and on whether a response was received from the source server.
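The "RSS first, HTML as fallback" order can be illustrated with a short sketch. The names rssfinder and htmlfinder are taken from the description, but their bodies here are assumptions: the application does not disclose how they work, so this version simply looks for feed autodiscovery `<link>` tags and, failing that, for anchors whose address contains "news".

```python
from html.parser import HTMLParser

# MIME types commonly used for feed autodiscovery (an assumption of this sketch).
FEED_MIME_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/feed+json",
}


class FeedLinkParser(HTMLParser):
    """Collects feed <link> tags and ordinary <a> links from a page."""

    def __init__(self):
        super().__init__()
        self.feed_links = []
        self.anchor_links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("type") in FEED_MIME_TYPES and a.get("href"):
            self.feed_links.append(a["href"])
        elif tag == "a" and a.get("href"):
            self.anchor_links.append(a["href"])


def _parse(html):
    parser = FeedLinkParser()
    parser.feed(html)
    return parser


def rssfinder(html):
    """Hypothetical stand-in: return links declared as RSS/Atom/JSON feeds."""
    return _parse(html).feed_links


def htmlfinder(html):
    """Hypothetical stand-in: return anchors that look like news sections."""
    return [h for h in _parse(html).anchor_links if "news" in h.lower()]


def detect_feed(html):
    """Try rssfinder first; fall back to htmlfinder if it finds nothing."""
    links = rssfinder(html)
    if links:
        return "RSS", links
    links = htmlfinder(html)
    if links:
        return "HTML", links
    return None, []
```

The feed type returned by `detect_feed` depends only on which finder produced results, which mirrors the behaviour described in the paragraph above.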

[0025] During operation, the module (131) also determines the presence of links, their number, and whether they match keywords associated with a news source, for example "rss", "feed", "news", "articles", or exclusion criteria (".png", ".pdf", and patterns such as '.*login.*', '.*/([ ^l -]*[-_][ ^l -]*)+ $', etc.). The identified links to news feeds, together with their type (HTML or RSS) and the applicable processing algorithm for the subsequent extraction of links to news, are transferred to the database (132). An example of a record in the database is presented in Fig. 4.
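A keyword filter of this kind might look like the sketch below. The include keywords, the excluded extensions and the '.*login.*' pattern are taken from the paragraph above; the second exclusion pattern is not legible in the source and is deliberately left out. The function name `is_candidate_feed_link` is hypothetical.

```python
import re

INCLUDE_KEYWORDS = ("rss", "feed", "news", "articles")
EXCLUDE_SUFFIXES = (".png", ".pdf")
# Only the pattern that is legible in the source is used here.
EXCLUDE_PATTERNS = (re.compile(r".*login.*"),)


def is_candidate_feed_link(url):
    """Return True if the URL matches a news keyword and no exclusion rule."""
    lowered = url.lower()
    if lowered.endswith(EXCLUDE_SUFFIXES):
        return False
    if any(p.match(lowered) for p in EXCLUDE_PATTERNS):
        return False
    return any(keyword in lowered for keyword in INCLUDE_KEYWORDS)
```

For example, `is_candidate_feed_link("https://example.com/news/")` is accepted, while a link to `logo.png` under the same path is rejected by the extension rule.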

[0026] The information stored in the database (132) is further processed by the scraping (133) and parsing (134) modules. The scraping module (133) processes the stored links to news feeds using the web resource markup analysis algorithm defined by the analyzer module (131): it follows the link to the source web resource (110), checks the extracted links for duplication against the information stored in the database (132), and obtains the HTML code for subsequent processing of text data by the parsing module (134). An example of extracting links from HTML code is shown in Fig. 3.

[0027] The scraping module (133) runs continuously and iteratively processes the table of feed links from the database (132). In asynchronous mode, the module (133) runs three loops that support the processing of three types of feeds: RSS - RSS, Atom, JSON standards (type 1); HTML - regular HTML pages (type 2); and HTML pages processed using XPATH expressions (type 3), for which the path to news links is configured manually. An example of an XPATH expression is shown in Fig. 5.
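The three concurrent loops can be sketched with `asyncio`. This is an illustration under stated assumptions, not the disclosed implementation: the handlers are stubs that stand in for real network and parsing work, the row layout (`url`, `type`) is invented for the example, and a single pass over the table is shown instead of continuous operation.

```python
import asyncio


# Hypothetical per-type handlers; real ones would fetch and parse the feed.
async def handle_rss(url):
    await asyncio.sleep(0)  # stands in for network I/O
    return f"rss:{url}"


async def handle_html(url):
    await asyncio.sleep(0)
    return f"html:{url}"


async def handle_xpath(url):
    await asyncio.sleep(0)
    return f"xpath:{url}"


HANDLERS = {"rss": handle_rss, "html": handle_html, "xpath": handle_xpath}


async def feed_loop(feed_type, rows, results):
    """One loop: process only the rows of its own feed type."""
    handler = HANDLERS[feed_type]
    for row in rows:
        if row["type"] == feed_type:
            results.append(await handler(row["url"]))


async def run_pass(rows):
    """Run the three loops concurrently over one snapshot of the table."""
    results = []
    await asyncio.gather(*(feed_loop(t, rows, results) for t in HANDLERS))
    return results
```

A continuously running module would wrap `run_pass` in an outer loop that re-reads the table of feed links on each iteration.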

[0028] Each of the loops processes the subset of links corresponding to its algorithm: the link to the source (110) is accessed, and the resulting HTML code is analyzed to extract links to news items. An example of intermediate processing for an HTML feed is shown in Fig. 6. All received links to news are checked for duplication by querying the database (132); if a link is already contained in the database (132), it is excluded from processing, otherwise it is recorded in the database (132) and passed on for further processing.
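The duplicate check can be sketched with an in-memory SQLite table. The table name `news_links` and the function names are assumptions made for the example; the application does not specify the database engine or schema.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a table of already-seen news links."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news_links (url TEXT PRIMARY KEY)")
    return conn


def filter_new_links(conn, links):
    """Record unseen links and return only those, in their original order."""
    fresh = []
    for url in links:
        # INSERT OR IGNORE skips rows that violate the primary key,
        # so rowcount is 1 only when the link was not stored before.
        cur = conn.execute(
            "INSERT OR IGNORE INTO news_links (url) VALUES (?)", (url,)
        )
        if cur.rowcount == 1:
            fresh.append(url)
    conn.commit()
    return fresh
```

Links returned by `filter_new_links` are the ones that would be passed on for further processing; links already in the table are silently dropped.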

[0029] The parsing module (134) processes the HTML code received from the scraping module (133). During the operation of the module (134), text information is extracted from the HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to the total number of characters. An HTML node is understood as a hierarchical node of HTML markup, for example, <head>, <body>, etc.
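One way to read the ratio criterion is as a text-density measure: for each element, non-whitespace text characters divided by text characters plus markup characters. The sketch below implements that reading only; the exact character classes and thresholds used in the application are not disclosed, so `min_chars`, the tag sets and the names `DensityParser` and `best_node` are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}  # their content is not article text


class DensityParser(HTMLParser):
    """Records (tag, text, markup_chars) for every closed element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements: [tag, text_parts, markup_chars]
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        markup = len(self.get_starttag_text())
        if tag in VOID_TAGS:
            if self.stack:
                self.stack[-1][2] += markup
            return
        self.stack.append([tag, [], markup])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # unbalanced markup is ignored in this sketch
        tag, parts, markup = self.stack.pop()
        markup += len(tag) + 3  # length of the closing tag, e.g. "</p>"
        raw = "".join(parts)
        if tag in SKIP_TAGS:
            markup += len(raw)  # count script/style bodies as markup
            raw = ""
        self.nodes.append((tag, " ".join(raw.split()), markup))
        if self.stack:  # propagate text and markup to the parent
            self.stack[-1][1].append(raw)
            self.stack[-1][2] += markup


def best_node(html, min_chars=40):
    """Return (tag, text) of the node with the highest text-to-total ratio."""
    parser = DensityParser()
    parser.feed(html)

    def text_chars(text):
        return sum(1 for ch in text if not ch.isspace())

    candidates = [n for n in parser.nodes if text_chars(n[1]) >= min_chars]
    if not candidates:
        return None
    tag, text, _ = max(
        candidates,
        key=lambda n: text_chars(n[1]) / (text_chars(n[1]) + n[2]),
    )
    return tag, text
```

The `min_chars` threshold keeps very short nodes, such as single menu links, from winning simply because they contain almost no markup.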

[0030] One of the algorithms used is based on measuring the number of non-whitespace characters in the source HTML node. Another algorithm analyzes HTML nodes based on the amount of useful text and extracts text from the nodes that have gained the most weight. Testing these algorithms on the same data set showed that they yield different sets of high-quality texts. The algorithms run in parallel, and their results are evaluated and compared by a machine learning model, for example a neural network trained on examples of news sources that serve as reference news texts. The machine learning model used within the parsing module (134) analyzes the presence of characteristics inherent in sources that are not news sources. Characteristics of this kind are, as a rule, stop words and special characters (for example, telephone numbers, sequences of digits, etc.). Based on the model's processing of the results of the above algorithms, the most semantically coherent text, which clearly characterizes the news source, is identified. The resulting text is then stored in the database (132) for subsequent provision to the user or transmission to an automated system for selecting news by keywords. An example of the extracted text is shown in Fig. 7.
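The kind of characteristics the model looks at can be illustrated with simple hand-computed features. This sketch is not the trained model from the application: the stop-word list, the digit-run pattern and the additive penalty are all assumptions chosen to show how stop words and special characters could separate non-news text from news text when comparing the outputs of two extraction algorithms.

```python
import re

# Hypothetical stop words typical of non-news blocks (menus, ads, banners).
STOP_WORDS = {"subscribe", "advertisement", "cookies", "login"}
# Runs of digits with separators, e.g. telephone numbers or long numeric IDs.
DIGIT_RUN = re.compile(r"\d[\d\-\s()]{5,}\d")
ORDINARY_PUNCTUATION = ".,!?;:'\"-()"


def non_news_features(text):
    """Compute shares of stop words and special characters in a text."""
    words = re.findall(r"\w+", text.lower())
    stop_share = sum(w in STOP_WORDS for w in words) / max(len(words), 1)
    special = sum(
        1
        for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in ORDINARY_PUNCTUATION)
    )
    return {
        "stop_share": stop_share,
        "special_share": special / max(len(text), 1),
        "digit_runs": len(DIGIT_RUN.findall(text)),
    }


def pick_more_coherent(candidates):
    """Pick the candidate with the fewest non-news characteristics.

    A hand-written penalty stands in for the machine learning model here.
    """

    def penalty(text):
        f = non_news_features(text)
        return f["stop_share"] + f["special_share"] + 0.1 * f["digit_runs"]

    return min(candidates, key=penalty)
```

In a real system these features, or the raw text itself, would be fed to a trained classifier rather than summed with fixed weights.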

[0031] The claimed system (130) can be implemented on the basis of a single computing device (200), for example, a server. FIG. 8 shows a general view of such a computing device (200).

[0032] In general, a computing device (200) contains one or more processors (201), memory devices such as RAM (202) and ROM (203), I/O interfaces (204), input/output devices (205), and a network communication device (206), connected by a common information exchange bus.

[0033] The processor (201) (or multiple processors, or a multi-core processor) may be selected from a variety of devices commonly used today, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. A graphics processor, for example from Nvidia, AMD, Graphcore, etc., can also be used as the processor (201).

[0034] RAM (202) is a random access memory and is designed to store machine-readable instructions executed by the processor (201) for performing the necessary logical data processing operations. The RAM (202) typically contains executable operating system instructions and associated software components (applications, program modules, etc.).

[0035] The ROM (203) is one or more permanent storage devices, such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0036] Various types of I/O interfaces (204) are used to organize the operation of the components of the device (200) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they can be, but are not limited to: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0037] To ensure user interaction with the computing device (200), various input/output means (205) are used, for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.

[0038] The network communication means (206) allows the device (200) to transmit data via an internal or external computer network, for example, an Intranet, the Internet, a LAN, etc. One or more means (206) may be used, including but not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.

[0039] Additionally, satellite navigation tools can also be used as part of the device (200), for example, GPS, GLONASS, BeiDou, Galileo.

[0040] The submitted application materials disclose preferred examples of implementation of a technical solution and should not be interpreted as limiting other, particular examples of its implementation that do not go beyond the scope of the requested legal protection, which are obvious to specialists in the relevant field of technology.

Claims

1. A system for collecting and processing news on the Internet, containing: an analyzer module configured to: search the Internet for domain names containing news sources; analyze the HTML code of web pages of the corresponding domain names to identify news feeds; determine the type of each news feed and the algorithm for processing the corresponding feed to extract links to the text information of a news source; and transfer the identified links to news feeds, their type and the processing algorithm to a database; a scraping module configured to process data stored in the database, in the course of which the saved links to news feeds are processed using the web resource markup analysis algorithm defined by the analyzer module, whereby the link to the web resource is followed, the links are checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; and a parsing module configured to: receive HTML code from the scraping module; extract text information from the HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of a news source to the total number of characters; process the extraction results of each algorithm with a machine learning model, wherein the model is configured to analyze the presence of characteristics inherent in sources that are not news sources, the characteristics being at least stop words and special characters; detect semantically coherent text characterizing a news source; and save the extracted text to the database.

2. The system according to claim 1, characterized by the fact that the HTML code is analyzed for the main page of the web resource and for all pages of the 1st nesting level.

3. The system according to claim 2, characterized by the fact that the presence of links, their number and signs of matches for keywords corresponding to the news source are determined.

4. The system according to claim 1, characterized in that the scraping module is configured to analyze the following types of feeds:

- RSS - RSS, Atom, JSON standards;

- HTML pages;

- HTML pages processed using XPATH expressions.

5. A method for collecting and processing news on the Internet, performed using a processor and containing the stages of: searching the Internet for domain names containing news sources; analyzing the HTML code of web pages of the corresponding domain names to identify news feeds; determining the type of each news feed and the algorithm for processing the corresponding feed to extract links to the text information of the news source; transmitting the identified links to news feeds, their type and the processing algorithm to the database; processing the data stored in the database, during which the saved links to news feeds are processed using an algorithm for analyzing the markup of a web resource, whereby the link to the web resource is followed, the link is checked for duplication against the information stored in the database, and HTML code is obtained for subsequent processing of text data; extracting text information from the obtained HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to the total number of characters; processing the extraction results of each algorithm with a machine learning model, wherein the model is configured to analyze the presence of characteristics inherent in sources that are not news sources, the characteristics being at least stop words and special characters; detecting semantically coherent text characterizing a news source; and saving the extracted text to the database.
