WO2023211304A1 - System and method for collecting and processing news on the Internet
- Publication number
- WO2023211304A1 (application PCT/RU2022/000146)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- news
- processing
- text
- algorithm
- database
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F40/00—Handling natural language data
Definitions
- the claimed solution relates to the field of computer technology, in particular, to an automated system for collecting data on the Internet.
- Application US 20070198459 A1 discloses a system for online analysis of information sources, containing a module for collecting information from the network and an analytics module that analyzes retrospective changes in data within the analyzed news topic.
- The claimed system addresses the technical problem of increasing the accuracy of collected information by checking that information for semantically coherent text characterizing the news source.
- The technical result is an increase in the accuracy of news data collection, achieved by analyzing website news feeds for the presence of semantically coherent text in news sources.
- The claimed technical result is achieved by a system for collecting and processing news on the Internet, containing: an analyzer module configured to search the Internet for domain names containing news sources, analyze the HTML code of the web pages of those domain names to identify news feeds, determine the type of each news feed and the algorithm for processing it to extract links to the text information of a news source, and transfer the identified links to news feeds, their type, and the processing algorithm to the database; a scraping module configured to process the data stored in the database by processing the saved links to news feeds with the web resource markup analysis algorithm defined by the analyzer module, which involves following a link to a web resource, checking the link for duplication against the information stored in the database, and obtaining the HTML code for subsequent processing of text data; a parsing module configured to receive the HTML code from the scraping module and extract text information from it using at least two algorithms for collecting text data, each of which selects an HTML node with the largest ratio of characters characterizing the coherent text of a
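- The presence of links, their number, and matches against keywords corresponding to the news source are also determined.
- The scraping module is configured to analyze feeds of the following types: RSS feeds (RSS, Atom, and JSON standards), regular HTML pages, and HTML pages processed using XPATH expressions.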
- The claimed technical result is also achieved by a method for collecting and processing news on the Internet, performed using a processor and containing the stages of: searching the Internet for domain names containing news sources; analyzing the HTML code of the web pages of those domain names to identify news feeds; determining the type of each news feed and the algorithm for processing it to extract links to the text information of the news source; transferring the identified links to news feeds, their type, and the processing algorithm to the database; processing the data stored in the database, during which the saved links to news feeds are processed using a web resource markup analysis algorithm by following a link to a web resource, checking the link for duplication against the information stored in the database, and obtaining the HTML code for subsequent processing of text data; extracting text information from the received HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to their total number; processing the extraction results of each algorithm by a machine learning model
- FIG. 1 shows a conceptual diagram of the claimed solution.
- FIG. 2 shows an example of HTML code extracted from a resource by the analyzer module.
- FIG. 3 shows an example of extracting links from HTML code.
- FIG. 4 shows an example of recording a link to a news source in the database.
- FIG. 5 shows an example of an XPATH expression.
- FIG. 6 shows an example of HTML feed processing.
- FIG. 7 shows an example of text extracted from a news source.
- FIG. 8 shows a general diagram of a computing device.
- FIG. 1 shows a general diagram of the claimed system (130), which collects information from websites containing news sources (110).
- the system (130) can be implemented on the basis of a single computing device, for example, a server, or it can be a software and hardware complex in which each of its elements is located on a separate computer, connected within a single functionality with other elements via an information network.
- the system (130) contains a set of modules that implement the specified functionality.
- The modules can be implemented as combined software and hardware solutions (for example, a system on a chip, microcontrollers, etc.) or as software modules operating within a single software package that implements the operation algorithm of the system (130) on a computing device.
- the system (130) collects information from the Internet through an analyzer module (131) that connects to websites with news sources (110) through an information network (120).
- The analyzer module (131) searches the Internet for domain names containing news sources (PO). After connecting to the sources, the module (131) analyzes the HTML code of the web pages of the corresponding domain names to identify news feeds. News sources are analyzed by examining the main page of a web resource as well as all pages of the first nesting level. FIG. 2 shows an example of HTML code extracted from the source (PO) at the domain https://press.sber.ru.
- The module (131) uses two algorithms, rssfinder and htmlfinder, which analyze web pages and identify links to RSS feeds or HTML feeds.
- An example of defining links to news feeds is shown in Fig. 3.
- the module (131) determines the type of news feeds and the algorithm for processing the corresponding feed to extract a link to the text information of the news source.
- The rssfinder algorithm runs first because RSS feeds are easier to process; the htmlfinder algorithm is activated only if rssfinder finds nothing. The link may be incorrect or the source may be unavailable (no response from the server), so the feed type is determined during identification from two things: which algorithm returned values, and whether the source server responded at all.
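The fallback order described above can be illustrated with a minimal sketch. The names `rssfinder` and `htmlfinder` come from the text, but their bodies here are assumptions: the patent does not disclose how either algorithm inspects the page. This version assumes that RSS feeds are advertised through `<link rel="alternate">` tags and that HTML feeds are anchors whose address contains a news-related hint; `detect_feeds`, `FEED_MIME_TYPES`, and `NEWS_HINTS` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed MIME types for feed autodiscovery; not taken from the patent text.
FEED_MIME_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}
# Assumed hints that an anchor points to an HTML news feed.
NEWS_HINTS = ("news", "articles", "feed", "rss")


class _LinkCollector(HTMLParser):
    """Collects the attributes of <link> tags and the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_tags = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link":
            self.link_tags.append(attr_map)
        elif tag == "a" and attr_map.get("href"):
            self.anchors.append(attr_map["href"])


def _collect(html):
    collector = _LinkCollector()
    collector.feed(html)
    return collector


def rssfinder(html, base_url):
    """Return feed URLs advertised through <link rel="alternate"> tags."""
    return [
        urljoin(base_url, tag["href"])
        for tag in _collect(html).link_tags
        if tag.get("rel") == "alternate"
        and tag.get("type") in FEED_MIME_TYPES
        and tag.get("href")
    ]


def htmlfinder(html, base_url):
    """Return anchors whose address looks like an HTML news feed."""
    return [
        urljoin(base_url, href)
        for href in _collect(html).anchors
        if any(hint in href.lower() for hint in NEWS_HINTS)
    ]


def detect_feeds(html, base_url):
    """Run rssfinder first; fall back to htmlfinder only if it finds nothing."""
    feeds = rssfinder(html, base_url)
    if feeds:
        return "RSS", feeds
    feeds = htmlfinder(html, base_url)
    if feeds:
        return "HTML", feeds
    return None, []
```

The returned feed type depends on which finder produced values, which mirrors the rule stated in the text. Handling of an unreachable server is left out because the sketch works on HTML that has already been downloaded.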
- The presence of links, their number, and matches are also determined using keywords that correspond to a news source, such as "rss", "feed", "news", and "articles", or that exclude a link, such as the extensions ".png" and ".pdf" and the patterns '.*login.*', '.*/([ l -]*[-_][ l -]*)+ $', etc.
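A keyword filter of this kind might look like the sketch below. It uses only the keywords, extensions, and the '.*login.*' pattern that are legible in the text; the second exclusion pattern is not reproduced because its character classes cannot be read reliably from the source. The function names `is_candidate` and `summarize_links` are hypothetical.

```python
import re
from urllib.parse import urlparse

INCLUDE_KEYWORDS = ("rss", "feed", "news", "articles")
EXCLUDED_EXTENSIONS = (".png", ".pdf")
EXCLUDED_PATTERNS = (re.compile(r".*login.*"),)


def is_candidate(url):
    """Return True if the URL matches a news keyword and no exclusion rule."""
    lowered = url.lower()
    if urlparse(lowered).path.endswith(EXCLUDED_EXTENSIONS):
        return False
    if any(pattern.match(lowered) for pattern in EXCLUDED_PATTERNS):
        return False
    return any(keyword in lowered for keyword in INCLUDE_KEYWORDS)


def summarize_links(urls):
    """Report whether candidate links exist, how many, and which ones."""
    candidates = [url for url in urls if is_candidate(url)]
    return {"has_links": bool(candidates), "count": len(candidates), "links": candidates}
```

The summary carries the three signals named in the text: presence of links, their number, and the keyword matches themselves.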
- Identified links to news feeds, as well as their type (HTML or RSS) and the applicable processing algorithm for subsequent extraction of links to news feeds are transferred to the database (132).
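As an illustration of what is transferred to the database, the sketch below stores one record per feed with its link, type, and processing algorithm. The schema, the use of SQLite, and the names `FeedRecord`, `open_feed_store`, and `save_feed` are assumptions; the patent does not specify the storage engine or table layout.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class FeedRecord:
    url: str
    feed_type: str   # "RSS" or "HTML", as named in the text
    algorithm: str   # for example "rssfinder" or "htmlfinder"


def open_feed_store(path=":memory:"):
    """Open the store and create the (hypothetical) feeds table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feeds ("
        "url TEXT PRIMARY KEY, feed_type TEXT NOT NULL, algorithm TEXT NOT NULL)"
    )
    return conn


def save_feed(conn, record):
    """Insert a feed record, replacing an earlier record for the same URL."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO feeds VALUES (?, ?, ?)", astuple(record))
```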
- the stored information in the database (132) is further processed using scraping (133) and parsing (134) modules.
- the scraping module (133) ensures the processing of stored links to news feeds using the web resource markup analysis algorithm defined by the analyzer module (131), in which a link to the source web resource (110) is followed to check the link for duplication with the stored information in the database (132), as well as obtaining the HTML code for subsequent processing of text data by the parsing module (134).
- An example of extracting links from HTML code is shown in Fig. 3.
- The scraping module (133) runs continuously and iteratively processes the table of feed links from the database (132). In asynchronous mode, the module (133) runs three loops that support three types of feeds: RSS feeds in the RSS, Atom, and JSON standards (type 1); regular HTML pages (type 2); and HTML pages processed using XPATH expressions (type 3), for which the path to news links is configured manually. An example of an XPATH expression is shown in Fig. 5.
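The three concurrent loops can be sketched with `asyncio`. Everything below is illustrative: the handlers are placeholders that only tag the link with its feed type, the numeric type codes follow the numbering in the text, and the loops run for a bounded number of rounds so that the example terminates, whereas the module in the text runs continuously.

```python
import asyncio


async def process_rss(link):
    """Placeholder for type 1 feeds (RSS, Atom, JSON)."""
    return ("rss", link)


async def process_html(link):
    """Placeholder for type 2 feeds (regular HTML pages)."""
    return ("html", link)


async def process_xpath(link):
    """Placeholder for type 3 feeds (HTML pages with a configured XPATH)."""
    return ("xpath", link)


HANDLERS = {1: process_rss, 2: process_html, 3: process_xpath}


async def feed_loop(feed_type, feed_table, results, rounds):
    """Process only the links of one feed type, once per round."""
    handler = HANDLERS[feed_type]
    for _ in range(rounds):
        for link, link_type in feed_table:
            if link_type == feed_type:
                results.append(await handler(link))
        await asyncio.sleep(0)  # yield control so the other loops can run


async def run_scraper(feed_table, rounds=1):
    """Run one loop per feed type concurrently and collect their results."""
    results = []
    await asyncio.gather(
        *(feed_loop(feed_type, feed_table, results, rounds) for feed_type in HANDLERS)
    )
    return results
```

A call such as `asyncio.run(run_scraper(table))` drives all three loops over the same table of `(link, type)` pairs, and each loop skips the links that belong to the other two.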
- Each loop processes the links that correspond to its algorithm: it accesses the link to the source (PO) and analyzes the resulting HTML code to extract links to news data.
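For the XPATH-based feed type, link extraction can be sketched with the standard library. The markup, the expression, and the function name `extract_links` are made up for illustration and are not the expression shown in Fig. 5. Note the assumptions: `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so pages in the wild would likely need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET


def extract_links(markup, xpath):
    """Return the href of every element selected by the XPath expression."""
    root = ET.fromstring(markup)
    return [node.get("href") for node in root.findall(xpath) if node.get("href")]


SAMPLE_MARKUP = (
    "<html><body>"
    "<div class='news-item'><a href='/news/1'>First item</a></div>"
    "<div class='promo'><a href='/promo'>Advertisement</a></div>"
    "<div class='news-item'><a href='/news/2'>Second item</a></div>"
    "</body></html>"
)

# A manually configured path to news links, as described for type 3 feeds.
NEWS_XPATH = ".//div[@class='news-item']/a"
```

Calling `extract_links(SAMPLE_MARKUP, NEWS_XPATH)` selects only the anchors inside the `news-item` blocks and skips the promotional one.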
- An example of intermediate processing for HTML feed is shown in Fig. 6. All received links to news are checked for duplication by accessing the database (132); if the link is contained in the database (132), then it is excluded from processing, otherwise it is recorded in the database (132) and transferred for further processing.
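The duplication check can be sketched as a lookup followed by an insert. SQLite, the table name `news_links`, and the function names are assumptions made for the example; the text only states that links already in the database are excluded and new ones are recorded and passed on.

```python
import sqlite3


def open_link_store(path=":memory:"):
    """Open the store and create the (hypothetical) table of known news links."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS news_links (url TEXT PRIMARY KEY)")
    return conn


def filter_new_links(conn, links):
    """Record unseen links and return them; links already stored are dropped."""
    fresh = []
    with conn:
        for url in links:
            known = conn.execute(
                "SELECT 1 FROM news_links WHERE url = ?", (url,)
            ).fetchone()
            if known is None:
                conn.execute("INSERT INTO news_links (url) VALUES (?)", (url,))
                fresh.append(url)
    return fresh
```

Only the list returned by `filter_new_links` would be handed to the next processing stage.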
- The parsing module (134) processes the HTML code received from the scraping module (133). During the operation of the module (134), text information is extracted from the HTML code using at least two algorithms for collecting text data, each of which selects the HTML node with the largest ratio of characters characterizing the coherent text of the news source to their total number.
- An HTML node is understood as a hierarchical node of HTML markup, for example, <head>, <body>, etc.
- One of the algorithms used is based on measuring the number of non-whitespace characters in the source HTML node. Another algorithm analyzes HTML nodes based on the amount of useful text, and extracts text from the nodes that have gained more weight. By testing these algorithms on one data set, differences in sets of high-quality texts were identified.
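The first of these ideas, scoring a node by its non-whitespace text relative to its total size, can be sketched as follows. This is one possible reading, not the patented algorithm: the way markup characters are counted, the `min_chars` threshold that keeps tiny nodes from winning, and all class and function names are assumptions.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}   # their content never counts as text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.text_chars = 0    # non-whitespace characters of visible text
        self.total_chars = 0   # text plus markup characters inside the node
        self.chunks = []


class DensityParser(HTMLParser):
    """Accumulates text and markup sizes for every open node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def _count(self, text_chars, total_chars, chunk=""):
        for node in self.stack:
            node.text_chars += text_chars
            node.total_chars += total_chars
            if chunk:
                node.chunks.append(chunk)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            node = Node(tag)
            self.stack.append(node)
            self.nodes.append(node)
        self._count(0, len(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if not any(node.tag == tag for node in self.stack):
            return
        self._count(0, len(tag) + 3)  # length of "</tag>"
        while self.stack.pop().tag != tag:
            pass

    def handle_data(self, data):
        if any(node.tag in SKIPPED_TAGS for node in self.stack):
            self._count(0, len(data))
            return
        visible = sum(1 for ch in data if not ch.isspace())
        self._count(visible, len(data), data.strip())


def densest_node(html, min_chars=40):
    """Return (tag, text) of the node with the highest text-to-total ratio."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [node for node in parser.nodes if node.text_chars >= min_chars]
    if not candidates:
        return None
    best = max(candidates, key=lambda node: node.text_chars / node.total_chars)
    return best.tag, " ".join(best.chunks)
```

On a page with a navigation block and one article paragraph, the paragraph wins because almost all of its characters are text, while the enclosing containers are diluted by markup.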
- The algorithms run in parallel, and their results are compared by a machine learning model, for example a neural network trained on reference news texts taken from news sources.
- the machine learning model used within the parsing module (134) analyzes the presence of characteristics inherent in sources that are not news sources. These kinds of characteristics, as a rule, are stop words and special characters (for example, telephone numbers, a sequence of numbers, etc.).
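The characteristics named here can be turned into simple features. The sketch below is a hand-written heuristic, not the trained model from the text: the stop-word list, the regular expressions, and the additive penalty are all assumptions chosen only to show how two candidate extractions could be compared.

```python
import re

# Hypothetical stop words that tend to appear in non-news page fragments.
STOP_WORDS = {"subscribe", "cookie", "login", "advertisement"}
PHONE_RE = re.compile(r"\+?\d[\d\s()\-]{8,}\d")
DIGIT_RUN_RE = re.compile(r"\d{5,}")


def non_news_features(text):
    """Measure the share of stop words and count phone numbers and digit runs."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    word_count = max(len(words), 1)
    return {
        "stop_word_share": sum(word in STOP_WORDS for word in words) / word_count,
        "phone_numbers": len(PHONE_RE.findall(text)),
        "digit_runs": len(DIGIT_RUN_RE.findall(text)),
    }


def pick_most_news_like(candidates):
    """Return the candidate text with the lowest non-news penalty."""

    def penalty(text):
        features = non_news_features(text)
        return (
            features["stop_word_share"]
            + features["phone_numbers"]
            + features["digit_runs"]
        )

    return min(candidates, key=penalty)
```

In the described system a trained model would replace the `penalty` function; the features could still serve as part of its input.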
- the results of the above algorithms identify the most semantically coherent text, which clearly characterizes the news source.
- the resulting text is subsequently stored in a database (132) for subsequent provision to the user or transmission to an automated system for selecting news by keywords.
- An example of the extracted text is shown in Fig. 7.
- the claimed system (130) can be implemented on the basis of a single computing device (200), for example, a server.
- FIG. 8 shows a general view of such a computing device (200).
- The computing device contains one or more processors (201), memory devices such as RAM (202) and ROM (203), I/O interfaces (204), input/output devices (205), and a device for network communication (206), connected by a common information exchange bus.
- The processor (201) may be selected from a variety of widely used devices, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc.
- A graphics processor, for example from Nvidia, AMD, or Graphcore, can also be used as the processor (201).
- RAM (202) is a random access memory and is designed to store machine-readable instructions executed by the processor (201) for performing the necessary logical data processing operations.
- the RAM (202) typically contains executable operating system instructions and associated software components (applications, program modules, etc.).
- The ROM (203) is one or more permanent storage devices, such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.
- Various types of I/O interfaces (204) are used to organize the operation of the components of the device (200) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they can be, but are not limited to: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.
- Various I/O means (205) are used, for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.
- the network communication means (206) allows the device (200) to transmit data via an internal or external computer network, for example, an Intranet, the Internet, a LAN, etc.
- One or more means (206) may be used, including but not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.
- satellite navigation tools can also be used as part of the device (200), for example, GPS, GLONASS, BeiDou, Galileo.
Abstract
The invention relates to the field of computer technology. The system (130) searches the Internet (120) for domain names containing news sources (PO). An analyzer module (131) analyzes the HTML code of the web pages to discover news feeds. The type of each news feed is determined, together with a processing algorithm for extracting links to the text information of the news source. The links found to the news feeds, their type, and the processing algorithm are transferred to a database (132). A scraping module (133) processes the saved links to the news feeds using a web resource markup analysis algorithm: it follows the link to the web resource, checks the link for duplication against the information stored in the database, and obtains the HTML code. Based on the HTML code obtained, a parsing module (134) extracts text information using algorithms for collecting text data, each of which selects the HTML node with the highest ratio of characters characterizing the coherent text of the news source to their total number. The extraction results of each algorithm are processed by a machine learning model in order to analyze sources that are not news sources. The aim of the invention is to increase the accuracy of collecting and processing text information from a web page.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2022111786 | 2022-04-29 | ||
RU2022111786A RU2795678C1 (ru) | 2022-04-29 | System and method for collecting and processing news on the Internet |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023211304A1 | 2023-11-02 |
Family
ID=88519360
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2022/000146 WO2023211304A1 | 2022-04-29 | 2022-04-29 | System and method for collecting and processing news on the Internet |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023211304A1 (fr) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050114324A1 (en) * | 2003-09-14 | 2005-05-26 | Yaron Mayer | System and method for improved searching on the internet or similar networks and especially improved MetaNews and/or improved automatically generated newspapers |
US20070198459A1 (en) * | 2006-02-14 | 2007-08-23 | Boone Gary N | System and method for online information analysis |
RU2405197C2 (ru) * | 2004-02-12 | 2010-11-27 | Microsoft Corporation | Web crawling based on statistical decision theory and prediction of web page changes |
US20150106157A1 (en) * | 2013-10-15 | 2015-04-16 | Adobe Systems Incorporated | Text extraction module for contextual analysis engine |
US20190213488A1 (en) * | 2016-09-02 | 2019-07-11 | Hithink Financial Services Inc. | Systems and methods for semantic analysis based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22940413; Country of ref document: EP; Kind code of ref document: A1 |