WO2023284612A1 - Topic webpage data crawling method, apparatus, device, and storage medium - Google Patents

Topic webpage data crawling method, apparatus, device, and storage medium

Info

Publication number
WO2023284612A1
Authority
WO
WIPO (PCT)
Prior art keywords
link
links
content
target
relevance
Prior art date
Application number
PCT/CN2022/104188
Other languages
English (en)
French (fr)
Inventor
史延涛
谢永恒
火一莽
Original Assignee
北京锐安科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京锐安科技有限公司
Publication of WO2023284612A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9532: Query formulation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The embodiments of the present application relate to the field of computer technology, for example, to a method, apparatus, device, and storage medium for crawling topic webpage data.
  • The Internet is a huge collection of data, and the volume of network information resources is growing exponentially. How to effectively divide this data into relevant and irrelevant data according to the user's search query, and display the relevant data, is a current research direction.
  • The search strategies in the related art suffer from inaccurate automatic search and slow crawling of webpage data.
  • The embodiments of the present application provide a method, apparatus, device, and storage medium for crawling topic webpage data, which can optimize the topic webpage data crawling schemes of the related art.
  • An embodiment of the present application provides a topic webpage data crawling method, including: determining a target topic according to the search content input by the user, and selecting, based on a preset search strategy, links to be crawled from the queue of links to be crawled corresponding to the target topic; obtaining the webpage content corresponding to the links to be crawled; and screening target links from the links to be crawled according to content relevance and link relevance, and feeding back the target links as the search result, wherein the content relevance is determined according to the webpage content and the target topic, and the link relevance is determined according to the link to be crawled and the target topic.
  • An embodiment of the present application provides a topic webpage data crawling apparatus, including: a to-be-crawled link selection module, configured to determine a target topic according to the search content input by the user, and to select, based on a preset search strategy, links to be crawled from the queue of links to be crawled corresponding to the target topic; a webpage content acquisition module, configured to obtain the webpage content corresponding to the links to be crawled; and a target link screening module, configured to screen target links from the links to be crawled according to content relevance and link relevance, and to feed back the target links as the search result, wherein the content relevance is determined according to the webpage content and the target topic, and the link relevance is determined according to the link to be crawled and the target topic.
  • An embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, it implements the topic webpage data crawling method provided in the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the topic webpage data crawling method provided in the embodiments of the present application is implemented.
  • FIG. 1 is a schematic flowchart of a topic webpage data crawling method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another topic webpage data crawling method provided by an embodiment of the present application.
  • FIG. 3 is a structural block diagram of a topic webpage data crawling apparatus provided by an embodiment of the present application.
  • FIG. 4 is a structural block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a topic webpage data crawling method provided by an embodiment of the present application.
  • The method can be executed by a topic webpage data crawling apparatus, which can be implemented by at least one of software and hardware and can generally be integrated in a computer device such as a server.
  • The method includes:
  • S110 Determine a target topic according to the search content input by the user, and select a link to be captured from a queue of links to be captured corresponding to the target topic based on a preset search strategy.
  • Determining the target topic according to the search content input by the user can be understood as follows: the text information entered by the user when searching in a search engine is obtained, and the target topic is determined from it. The text information can be used directly as the target topic, or the target topic can be obtained after semantic analysis of the text information.
  • The target topic may be a word, a sentence, or a passage of text.
  • When a user inputs search content in the input box of a search engine, the search engine displays webpage interfaces related to the target topic.
  • Note that the server stores a large number of webpage links (Uniform Resource Locator, URL) to data information, and webpage interfaces correspond one-to-one with webpage links. Therefore, before displaying the webpage interfaces related to the target topic, the server needs to determine which of them to display.
  • When a search engine is used for searching, the large number of webpage links to data information can be managed separately according to their status. For example, if a link was crawled successfully within the historical time period, it is stored in the crawled queue; if it has not been crawled, it is stored in the waiting queue; if it was crawled within the historical time period but the crawl failed, it is stored in the error queue.
  • Whether a link is crawled successfully can be judged by whether the corresponding webpage interface is displayed after the link is crawled. If the interface is displayed, the link is considered successfully crawled. If the crawl times out, that is, the interface is not displayed within the preset time, or the returned result is empty, that is, the interface has no content, the crawl is considered to have failed.
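  • The queue assignment described above can be sketched as follows. This is a minimal illustration, not the application's implementation; the names `Queue` and `classify_link` and the three input flags are assumptions introduced here.

```python
from enum import Enum
from typing import Optional


class Queue(Enum):
    WAITING = "waiting"  # link has not been crawled yet
    CRAWLED = "crawled"  # link was crawled successfully
    ERROR = "error"      # link was crawled but the crawl failed


def classify_link(attempted: bool, timed_out: bool, body: Optional[str]) -> Queue:
    """Pick the queue for a link from the outcome of its last crawl attempt."""
    if not attempted:
        return Queue.WAITING
    # A timeout or an empty result counts as a failed crawl.
    if timed_out or not body:
        return Queue.ERROR
    return Queue.CRAWLED
```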
  • Links to be crawled can be selected from the queue based on the preset search strategy as follows: links in the queue that are relevant to the target topic can be used as links to be crawled. Optionally, there may be one or more links to be crawled.
  • For example, the preset search strategy may be that the links to be crawled contain entries related to "weather", and the webpage interfaces corresponding to the links to be crawled may be "City A one-week weather forecast", "City A weather forecast for the next 15 days", "Weather - Baidu Encyclopedia", and so on.
  • The webpage content may be obtained by extracting the important links and text in the current webpage from its Hyper Text Markup Language (HTML). Alternatively, program code may be deployed on the server to parse the target topic into keywords or key terms, so as to extract the webpage content related to them.
  • Multiple links to be crawled may be analyzed along two dimensions, content relevance and link relevance, and the target link obtained through comprehensive judgment.
  • The content relevance is determined according to the webpage content and the target topic.
  • The link relevance is determined according to the link to be crawled and the target topic.
  • Content relevance may be determined from the webpage content and the target topic by extracting keywords or key terms from the webpage content and comparing them with those of the target topic. Alternatively, the keywords or key terms in the webpage content corresponding to each link may be counted and the counts sorted from high to low: the more keywords or key terms, the higher the relevance, so that links to be crawled with higher content relevance are selected.
  • Link relevance is determined from the link to be crawled and the target topic. It may be determined by matching the keywords or key terms carried in the link address against those of the target topic, or by using search strategies such as category relationships and complex relationship calculation to determine the degree of relevance to the topic. In this way, links with higher link relevance can be obtained from the links to be crawled that have high content relevance, and the link ranked first by link relevance is taken as the target link.
  • A webpage interface related to the target link can then be displayed to the user.
  • Webpage text content and webpage links are used in combination so that each compensates for the other's weaknesses. The relevance between page content and the topic is calculated, and pages related to the topic are identified and selected as far as possible to improve accuracy.
  • The topic webpage data crawling method provided in the embodiment of the present application first determines the target topic according to the search content input by the user, and selects links to be crawled from the queue of links to be crawled corresponding to the target topic based on the preset search strategy; then obtains the corresponding webpage content according to the links to be crawled; and finally screens the target links from the links to be crawled according to content relevance and link relevance, and feeds back the target links as the search result.
  • This embodiment refines the above embodiments. The step of obtaining the webpage content corresponding to the link to be crawled is refined to include: simulating a client to send the access request corresponding to the link to be crawled to the corresponding server, and downloading the webpage file corresponding to the link according to the received access response; and parsing the webpage file to extract the webpage content in it, wherein the webpage content includes link information and text information.
  • The advantage of this arrangement is that, by downloading the webpage file corresponding to the link to be crawled, the corresponding webpage content can be parsed accurately.
  • The step of screening the target links from the links to be crawled according to content relevance and link relevance is also refined to include: for all links to be crawled, determining the content relevance according to the text information in the webpage content and the target topic, and, when the content relevance does not meet the preset content relevance requirement, storing the corresponding link in the crawled queue; for the links that meet the preset content relevance requirement, determining the link relevance according to the link information in the webpage content and the target topic, and, when the link relevance does not meet the preset link relevance requirement, storing the corresponding link in the crawled queue; and sorting the links that meet the preset link relevance requirement according to content relevance and link relevance, and screening out the target links according to the sorting result.
  • The advantage of this arrangement is that selecting as target links only the links to be crawled that satisfy both dimensions, content relevance and link relevance, improves the accuracy of the target links obtained.
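  • The two-stage screening described above might look like the sketch below. The function name `screen_links`, the score callbacks, and the thresholds are hypothetical, and sorting the survivors by the plain sum of the two scores is an assumption; the text only states that they are sorted by content relevance and link relevance.

```python
from typing import Callable, Dict, List, Tuple


def screen_links(
    links: List[str],
    content_score: Callable[[str], float],
    link_score: Callable[[str], float],
    min_content: float,
    min_link: float,
) -> Tuple[List[str], List[str]]:
    """Return (surviving links sorted best-first, links moved to the crawled queue)."""
    crawled_queue: List[str] = []
    survivors: Dict[str, float] = {}
    for url in links:
        content_rel = content_score(url)
        if content_rel < min_content:
            crawled_queue.append(url)  # fails the content relevance requirement
            continue
        link_rel = link_score(url)
        if link_rel < min_link:
            crawled_queue.append(url)  # fails the link relevance requirement
            continue
        survivors[url] = content_rel + link_rel  # assumed sort key
    ranked = sorted(survivors, key=survivors.get, reverse=True)
    return ranked, crawled_queue
```

  • Link relevance is only computed for links that pass the content stage, which mirrors the order given in the text.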
  • FIG. 2 is a schematic flowchart of another topic webpage data crawling method provided by an embodiment of the present application.
  • The method is described using webpage search as an example application scenario.
  • The method includes the following steps:
  • S210 Determine a target topic according to the search content input by the user, and select a link to be captured from a queue of links to be captured corresponding to the target topic based on a preset search strategy.
  • A client is simulated to send an access request corresponding to the link to be crawled to the corresponding server, and the webpage file corresponding to the link is downloaded according to the received access response.
  • The access request may include the request method, the request identifier, the communication protocol, and so on.
  • After the server receives the access request and responds, the webpage file corresponding to the link to be crawled is downloaded, which completes the automatic crawling of the webpage file according to the target topic.
  • A timeout mechanism is set in the webpage acquisition module, and webpages whose crawl exceeds a certain time are discarded.
  • Simulated access may be performed for each link to be crawled in turn, downloading the webpage file corresponding to each link separately.
  • Alternatively, simulated access may be performed for all current links to be crawled together, downloading the webpage files corresponding to multiple links.
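  • A simulated client request with a timeout could be sketched with Python's standard `urllib.request` as below. The function name `fetch_page`, the User-Agent string, the 5-second default, and the injectable `opener` parameter (added so the function can be exercised without network access) are assumptions, not details from the application.

```python
import urllib.request
from typing import Callable, Optional


def fetch_page(url: str, timeout_s: float = 5.0,
               opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Send a browser-like GET request; return the body, or None on failure."""
    request = urllib.request.Request(
        url,
        method="GET",
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"},
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses,
        # so a page that takes too long is simply discarded.
        return None
```

  • Calling `fetch_page` in a loop corresponds to the sequential variant; the batched variant could submit the same function to a thread pool.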
  • The server parses each downloaded webpage file separately to extract the webpage content in it.
  • The webpage content includes link information and text information.
  • The link information may be the webpage link or webpage address of the current webpage, or a hyperlink in the webpage corresponding to the link to be crawled.
  • The text information is the text content of the current webpage, which may be the title text, a passage of text, or all the text in the webpage content.
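  • Extracting link information and text information from a downloaded webpage file can be illustrated with the standard-library `html.parser` module. The class `PageContentParser` and the helper `extract_content` are hypothetical names, and skipping `<script>` and `<style>` content is a design choice of this sketch rather than something the text specifies.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class PageContentParser(HTMLParser):
    """Collect hyperlink targets and visible text from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html: str) -> Tuple[List[str], str]:
    """Return (link information, text information) for one webpage file."""
    parser = PageContentParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```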
  • For each link to be crawled, the text information in the corresponding webpage content is extracted and its content relevance to the target topic is calculated. This yields, for each link, a relevance value between the text information and the target topic, for example 20%, 50%, or 80%. The relevance values can also be divided into levels; for example, values below 10% are classified as irrelevant, values of 10%-40% as generally relevant, values of 40%-70% as moderately relevant, and values above 70% as highly relevant.
  • The preset content relevance requirement may be to select, for further analysis, the links to be crawled whose content relevance value is above 40% or whose level is moderately or highly relevant. Since content relevance has been calculated for all links to be crawled, all of them can be regarded as successfully crawled, and the links whose value is below 40%, or whose level is irrelevant or generally relevant, can be stored in the crawled queue.
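  • The example levels can be written as a small mapping. The text does not say which level the exact boundary values (10%, 40%, 70%) belong to, so the boundary handling below is an assumption, as are the function names.

```python
def relevance_level(score: float) -> str:
    """Map a content relevance value in [0, 1] to the example levels."""
    if score < 0.10:
        return "irrelevant"
    if score < 0.40:
        return "generally relevant"
    if score <= 0.70:
        return "moderately relevant"
    return "highly relevant"


def meets_content_requirement(score: float) -> bool:
    """Example requirement: only moderately or highly relevant links go on."""
    return relevance_level(score) in ("moderately relevant", "highly relevant")
```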
  • Determining the content relevance according to the text information in the webpage content and the target topic includes: preprocessing the text information in the webpage content to obtain machine language model data, and determining the content relevance according to the machine language model data and the target topic, wherein the preprocessing includes at least one of text segmentation, stop word removal, and stemming.
  • Text segmentation benefits text mining: the text information in the current webpage content is segmented into words so that semantic recognition can be performed.
  • Stop word removal: to improve the efficiency of recognizing keywords or key terms in the text information of the current webpage content, certain characters or words, such as determiners, quantifiers, or prepositions, are automatically filtered out before or after the text information is processed.
  • Stemming: after the text information in the current webpage content is segmented, the plural forms of nouns and the different tenses of verbs among the remaining words are reduced to a common stem.
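  • The three preprocessing steps can be illustrated with a deliberately simple sketch for English-like text. The stop-word list and the suffix-stripping rule are toy assumptions standing in for a real stop-word list and stemmer, and Chinese text would need a dedicated word segmenter instead of the regular expression used here.

```python
import re
from typing import List

# Tiny illustrative stop-word list; a real list would be much larger.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are"}


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # text segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_stem(t) for t in tokens]               # stemming
```

  • For instance, `preprocess("Crawling the crawled pages")` reduces both verb forms to the same stem.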
  • Keywords can also be extracted from the text information in the webpage content, the semantic similarity between the extracted keywords and the target topic calculated, frequency statistics computed for the extracted keywords, and the content relevance determined according to the frequency statistics and the semantic similarity.
  • The semantic similarity with the target topic can be calculated from the keywords and the target topic using strategies such as a matching strategy, a category relationship strategy, and complex relationship calculation. For example, when the target topic input by the user is "Travel guide for Province A", pages such as "Province A - Baidu Encyclopedia", "Self-driving travel guide for City A", and "Must-visit sights in Province A" are similar to the target topic, and corresponding similarity values can be obtained.
  • The statistical values can then be sorted: the larger the frequency statistic and the higher the semantic similarity, the more relevant the text information in the webpage content is to the target topic.
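  • One possible way to combine the two signals is sketched below. The text does not specify the similarity measure or the combination rule, so Jaccard overlap of token sets is used here only as a crude stand-in for semantic similarity, and the equal weighting is an assumption.

```python
from collections import Counter
from typing import List


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set overlap, used here as a stand-in for semantic similarity."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def content_relevance(page_tokens: List[str], topic_tokens: List[str]) -> float:
    """Average of the topic-term frequency share and the token-set similarity."""
    if not page_tokens:
        return 0.0
    counts = Counter(page_tokens)
    topic_hits = sum(counts[t] for t in set(topic_tokens))
    frequency_share = topic_hits / len(page_tokens)
    return 0.5 * frequency_share + 0.5 * jaccard(page_tokens, topic_tokens)
```

  • Sorting pages by this value in descending order gives the ranking described above.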
  • After sorting, the links to be crawled whose webpage content ranks in the first few places by content relevance can be selected for the link relevance judgment in the next dimension. Alternatively, the links whose webpage content has a content relevance greater than a certain value can be selected for that judgment.
  • For the links to be crawled that meet the preset content relevance requirement, the link relevance is determined according to the link information in the webpage content and the target topic, and, when the link relevance does not meet the preset link relevance requirement, the corresponding link is stored in the crawled queue.
  • The link relevance can be determined according to the link information in the webpage content and the target topic.
  • A link can consist of a protocol type, host name, path, file name, and so on, and its relevance can be judged from the keyword-related entries carried in the link.
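  • Judging relevance from the entries carried in a link might be sketched as follows, using `urllib.parse.urlsplit` to separate the protocol type, host name, and path. The scoring rule (fraction of topic keywords found in the link) and the name `link_relevance` are assumptions for illustration.

```python
import re
from typing import List
from urllib.parse import urlsplit


def link_relevance(url: str, topic_keywords: List[str]) -> float:
    """Fraction of topic keywords that appear in the host name or path."""
    if not topic_keywords:
        return 0.0
    parts = urlsplit(url)
    # parts.scheme is the protocol type; the host name and the path
    # (whose last segment is the file name) carry the keyword entries.
    haystack = f"{parts.hostname or ''} {parts.path}".lower()
    tokens = set(re.findall(r"[a-z0-9]+", haystack))
    hits = sum(1 for keyword in topic_keywords if keyword.lower() in tokens)
    return hits / len(topic_keywords)
```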
  • The process of judging whether the link relevance meets the preset link relevance requirement is the same as that for content relevance and is not repeated here.
  • The links to be crawled that fail the preset link relevance requirement in this second screening are also stored in the crawled queue.
  • In this way, from the first number of links to be crawled that satisfy the content relevance requirement, a second number of links to be crawled that also satisfy the link relevance requirement is obtained.
  • The second number is smaller than the first number, and the target link may be screened from this second number of links to be crawled.
  • Optionally, sorting according to content relevance and link relevance and screening out the target links according to the sorting result includes: determining the comprehensive relevance of each link according to its content relevance and link relevance; sorting the links in descending order of comprehensive relevance; and determining as target links the links whose comprehensive relevance is greater than a first preset comprehensive relevance threshold, or whose rank in the sorting is smaller than a first preset serial number.
  • The comprehensive relevance of each link can be obtained by adding its content relevance value and its link relevance value, or by assigning weights to the content relevance and the link relevance (for example, 60% for content relevance and 40% for link relevance), and so on.
  • A link whose comprehensive relevance is greater than the first preset comprehensive relevance threshold, or whose rank is smaller than the first preset serial number, can be determined as a target link.
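  • A weighted version of the comprehensive relevance, using the example 60%/40% split, could look like this. The function and parameter names are hypothetical, and ranks are assumed to start at 1, so "rank smaller than the first preset serial number" keeps the top `first_serial - 1` links.

```python
from typing import Dict, List, Optional, Tuple


def select_targets(
    scores: Dict[str, Tuple[float, float]],  # url -> (content, link) relevance
    w_content: float = 0.6,
    w_link: float = 0.4,
    threshold: Optional[float] = None,
    first_serial: Optional[int] = None,
) -> List[str]:
    """Rank links by weighted relevance and keep those passing the cut-off."""
    combined = {u: w_content * c + w_link * l for u, (c, l) in scores.items()}
    ranked = sorted(combined, key=combined.get, reverse=True)
    if threshold is not None:
        return [u for u in ranked if combined[u] > threshold]
    if first_serial is not None:
        return ranked[: first_serial - 1]  # 1-based rank < first_serial
    return ranked
```

  • Setting both weights to 1 reproduces the simple-sum variant mentioned above.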
  • The embodiment of the present application also provides an alternative. Screening out the target links according to content relevance and link relevance may further include: determining as candidate links the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than a second preset comprehensive relevance threshold, or whose rank is greater than or equal to the first preset serial number and smaller than a second preset serial number; determining a new target topic according to new search content input by the user; and, when the new target topic is the same as the target topic, screening new target links from the candidate links and feeding back the new target links as the current search result.
  • That is, the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than the second preset comprehensive relevance threshold, or whose rank is greater than or equal to the first preset serial number and smaller than the second preset serial number (for example, 10), can be determined as candidate links.
  • Based on the current target topic, the candidate links are screened again, a new target link is selected, and the new target link is fed back as the current search result.
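  • The candidate-link idea can be sketched as a rank split plus a small per-topic cache. Both `split_by_rank` and `CandidateCache` are hypothetical helpers, ranks are assumed to be 1-based, and serving cached candidates in their original order is an assumption about how "screening again" might work.

```python
from typing import Dict, List, Tuple


def split_by_rank(ranked: List[str], first_serial: int, second_serial: int
                  ) -> Tuple[List[str], List[str]]:
    """Split a best-first list into target links and candidate links."""
    targets = ranked[: first_serial - 1]                       # rank < first
    candidates = ranked[first_serial - 1 : second_serial - 1]  # first <= rank < second
    return targets, candidates


class CandidateCache:
    """Remember candidates per topic so a repeated search can reuse them."""

    def __init__(self) -> None:
        self._by_topic: Dict[str, List[str]] = {}

    def store(self, topic: str, candidates: List[str]) -> None:
        self._by_topic[topic] = list(candidates)

    def next_targets(self, new_topic: str, count: int) -> List[str]:
        # Candidates are reused only when the new topic equals a stored one.
        pool = self._by_topic.get(new_topic, [])
        picked, self._by_topic[new_topic] = pool[:count], pool[count:]
        return picked
```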
  • Another alternative is to feed back the link ranked first by comprehensive relevance as the target link. If the user is not satisfied with the corresponding webpage content, the link ranked second can be fed back as a new target link, and so on.
  • The current link can be stored in the crawled queue, and the information in the webpage content corresponding to the target link can be stored in a file or database, in preparation for the search engine's retrieval function.
  • Selecting links to be crawled from the queue of links to be crawled corresponding to the target topic based on the preset search strategy includes: selecting candidate links to be crawled from the queue based on the preset search strategy; judging whether the candidate links include target candidate links to be crawled; and, when they do, filtering the target candidate links out of the candidate links to obtain the links to be crawled.
  • The target candidate links to be crawled are the candidate links that have been determined as links to be crawled more than a preset number of times within the most recent preset time period.
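  • Filtering out links that were selected too often in the recent period could be tracked with a sliding window, as in this sketch. The class `SelectionHistory`, its parameters, and the use of wall-clock seconds are assumptions introduced for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional


class SelectionHistory:
    """Track when each link was selected as a link to be crawled."""

    def __init__(self, window_s: float, max_count: int) -> None:
        self.window_s = window_s    # length of the most recent time period
        self.max_count = max_count  # preset number-of-times threshold
        self._times: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, url: str, now: Optional[float] = None) -> None:
        self._times[url].append(time.time() if now is None else now)

    def filter(self, candidates: List[str], now: Optional[float] = None) -> List[str]:
        """Drop candidates selected more than max_count times in the window."""
        now = time.time() if now is None else now
        kept = []
        for url in candidates:
            times = self._times[url]
            while times and now - times[0] > self.window_s:
                times.popleft()  # forget selections older than the window
            if len(times) <= self.max_count:
                kept.append(url)
        return kept
```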
  • The topic webpage data crawling method optimizes and rationally formulates the search strategy, preprocesses the text information in the webpage content to convert it into a machine language model, and analyzes and screens webpages through the links to be crawled. This addresses the judgment of relevance between the target link and the target topic and between the target page content and the target topic, and improves the precision, recall, and efficiency of the search engine when searching by target topic.
  • Before the search engine crawls the target link, judging the content relevance and link relevance between the links to be crawled and the target topic lets the automatic indexing system select as many topic-related webpages as possible and reduces the modeling of irrelevant webpages. The results returned when the target topic is automatically indexed therefore have high accuracy, and, compared with the search methods in the related art, effective information can be obtained accurately.
  • FIG. 3 is a structural block diagram of a topic webpage data crawling apparatus provided by an embodiment of the present application.
  • The apparatus can be implemented by at least one of software and hardware, and can generally be integrated in a computer device such as a server.
  • The apparatus crawls topic webpage data by executing the topic webpage data crawling method.
  • The apparatus includes a to-be-crawled link selection module 31, a webpage content acquisition module 32, and a target link screening module 33, wherein:
  • The to-be-crawled link selection module 31 is configured to determine the target topic according to the search content input by the user, and to select links to be crawled from the queue of links to be crawled corresponding to the target topic based on a preset search strategy.
  • The webpage content acquisition module 32 is configured to acquire the webpage content corresponding to the links to be crawled.
  • The target link screening module 33 is configured to screen the target links from the links to be crawled according to content relevance and link relevance, and to feed back the target links as the search result, wherein the content relevance is determined according to the webpage content and the target topic, and the link relevance is determined according to the link to be crawled and the target topic.
  • The topic webpage data crawling apparatus provided in the embodiment of the present application first determines the target topic according to the search content input by the user, and selects links to be crawled from the queue of links to be crawled corresponding to the target topic based on the preset search strategy; then obtains the corresponding webpage content according to the links to be crawled; and finally screens the target links from the links to be crawled according to content relevance and link relevance, and feeds back the target links as the search result.
  • The webpage content acquisition module 32 includes a webpage file download unit and a webpage content extraction unit.
  • The webpage file download unit is configured to simulate a client sending an access request corresponding to the link to be crawled to the corresponding server, and to download the webpage file corresponding to the link according to the received access response.
  • The webpage content extraction unit is configured to parse the webpage file to extract the webpage content in it, wherein the webpage content includes link information and text information.
  • The target link screening module 33 includes a content relevance determination unit, a link relevance determination unit, a crawled link storage unit, and a target link screening unit.
  • The content relevance determination unit is configured to determine, for all links to be crawled, the content relevance according to the text information in the webpage content and the target topic, and, when the content relevance does not meet the preset content relevance requirement, to store the corresponding link in the crawled queue.
  • The link relevance determination unit is configured to determine, for the links to be crawled that meet the preset content relevance requirement, the link relevance according to the link information in the webpage content and the target topic, and, when the link relevance does not meet the preset link relevance requirement, to store the corresponding link in the crawled queue.
  • The target link screening unit is configured to sort the links to be crawled that meet the preset link relevance requirement according to content relevance and link relevance, and to screen out the target links according to the sorting result.
  • The content relevance determination unit is also configured to implement at least one of the following steps: preprocessing the text information in the webpage content to obtain machine language model data, and determining the content relevance according to the machine language model data and the target topic, wherein the preprocessing includes at least one of text segmentation, stop word removal, and stemming; and extracting keywords from the text information in the webpage content, calculating the semantic similarity between the extracted keywords and the target topic, computing frequency statistics for the extracted keywords, and determining the content relevance according to the frequency statistics and the semantic similarity.
  • The target link screening unit includes: a comprehensive relevance determination subunit and a target link determination subunit;
  • The comprehensive relevance determination subunit is configured to determine the comprehensive relevance corresponding to each link according to the content relevance and the link relevance.
  • The target link determination subunit is configured to sort the links in descending order of comprehensive relevance, and to determine as target links the links whose comprehensive relevance is greater than the first preset comprehensive relevance threshold or whose comprehensive relevance rank is smaller than the first preset rank.
  • The target link screening unit further includes: a candidate link determination subunit, a target topic determination subunit, and a target link feedback subunit;
  • The candidate link determination subunit is configured to determine as candidate links the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than the second preset comprehensive relevance threshold, or whose comprehensive relevance rank is greater than or equal to the first preset rank and smaller than the second preset rank.
  • The target topic determination subunit is configured to determine a new target topic according to new search content input by the user.
  • The target link feedback subunit is configured to, based on a judgment result that the new target topic is the same as the target topic, screen a new target link from the candidate links and feed back the new target link as the result of the current search.
  • The to-be-crawled link selection module 31 includes: a candidate to-be-crawled link selection unit and a target candidate to-be-crawled link filtering unit;
  • The candidate to-be-crawled link selection unit is configured to select candidate links to be crawled from the to-be-crawled link queue corresponding to the target topic based on a preset search strategy.
  • The target candidate to-be-crawled link filtering unit is configured to judge whether the candidate links to be crawled contain target candidate links to be crawled, and, based on a judgment result that they do, to filter out the target candidate links contained in the candidate links to obtain the links to be crawled; the target candidate links to be crawled include candidate links that have been determined as links to be crawled more than a preset count threshold number of times within the most recent preset time period.
  • The topic web page data crawling apparatus provided in the embodiments of the present application can execute the topic web page data crawling method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.
  • FIG. 4 is a structural block diagram of a computer device provided by an embodiment of the present application.
  • The computer device 40 may include: a memory 41, a processor 42, and a computer program stored on the memory 41 and executable by the processor 42.
  • When the processor 42 executes the computer program, the topic web page data crawling method described in the embodiments of the present application is implemented.
  • The computer device provided in the embodiments of the present application can execute the topic web page data crawling method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.
  • The embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the topic web page data crawling method.
  • A storage medium refers to any of various types of memory devices or storage devices.
  • The term "storage medium" may include: installation media, such as Compact Disc Read Only Memory (CD-ROM), floppy disks, or tape devices; computer system memory or random access memory (RAM), such as dynamic RAM (DRAM), double data rate RAM (DDR RAM), static RAM (SRAM), extended data output RAM (EDO RAM), Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (e.g., hard disk or optical storage); and registers or other similar types of memory elements.
  • The storage medium may also include other types of memory or combinations thereof.
  • The storage medium may be located in a first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network such as the Internet.
  • The second computer system may provide program instructions to the first computer for execution.
  • The term "storage medium" may include two or more storage media that reside in different locations, for example in different computer systems connected by a network.
  • A storage medium may store program instructions (e.g., implemented as a computer program) that are executable by one or more processors.
  • In the storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the topic web page data crawling operations described above, and can also execute related operations in the topic web page data crawling method provided in any embodiment of the present application.
  • The topic web page data crawling apparatus, device, and storage medium provided in the above embodiments can execute the topic web page data crawling method provided in any embodiment of the present application, and have the functional modules and beneficial effects corresponding to executing the method.
  • For technical details not described exhaustively in the above embodiments, refer to the topic web page data crawling method provided in any embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

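The embodiments of the present application disclose a topic web page data crawling method, apparatus, device, and storage medium. The method includes: determining a target topic according to search content input by a user, and selecting links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy; acquiring the web page content corresponding to the links to be crawled; and screening target links from the links to be crawled according to content relevance and link relevance, and feeding back the target links as search results.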

Description

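Topic web page data crawling method, apparatus, device, and storage medium
The present disclosure claims priority to Chinese Patent Application No. 202110793519.X, filed with the China Patent Office on July 14, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technology, for example, to a topic web page data crawling method, apparatus, device, and storage medium.
Background
The Internet is a vast collection of data, and network information resources are growing exponentially. How to effectively divide this vast amount of data into relevant and irrelevant data according to a user's search query, and to present the relevant data, is a current research direction.
When a user searches with a search engine in the related art, only rough search results can be provided. Moreover, search strategies in the related art that are based on web page content evaluation tend to ignore the relevance of links between web pages, while search strategies based on link analysis ignore the body content of web pages, which easily causes "topic drift".
Search strategies in the related art therefore suffer from imprecise automatic search and slow web page data crawling.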
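Summary
The embodiments of the present application provide a topic web page data crawling method, apparatus, device, and storage medium, which can optimize the topic web page data crawling schemes of the related art.
In a first aspect, an embodiment of the present application provides a topic web page data crawling method, including: determining a target topic according to search content input by a user, and selecting links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy; acquiring the web page content corresponding to the links to be crawled; and screening target links from the links to be crawled according to content relevance and link relevance, and feeding back the target links as search results, where the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.
In a second aspect, an embodiment of the present application provides a topic web page data crawling apparatus, including: a to-be-crawled link selection module, configured to determine a target topic according to search content input by a user, and to select links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy; a web page content acquisition module, configured to acquire the web page content corresponding to the links to be crawled; and a target link screening module, configured to screen target links from the links to be crawled according to content relevance and link relevance, and to feed back the target links as search results, where the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the topic web page data crawling method provided by the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the topic web page data crawling method provided by the embodiments of the present application.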
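Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a topic web page data crawling method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another topic web page data crawling method provided by an embodiment of the present application;
FIG. 3 is a structural block diagram of a topic web page data crawling apparatus provided by an embodiment of the present application;
FIG. 4 is a structural block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions of the present application are described below with reference to the drawings and through specific embodiments. It can be understood that the embodiments described here are only used to explain the present application. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.
Before the exemplary embodiments are discussed, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps can be performed in parallel, concurrently, or simultaneously. In addition, the order of the steps can be rearranged. The processing may be terminated when its operations are completed, but may also have additional steps not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram, and so on.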
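Embodiment 1
FIG. 1 is a schematic flowchart of a topic web page data crawling method provided by an embodiment of the present application. The method can be executed by a topic web page data crawling apparatus, which can be implemented by at least one of software and hardware and can generally be integrated in a computer device such as a server. As shown in FIG. 1, the method includes:
S110. Determine a target topic according to search content input by a user, and select links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy.
Determining the target topic according to the search content input by the user can be understood as follows: the search content is the text the user enters when searching with a search engine, and the target topic is determined from that text. The text may be used directly as the target topic, or the target topic may be obtained by performing semantic analysis on the text. The target topic may be a word, a sentence, a passage of text, or similar information.
When the user enters search content in the input box of the search engine, the search engine displays web pages related to the target topic. Note that the server stores a large number of web page links (Uniform Resource Locators, URLs) to data, and each web page corresponds one-to-one to a web page link. Therefore, before displaying web pages related to the target topic, the server needs to decide which of those pages to display.
In an embodiment, when a search engine is used to search, the large number of web page links generated can be managed separately according to link status, which makes them easier to manage. For example, if the current link was successfully crawled within a historical time period, it is stored in the crawled queue; if the current link has not been crawled, it is stored in the to-be-crawled queue; if the current link was crawled within a historical time period but the crawl failed, it is stored in the error queue.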
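As a non-limiting illustration, the status-based link management described above might be sketched in Python as follows. The names `LinkQueues`, `CrawlOutcome`, and `record` are hypothetical and do not appear in the application; the sketch only shows one way the three queues could be kept.

```python
from collections import deque
from enum import Enum


class CrawlOutcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"  # e.g. a timeout or an empty result


class LinkQueues:
    """Keeps links in one of three queues according to their crawl status."""

    def __init__(self, seeds=()):
        self.to_crawl = deque(seeds)  # links that have not been crawled yet
        self.crawled = deque()        # links that were crawled successfully
        self.errors = deque()         # links whose crawl failed

    def record(self, url, outcome):
        """Move a link out of the to-be-crawled queue according to its outcome."""
        if url in self.to_crawl:
            self.to_crawl.remove(url)
        if outcome is CrawlOutcome.SUCCESS:
            self.crawled.append(url)
        else:
            self.errors.append(url)
```

A caller would construct `LinkQueues` with the initial links and call `record` once per crawl attempt.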
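It should be noted that whether a link was crawled successfully can be judged by whether the corresponding web page is displayed successfully after the link is crawled. If the page is displayed successfully, the current link is considered successfully crawled. If the crawl times out, that is, the page is not displayed within a preset time, or if the returned result is empty, that is, the page has no content, the crawl of the current link is considered to have failed.
When the user searches for the target topic, links to be crawled can be selected from the to-be-crawled link queue based on the preset search strategy as follows: any web page link whose keyword entries are judged to be related to the target topic can be taken as a link to be crawled. Optionally, there may be one or more links to be crawled.
For example, if the target topic is "weather forecast", the preset search strategy may be that the link to be crawled contains entries related to "weather"; the web pages corresponding to the links to be crawled may then be "One-week weather forecast for City A", "15-day weather forecast for City A", "Weather - Baidu Baike", and so on.
S120. Acquire the web page content corresponding to the links to be crawled.
The web page content can be obtained by parsing the web page corresponding to a link to be crawled. Optionally, the web page content may be obtained by extracting the important links and text of the current page from its HyperText Markup Language (HTML); alternatively, computer program code may be provided in the server that parses the target topic into keyword information and extracts the web page content related to that keyword information.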
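S130. Screen target links from the links to be crawled according to content relevance and link relevance, and feed back the target links as search results.
Optionally, multiple links to be crawled can each be analyzed along two dimensions, content relevance and link relevance, and the target link is obtained by a combined judgment. The content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the link to be crawled and the target topic.
In an embodiment, the content relevance can be determined from the web page content and the target topic by extracting keyword information from the web page content and comparing it with the keywords of the target topic, or by counting the keyword information in the web page content corresponding to each link to be crawled and sorting the counts from high to low. The more keywords a page contains, the higher its relevance, so links to be crawled with higher content relevance can be screened out.
On the basis of the links to be crawled with higher content relevance, the link relevance can be determined from the link to be crawled and the target topic. The link relevance can be determined by matching the keyword information carried in the link address against the keywords of the target topic, or by search strategies such as category relationships and complex relationship calculation, to determine how relevant the link is to the topic. In this way, links with higher link relevance are obtained from among the links with higher content relevance, and the link to be crawled that ranks first in link relevance is taken as the target link.
After the target link is fed back, the web page related to the target link can be displayed to the user. The embodiment of the present application judges the content relevance and link relevance between the links to be crawled and the target topic, and combines a content-evaluation-based algorithm with a link-analysis-based algorithm to consider both page content and the link relationships between pages. Web page text content and web page links are used together so that each compensates for the other's weaknesses; the relevance between a page and the topic is calculated, and pages related to the topic are selected as well as possible, which improves accuracy.
The topic web page data crawling method provided in the embodiment of the present application first determines a target topic according to search content input by a user, and selects links to be crawled from the to-be-crawled link queue corresponding to the target topic based on a preset search strategy; then acquires the corresponding web page content according to the links to be crawled; and finally screens target links from the links to be crawled according to content relevance and link relevance, and feeds back the target links as search results. With the above technical solution, web page content and web page links are combined to judge content relevance and link relevance, and target links are then screened from the links to be crawled, which achieves the technical effects of improving search precision and search efficiency.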
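Embodiment 2
This embodiment modifies the above embodiment. The step of acquiring the web page content corresponding to the links to be crawled is modified to include: simulating a client to send an access request corresponding to the link to be crawled to the corresponding server, and downloading the web page file corresponding to the link to be crawled according to the received access response; and parsing the web page file to extract the web page content in it, where the web page content includes link information and text information. The advantage of this arrangement is that downloading the web page file corresponding to the link to be crawled allows the corresponding web page content to be parsed precisely.
In this embodiment, the step of screening target links from the links to be crawled according to content relevance and link relevance is also modified to include: for all links to be crawled, determining the content relevance according to the text information in the web page content and the target topic, and, based on a judgment result that the content relevance does not meet a preset content relevance requirement, storing the corresponding link to be crawled in the crawled queue; for links to be crawled that meet the preset content relevance requirement, determining the link relevance according to the link information in the web page content and the target topic, and, based on a judgment result that the link relevance does not meet a preset link relevance requirement, storing the corresponding link to be crawled in the crawled queue; and sorting the links to be crawled that meet the preset link relevance requirement according to content relevance and link relevance, and screening out the target links according to the sorting result. The advantage of this arrangement is that selecting as target links the links that satisfy both the content relevance and link relevance dimensions improves the precision with which target links are obtained.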
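FIG. 2 is a schematic flowchart of another topic web page data crawling method provided by an embodiment of the present application. The method is described taking web search as an example application scenario, and includes the following steps:
S210. Determine a target topic according to search content input by a user, and select links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy.
S220. Simulate a client to send an access request corresponding to the link to be crawled to the corresponding server, and download the web page file corresponding to the link to be crawled according to the received access response.
Before the web page content is acquired, a client needs to be simulated inside the server to send the access request corresponding to the link to be crawled to the corresponding server side. The access request may include the request method for the link to be crawled, an access request identifier, the communication protocol in the current server, and so on. After the server side receives and responds to the access request, the web page file corresponding to the link to be crawled is downloaded, which completes the automatic crawling of the web page files corresponding to the links to be crawled for the target topic. At the same time, to ensure normal operation and efficiency when handling the links to be crawled, and to prevent the same web page from being crawled, a timeout mechanism is set in the web page acquisition module, and web pages that exceed a certain crawl time are discarded.
Optionally, each link to be crawled can be accessed by simulation in turn, and the web page file corresponding to each link is downloaded separately. To speed up web page content acquisition, all current links to be crawled can also be accessed by simulation together, so that the web page files corresponding to multiple links to be crawled are downloaded.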
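The simulated access request with a timeout described above might look like the following minimal Python sketch, which uses only the standard `urllib.request` module. The function names, the `User-Agent` string, and the 10-second timeout are assumptions made for illustration; the application does not specify them.

```python
import urllib.request

# Hypothetical value: the application only says pages exceeding
# "a certain crawl time" are discarded.
DEFAULT_TIMEOUT_S = 10


def build_request(url, user_agent="topic-crawler-sketch/0.1"):
    """Build a GET request that identifies itself the way a client would."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="GET"
    )


def download_page(url, timeout=DEFAULT_TIMEOUT_S):
    """Return the page bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
```

Returning `None` on failure lets the caller decide whether the link belongs in an error queue.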
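S230. Parse the web page file to extract the web page content in the web page file.
The server parses each downloaded web page file to extract the web page content in it. The web page content includes link information and text information.
In an embodiment, the link information may be the web page link or web page address of the current page, or a hyperlink in the page corresponding to the link to be crawled. The text information is the text content contained in the current page, which may be text title information, a passage of text, or all the text contained in the web page content.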
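One possible way to extract the link information and text information from a downloaded web page file is sketched below with Python's standard `html.parser` module. The class and function names are hypothetical, and skipping `script` and `style` content is an assumption of this sketch rather than something the application states.

```python
from html.parser import HTMLParser


class PageContentExtractor(HTMLParser):
    """Collects hyperlinks (link information) and visible text (text information)."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs and anything inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (links, text) for one HTML document given as a string."""
    parser = PageContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

The returned links are left exactly as written in the page; resolving relative links against the page address would be a separate step.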
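S240. For all links to be crawled, determine the content relevance according to the text information in the web page content and the target topic, and, based on a judgment result that the content relevance does not meet the preset content relevance requirement, store the corresponding link to be crawled in the crawled queue.
For all links to be crawled, the text information in the corresponding web page content is extracted and its content relevance to the target topic is calculated. After the calculation for each link to be crawled, a relevance value between the text information in its web page content and the target topic is obtained. For example, the content relevance between the current link to be crawled and the target topic may be 20%, 50%, or 80%. The relevance values can also be divided into relevance levels, for example: below 10% is irrelevant, 10% to 40% is generally relevant, 40% to 70% is moderately relevant, and above 70% is highly relevant.
Accordingly, the preset content relevance requirement may be to select for analysis the links to be crawled whose content relevance value is above 40%, or whose relevance level is moderately relevant or highly relevant. Since the content relevance has been calculated for all links to be crawled, all of them can be regarded as links that have been successfully crawled, and the links whose content relevance value is below 40%, or whose relevance level is irrelevant or generally relevant, can be stored in the crawled queue.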
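The example relevance levels and the 40% requirement above can be written as a small Python sketch. The function names are hypothetical, scores are assumed to be fractions in the range 0 to 1, and the treatment of values that fall exactly on a boundary (10%, 40%, 70%) is an assumption, since the application does not say which side they belong to.

```python
def relevance_level(score):
    """Map a content relevance score in [0, 1] to the example levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < 0.10:
        return "irrelevant"
    if score < 0.40:
        return "generally relevant"
    if score < 0.70:
        return "moderately relevant"
    return "highly relevant"


def meets_content_requirement(score, threshold=0.40):
    """True if the link should go on to the link relevance check."""
    return score >= threshold
```

Links for which `meets_content_requirement` is false would be moved to the crawled queue without further analysis.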
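It should be noted that the content relevance values or relevance levels in the present application can be set according to the actual needs of developers.
Optionally, determining the content relevance according to the text information in the web page content and the target topic includes: preprocessing the text information in the web page content to obtain machine language model data, and determining the content relevance according to the machine language model data and the target topic, where the preprocessing includes at least one of text segmentation, stop-word removal, and stemming.
When the text information in the web page content is preprocessed, at least one of text segmentation, stop-word removal, and stemming can be used. Text segmentation facilitates text mining: successfully segmenting the text information in the current web page content into words makes it possible to recognize its semantics. Stop-word removal automatically filters out certain characters or words, such as determiners, quantifiers, or prepositions, before or after the text information is recognized, in order to make keyword recognition more efficient. Stemming, applied after the text has been segmented, removes the plural forms of nouns, the different tenses of verbs, and so on from the remaining words.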
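The three preprocessing steps can be illustrated with the following Python sketch. It is deliberately simplified and rests on several assumptions: it handles English text only (Chinese text would need a dedicated word segmenter), the stop-word list is a tiny illustrative sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm. None of these names come from the application.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (text segmentation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Segment, remove stop words, then stem the remaining tokens."""
    return [naive_stem(w) for w in tokenize(text) if w not in STOP_WORDS]
```

For instance, `preprocess("Raining in the mountains")` yields `["rain", "mountain"]`.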
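Besides preprocessing the text information in the web page content with the above algorithms, keywords can also be extracted from the text information in the web page content, the semantic similarity between the extracted keywords and the target topic can be calculated, frequency statistics can be computed for the extracted keywords, and the content relevance can be determined according to the frequency statistics and the semantic similarity.
Optionally, the semantic similarity can be calculated between the keywords and the target topic using multiple strategies such as a matching strategy, a category relationship strategy, and complex relationship calculation. For example, when the target topic input by the user is "Province A travel guide", then in the semantic similarity calculation, "Province A - Baidu Baike", "Province A self-driving travel guide", and "Must-see attractions in Province A" are all similar to the target topic, and corresponding similarity values can be obtained.
In an embodiment, after frequency statistics are computed for the keywords extracted from the text information in the current web page content, the statistics can be sorted. The larger the frequency value and the higher the semantic similarity, the higher the content relevance between the text information in the web page content and the target topic.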
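The application does not give a formula for combining keyword frequency with semantic similarity, so the following Python sketch is only one assumed possibility. It takes the most frequent tokens of a page as its keywords and uses exact matching against topic terms as a stand-in for semantic similarity (1.0 for a match, 0.0 otherwise); the function name and the `top_k` parameter are hypothetical.

```python
from collections import Counter


def content_relevance(page_tokens, topic_tokens, top_k=10):
    """Frequency-weighted share of the page's top keywords that match the topic."""
    if not page_tokens or not topic_tokens:
        return 0.0
    topic = set(topic_tokens)
    keywords = Counter(page_tokens).most_common(top_k)  # (keyword, frequency)
    total = sum(freq for _, freq in keywords)
    # Stand-in similarity: a keyword counts only if it is itself a topic term.
    matched = sum(freq for word, freq in keywords if word in topic)
    return matched / total
```

A real semantic similarity measure would replace the exact-match test while leaving the frequency weighting unchanged.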
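Optionally, the links to be crawled whose web page content ranks among the top few (for example, the top 10) in content relevance can be selected for the next dimension of judgment, link relevance; alternatively, the links to be crawled whose web page content has a content relevance greater than a certain value (for example, greater than 70%) can be selected for the link relevance judgment.
S250. For links to be crawled that meet the preset content relevance requirement, determine the link relevance according to the link information in the web page content and the target topic, and, based on a judgment result that the link relevance does not meet the preset link relevance requirement, store the corresponding link to be crawled in the crawled queue.
On the basis of the links to be crawled that meet the preset content relevance requirement, the link relevance can be determined according to the link information in the web page content and the target topic. A link may consist of information such as the protocol type, host name, path, and file name, so the relevance can be judged from the keyword-related entries carried in the link.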
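Judging relevance from the terms carried in a link's host name, path, and file name might be sketched as follows using the standard `urllib.parse` module. The scoring rule (the fraction of topic terms that appear in the URL) and the function names are assumptions of this sketch, not the application's method.

```python
import re
from urllib.parse import unquote, urlsplit


def url_terms(url):
    """Return the lowercase alphanumeric terms in the host name and path."""
    parts = urlsplit(url)
    raw = " ".join([parts.hostname or "", unquote(parts.path)])
    return set(re.findall(r"[a-z0-9]+", raw.lower()))


def link_relevance(url, topic_tokens):
    """Fraction of topic terms that appear in the URL, in [0, 1]."""
    topic = {t.lower() for t in topic_tokens}
    if not topic:
        return 0.0
    return len(topic & url_terms(url)) / len(topic)
```

With the topic terms `["weather", "forecast"]`, a URL such as `https://www.example.com/weather/forecast.html` scores 1.0 and `https://example.com/weather/today` scores 0.5.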
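In an embodiment, the process of judging whether the link relevance meets the preset link relevance requirement is the same as the process of judging whether the content relevance requirement is met, and is not repeated here. After the judgment, the links to be crawled that do not meet the preset link relevance requirement after this second screening are also stored in the crawled queue.
S260. Sort the links to be crawled that meet the preset link relevance requirement according to content relevance and link relevance, and screen out the target links according to the sorting result.
According to S240, a first number of links to be crawled that satisfy the content relevance are obtained; according to S250, a second number of links to be crawled that also satisfy the link relevance are obtained from among them. Optionally, the second number is smaller than the first number, and the target links can be screened from the second number of links to be crawled.
In one optional solution, sorting according to content relevance and link relevance and screening out the target links according to the sorting result includes: determining the comprehensive relevance corresponding to each link according to the content relevance and the link relevance; and sorting in descending order of comprehensive relevance, and determining as target links the links whose comprehensive relevance is greater than a first preset comprehensive relevance threshold or whose comprehensive relevance rank is smaller than a first preset rank.
Optionally, when the comprehensive relevance corresponding to each link is determined according to the content relevance and the link relevance, it can be obtained by adding the content relevance value and the link relevance value of the current link, or by assigning weights to the content relevance and the link relevance (for example, 60% for content relevance and 40% for link relevance).
Thus, the links whose comprehensive relevance is greater than the first preset comprehensive relevance threshold, or whose comprehensive relevance rank is smaller than the first preset rank, can be determined as target links.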
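The weighted combination and the two selection rules can be illustrated with a short Python sketch. The 60%/40% weights come from the example in the text; the function names and the `threshold` and `top_n` parameters are hypothetical. Here `top_n` is a count, so it corresponds to 1-based ranks smaller than `top_n + 1`.

```python
def comprehensive_relevance(content, link, w_content=0.6, w_link=0.4):
    """Weighted sum of content relevance and link relevance."""
    return w_content * content + w_link * link


def select_targets(scored_links, threshold=None, top_n=None):
    """Rank links by comprehensive relevance and pick the target links.

    scored_links: list of (url, content_relevance, link_relevance).
    """
    ranked = sorted(
        ((url, comprehensive_relevance(c, l)) for url, c, l in scored_links),
        key=lambda item: item[1],
        reverse=True,
    )
    if threshold is not None:
        return [url for url, score in ranked if score > threshold]
    if top_n is not None:
        return [url for url, _ in ranked[:top_n]]
    return [url for url, _ in ranked]
```

Passing `top_n=1` reproduces the simplest variant, in which only the top-ranked link is returned.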
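Accordingly, the link that ranks first by the calculated comprehensive relevance value can also be directly determined as the target link.
In another optional solution, for the case where the user is not satisfied with the corresponding web page content after the target link obtained through the above solution has been fed back, sorting according to content relevance and link relevance and screening out the target links according to the sorting result may further include: determining as candidate links the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than a second preset comprehensive relevance threshold, or whose comprehensive relevance rank is greater than or equal to the first preset rank and smaller than a second preset rank; determining a new target topic according to new search content input by the user; and, based on a judgment result that the new target topic is the same as the target topic, screening a new target link from the candidate links and feeding back the new target link as the result of the current search.
That is, after the candidate links are sorted by comprehensive relevance, the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than the second preset comprehensive relevance threshold (for example, 70%), or whose comprehensive relevance rank is greater than or equal to the first preset rank and smaller than the second preset rank (for example, 10), can be determined as candidate links. Screening is then performed again among the candidate links according to the current target topic, a new target link is selected, and the new target link is fed back as the result of the current search.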
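A minimal Python sketch of the two-threshold split and the repeat-search fallback is given below. The function names are hypothetical, and choosing the highest-scoring candidate as the new target link is an assumption; the application only says that a new target link is screened from the candidate links.

```python
def split_targets_and_candidates(scored, first_threshold, second_threshold):
    """Split (url, comprehensive_relevance) pairs into targets and candidates."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be below first_threshold")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    targets = [url for url, score in ranked if score > first_threshold]
    candidates = [
        url for url, score in ranked
        if second_threshold < score <= first_threshold
    ]
    return targets, candidates


def next_result(new_topic, previous_topic, candidates):
    """Serve from the candidate pool only when the topic is unchanged."""
    if new_topic == previous_topic and candidates:
        return candidates[0]
    return None
```

When `next_result` returns `None`, the caller would fall back to a fresh search for the new topic.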
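In another optional solution, when the link that ranks first by the calculated comprehensive relevance value has been determined as the target link and fed back, and the user is not satisfied with the corresponding web page content, the link that ranks second in comprehensive relevance can be fed back as the new target link, and so on.
S270. Feed back the target link as the search result.
After the target link is fed back as the search result, the current link can be stored in the crawled queue, and the information contained in the web page content corresponding to the target link is stored in a file or database, in preparation for the search engine's retrieval function.
The embodiment of the present application further provides an optional solution in which selecting links to be crawled from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy includes: selecting candidate links to be crawled from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy; and judging whether the candidate links to be crawled contain target candidate links to be crawled, and, based on a judgment result that they do, filtering out the target candidate links contained in the candidate links to obtain the links to be crawled. The target candidate links to be crawled include candidate links that have been determined as links to be crawled more than a preset count threshold number of times within the most recent preset time period.
When the user searches again based on content similar to the target topic, from a data security perspective, and to ensure that automatic search works normally and efficiently, crawling the same web page many times triggers a corresponding early-warning mechanism. Therefore, target candidate links to be crawled whose crawl count has already exceeded the preset count threshold need to be filtered out.
First, candidate links to be crawled are selected from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy. Then it is judged whether the candidate links contain target candidate links to be crawled, where a target candidate link to be crawled can be understood as a link that has been crawled before but whose web page content the user was not satisfied with, or a link whose crawl count has exceeded the preset count threshold. Therefore, the target candidate links contained in the candidate links need to be filtered out, and the links remaining after the target candidate links are removed are called the links to be crawled.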
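Filtering out links that were selected too often within a recent time window might be sketched in Python as follows. The class name `SelectionHistory`, its parameters, and the choice to record a selection at filter time are assumptions made for illustration; the application specifies only the count threshold and the preset time period.

```python
import time
from collections import defaultdict, deque


class SelectionHistory:
    """Tracks how often each link was recently selected for crawling."""

    def __init__(self, window_s, max_count):
        self.window_s = window_s    # the "most recent preset time period"
        self.max_count = max_count  # the "preset count threshold"
        self._times = defaultdict(deque)

    def _recent_count(self, url, now):
        times = self._times[url]
        while times and now - times[0] > self.window_s:
            times.popleft()  # forget selections older than the window
        return len(times)

    def filter(self, candidates, now=None):
        """Return candidates not over the threshold, recording each selection."""
        now = time.monotonic() if now is None else now
        kept = []
        for url in candidates:
            if self._recent_count(url, now) > self.max_count:
                continue  # selected too often recently: filter it out
            self._times[url].append(now)
            kept.append(url)
        return kept
```

The optional `now` argument exists so that the time window can be exercised deterministically in tests.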
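The topic web page data crawling method provided by the embodiment of the present application optimizes and reasonably formulates the search strategy, preprocesses the text information in the web page content, converts the text content into a machine language model, and performs web page analysis and screening on the links to be crawled. This addresses the judgment of the relevance between the target link and the target topic and between the target page content and the target topic, and improves the precision, recall, and effectiveness of the search engine when searching by target topic. Before the search engine crawls the target links, judging the content relevance and link relevance between the links to be crawled and the target topic allows the automatic indexing system to screen out as many topic-related web pages as possible and reduces the modeling of irrelevant web pages, so that the results returned when the target topic is automatically indexed have high accuracy. Compared with search methods in the related art, the method can acquire effective information precisely.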
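Embodiment 3
FIG. 3 is a structural block diagram of a topic web page data crawling apparatus provided by an embodiment of the present application. The apparatus can be implemented by at least one of software and hardware, can generally be integrated in a computer device such as a server, and can crawl topic web page data by executing the topic web page data crawling method. As shown in FIG. 3, the apparatus includes: a to-be-crawled link selection module 31, a web page content acquisition module 32, and a target link screening module 33, where:
the to-be-crawled link selection module 31 is configured to determine a target topic according to search content input by a user, and to select links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy;
the web page content acquisition module 32 is configured to acquire the web page content corresponding to the links to be crawled;
the target link screening module 33 is configured to screen target links from the links to be crawled according to content relevance and link relevance, and to feed back the target links as search results, where the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.
The topic web page data crawling apparatus provided in the embodiment of the present application first determines a target topic according to search content input by a user, and selects links to be crawled from the to-be-crawled link queue corresponding to the target topic based on a preset search strategy; then acquires the corresponding web page content according to the links to be crawled; and finally screens target links from the links to be crawled according to content relevance and link relevance, and feeds back the target links as search results. With the above technical solution, web page content and web page links are combined to judge content relevance and link relevance, and target links are then screened from the links to be crawled, which achieves the technical effects of improving search precision and search efficiency.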
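Optionally, the web page content acquisition module 32 includes: a web page file download unit and a web page content extraction unit;
The web page file download unit is configured to simulate a client to send an access request corresponding to the link to be crawled to the corresponding server, and to download the web page file corresponding to the link to be crawled according to the received access response.
The web page content extraction unit is configured to parse the web page file to extract the web page content in it, where the web page content includes link information and text information.
Optionally, the target link screening module 33 includes: a content relevance determination unit, a link relevance determination unit, a crawled link storage unit, and a target link screening unit;
The content relevance determination unit is configured to, for all links to be crawled, determine the content relevance according to the text information in the web page content and the target topic, and, based on a judgment result that the content relevance does not meet the preset content relevance requirement, store the corresponding link to be crawled in the crawled queue.
The link relevance determination unit is configured to, for links to be crawled that meet the preset content relevance requirement, determine the link relevance according to the link information in the web page content and the target topic, and, based on a judgment result that the link relevance does not meet the preset link relevance requirement, store the corresponding link to be crawled in the crawled queue.
The target link screening unit is configured to sort the links to be crawled that meet the preset link relevance requirement according to content relevance and link relevance, and to screen out the target links according to the sorting result.
Optionally, the link relevance determination unit is further configured to implement at least one of the following steps: preprocessing the text information in the web page content to obtain machine language model data, and determining the content relevance according to the machine language model data and the target topic, where the preprocessing includes at least one of text segmentation, stop-word removal, and stemming; extracting keywords from the text information in the web page content, calculating the semantic similarity between the extracted keywords and the target topic, computing frequency statistics for the extracted keywords, and determining the content relevance according to the frequency statistics and the semantic similarity.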
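Optionally, the target link screening unit includes: a comprehensive relevance determination subunit and a target link determination subunit;
The comprehensive relevance determination subunit is configured to determine the comprehensive relevance corresponding to each link according to the content relevance and the link relevance.
The target link determination subunit is configured to sort in descending order of comprehensive relevance, and to determine as target links the links whose comprehensive relevance is greater than the first preset comprehensive relevance threshold or whose comprehensive relevance rank is smaller than the first preset rank.
Optionally, the target link screening unit further includes: a candidate link determination subunit, a target topic determination subunit, and a target link feedback subunit;
The candidate link determination subunit is configured to determine as candidate links the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than the second preset comprehensive relevance threshold, or whose comprehensive relevance rank is greater than or equal to the first preset rank and smaller than the second preset rank.
The target topic determination subunit is configured to determine a new target topic according to new search content input by the user.
The target link feedback subunit is configured to, based on a judgment result that the new target topic is the same as the target topic, screen a new target link from the candidate links and feed back the new target link as the result of the current search.
Optionally, the to-be-crawled link selection module 31 includes: a candidate to-be-crawled link selection unit and a target candidate to-be-crawled link filtering unit;
The candidate to-be-crawled link selection unit is configured to select candidate links to be crawled from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy.
The target candidate to-be-crawled link filtering unit is configured to judge whether the candidate links to be crawled contain target candidate links to be crawled, and, based on a judgment result that they do, to filter out the target candidate links contained in the candidate links to obtain the links to be crawled; the target candidate links to be crawled include candidate links that have been determined as links to be crawled more than a preset count threshold number of times within the most recent preset time period.
The topic web page data crawling apparatus provided in the embodiment of the present application can execute the topic web page data crawling method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.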
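Embodiment 4
An embodiment of the present application provides a computer device in which the topic web page data crawling apparatus provided by the embodiments of the present application can be integrated. FIG. 4 is a structural block diagram of a computer device provided by an embodiment of the present application. The computer device 40 may include: a memory 41, a processor 42, and a computer program stored on the memory 41 and executable on the processor 42, where the processor 42, when executing the computer program, implements the topic web page data crawling method described in the embodiments of the present application.
The computer device provided in the embodiment of the present application can execute the topic web page data crawling method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.
Embodiment 5
An embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a topic web page data crawling method, the method including:
determining a target topic according to search content input by a user, and selecting links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy;
acquiring the web page content corresponding to the links to be crawled;
screening target links from the links to be crawled according to content relevance and link relevance, and feeding back the target links as search results, where the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.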
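A storage medium refers to any of various types of memory devices or storage devices. The term "storage medium" may include: installation media, such as Compact Disc Read Only Memory (CD-ROM), floppy disks, or tape devices; computer system memory or random access memory (RAM), such as dynamic RAM (DRAM), double data rate RAM (DDR RAM), static RAM (SRAM), extended data output RAM (EDO RAM), Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (e.g., hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network such as the Internet. The second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that reside in different locations, for example in different computer systems connected by a network. A storage medium may store program instructions (for example, implemented as a computer program) executable by one or more processors.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present application, the computer-executable instructions are not limited to the topic web page data crawling operations described above, and can also execute related operations in the topic web page data crawling method provided in any embodiment of the present application.
The topic web page data crawling apparatus, device, and storage medium provided in the above embodiments can execute the topic web page data crawling method provided in any embodiment of the present application, and have the functional modules and beneficial effects corresponding to executing the method. For technical details not described exhaustively in the above embodiments, refer to the topic web page data crawling method provided in any embodiment of the present application.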

Claims (11)

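  1. A topic web page data crawling method, comprising:
    determining a target topic according to search content input by a user, and selecting links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy;
    acquiring the web page content corresponding to the links to be crawled;
    screening target links from the links to be crawled according to content relevance and link relevance, and feeding back the target links as search results, wherein the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.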
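  2. The method according to claim 1, wherein acquiring the web page content corresponding to the links to be crawled comprises:
    simulating a client to send an access request corresponding to the link to be crawled to the corresponding server, and downloading the web page file corresponding to the link to be crawled according to the received access response;
    parsing the web page file to extract the web page content in the web page file, wherein the web page content includes link information and text information.
  3. The method according to claim 1, wherein screening target links from the links to be crawled according to content relevance and link relevance comprises:
    for all the links to be crawled, determining the content relevance according to the text information in the web page content and the target topic, and, based on a judgment result that the content relevance does not meet a preset content relevance requirement, storing the corresponding link to be crawled in a crawled queue;
    for the links to be crawled that meet the preset content relevance requirement, determining the link relevance according to the link information in the web page content and the target topic, and, based on a judgment result that the link relevance does not meet a preset link relevance requirement, storing the corresponding link to be crawled in the crawled queue;
    sorting the links to be crawled that meet the preset link relevance requirement according to the content relevance and the link relevance, and screening out the target links according to the sorting result.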
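  4. The method according to claim 3, wherein determining the content relevance according to the text information in the web page content and the target topic comprises at least one of the following steps:
    preprocessing the text information in the web page content to obtain machine language model data, and determining the content relevance according to the machine language model data and the target topic, wherein the preprocessing includes at least one of text segmentation, stop-word removal, and stemming;
    extracting keywords from the text information in the web page content, calculating the semantic similarity between the extracted keywords and the target topic, computing frequency statistics for the extracted keywords, and determining the content relevance according to the frequency statistics and the semantic similarity.
  5. The method according to claim 3, wherein sorting according to the content relevance and the link relevance and screening out the target links according to the sorting result comprises:
    determining the comprehensive relevance corresponding to each link according to the content relevance and the link relevance;
    sorting in descending order of the comprehensive relevance, and determining as target links the links whose comprehensive relevance is greater than a first preset comprehensive relevance threshold or whose comprehensive relevance rank is smaller than a first preset rank.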
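  6. The method according to claim 5, further comprising:
    determining as candidate links the links whose comprehensive relevance is less than or equal to the first preset comprehensive relevance threshold and greater than a second preset comprehensive relevance threshold;
    determining a new target topic according to new search content input by the user;
    based on a judgment result that the new target topic is the same as the target topic, screening a new target link from the candidate links, and feeding back the new target link as the result of the current search.
  7. The method according to claim 5, further comprising:
    determining as candidate links the links whose comprehensive relevance rank is greater than or equal to the first preset rank and smaller than a second preset rank;
    determining a new target topic according to new search content input by the user;
    based on a judgment result that the new target topic is the same as the target topic, screening a new target link from the candidate links, and feeding back the new target link as the result of the current search.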
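  8. The method according to any one of claims 1 to 7, wherein selecting links to be crawled from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy comprises:
    selecting candidate links to be crawled from the to-be-crawled link queue corresponding to the target topic based on the preset search strategy;
    judging whether the candidate links to be crawled contain target candidate links to be crawled, and, based on a judgment result that the candidate links to be crawled contain target candidate links to be crawled, filtering out the target candidate links contained in the candidate links to obtain the links to be crawled; wherein the target candidate links to be crawled include candidate links that have been determined as links to be crawled more than a preset count threshold number of times within the most recent preset time period.
  9. A topic web page data crawling apparatus, comprising:
    a to-be-crawled link selection module, configured to determine a target topic according to search content input by a user, and to select links to be crawled from a to-be-crawled link queue corresponding to the target topic based on a preset search strategy;
    a web page content acquisition module, configured to acquire the web page content corresponding to the links to be crawled;
    a target link screening module, configured to screen target links from the links to be crawled according to content relevance and link relevance, and to feed back the target links as search results, wherein the content relevance is determined according to the web page content and the target topic, and the link relevance is determined according to the links to be crawled and the target topic.
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 8.
  11. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 8.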
PCT/CN2022/104188 2021-07-14 2022-07-06 主题网页数据抓取方法、装置、设备及存储介质 WO2023284612A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110793519.X 2021-07-14
CN202110793519.XA CN113449168B (zh) 2021-07-14 2021-07-14 主题网页数据抓取方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023284612A1 true WO2023284612A1 (zh) 2023-01-19

Family

ID=77816136

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104188 WO2023284612A1 (zh) 2021-07-14 2022-07-06 主题网页数据抓取方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN113449168B (zh)
WO (1) WO2023284612A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701813A (zh) * 2023-08-04 2023-09-05 北控水务(中国)投资有限公司 一种数据检索方法、系统、终端及存储介质
CN117874319A (zh) * 2024-03-11 2024-04-12 江西顶易科技发展有限公司 基于搜索引擎的信息挖掘方法、装置及计算机设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449168B (zh) * 2021-07-14 2024-02-20 北京锐安科技有限公司 主题网页数据抓取方法、装置、设备及存储介质
CN115525730B (zh) * 2022-02-27 2024-04-19 山东视角数字技术有限公司 基于页面赋权的网页内容提取方法、装置及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714140A (zh) * 2013-12-23 2014-04-09 北京锐安科技有限公司 一种基于主题网络爬虫的搜索方法及装置
CN108959413A (zh) * 2018-06-07 2018-12-07 吉林大学 一种主题网页爬取方法及主题爬虫系统
CN110569430A (zh) * 2019-08-13 2019-12-13 河北上通云天网络科技有限公司 一种移动端网络爬虫系统
CN112084390A (zh) * 2020-09-07 2020-12-15 广东赛博威信息科技有限公司 一种电商平台中利用自动结构化爬虫搜索的方法
CN113449168A (zh) * 2021-07-14 2021-09-28 北京锐安科技有限公司 主题网页数据抓取方法、装置、设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073730B (zh) * 2011-01-14 2012-09-26 哈尔滨工程大学 一种主题网络爬虫系统的构建方法
CN102646129B (zh) * 2012-03-09 2013-12-04 武汉大学 一种主题相关的分布式网络爬虫系统
CN103841173A (zh) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 一种垂直网络蜘蛛

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701813A (zh) * 2023-08-04 2023-09-05 北控水务(中国)投资有限公司 一种数据检索方法、系统、终端及存储介质
CN117874319A (zh) * 2024-03-11 2024-04-12 江西顶易科技发展有限公司 基于搜索引擎的信息挖掘方法、装置及计算机设备
CN117874319B (zh) * 2024-03-11 2024-05-17 江西顶易科技发展有限公司 基于搜索引擎的信息挖掘方法、装置及计算机设备

Also Published As

Publication number Publication date
CN113449168B (zh) 2024-02-20
CN113449168A (zh) 2021-09-28

Similar Documents

Publication Publication Date Title
WO2023284612A1 (zh) 主题网页数据抓取方法、装置、设备及存储介质
CN113711207B (zh) 用于改进的搜索查询相关性的无监督实体和意图标识
US8719262B1 (en) Identification of semantic units from within a search query
US7636714B1 (en) Determining query term synonyms within query context
Jijkoun et al. Retrieving answers from frequently asked questions pages on the web
KR100544514B1 (ko) 검색 쿼리 연관성 판단 방법 및 시스템
US9361386B2 (en) Clarification of submitted questions in a question and answer system
US7949648B2 (en) Compiling and accessing subject-specific information from a computer network
KR101443475B1 (ko) 검색 제안 클러스터링 및 프리젠테이션
US20150095300A1 (en) System and method for mark-up language document rank analysis
WO2017097231A1 (zh) 话题处理方法及装置
KR20160124079A (ko) 인-메모리 데이터베이스 탐색을 위한 시스템 및 방법
CN111522905A (zh) 一种基于数据库的文档搜索方法和装置
CN1512388A (zh) 根据机器可读词典建立概念知识的计算机系统及方法
CN108520007B (zh) 万维网网页信息提取方法、存储介质及计算机设备
CN110889023A (zh) 一种elasticsearch的分布式多功能搜索引擎
Kantorski et al. Automatic filling of hidden web forms: A survey
CN112818200A (zh) 基于静态网站的数据爬取及事件分析方法及系统
CN106326236A (zh) 一种网页内容识别方法和系统
WO2018205391A1 (zh) 信息检索准确性评估方法、系统、装置及计算机可读存储介质
CN108090200A (zh) 一种排序型隐藏网数据库数据的获取方法
JP4621680B2 (ja) 定義付けシステムおよび方法
Ganguly et al. Performance optimization of focused web crawling using content block segmentation
KR20040098889A (ko) 웹사이트 검색 서비스 제공 방법 및 그 시스템
KR100931772B1 (ko) 웹사이트 검색 서비스 제공 방법 및 그 시스템

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22841249

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE