WO2020024403A1 - Device and method for crawling target corpus data, and storage medium - Google Patents

Device and method for crawling target corpus data, and storage medium

Info

Publication number
WO2020024403A1
Authority
WO
WIPO (PCT)
Prior art keywords
rule
list
target
target information
data
Prior art date
Application number
PCT/CN2018/107489
Other languages
English (en)
Chinese (zh)
Inventor
吴壮伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020024403A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and in particular, to a method for crawling target corpus data, an electronic device, and a computer-readable storage medium.
  • The current crawling approach for informational websites is to develop a one-to-one custom crawler for each website, even though informational corpora are largely built on fixed templates. Customizing and developing a crawler for every informational webpage template leads to poor scalability and a large workload, which greatly reduces the efficiency of crawling corpus data.
  • this application provides a method, a server, and a computer-readable storage medium for crawling target corpus data, the main purpose of which is to improve the accuracy and efficiency of crawling target corpus data.
  • the present application provides a method for crawling target corpus data, which method includes:
  • S1. Receive a crawl request for target corpus data from a user, the request carrying an initial corpus, and determine the specified information crawling rule and the target information crawling rule corresponding to the crawl request for the target corpus data;
  • the present application also provides an electronic device, characterized in that the device includes: a memory and a processor, and the memory stores a crawling program of target corpus data that can be run on the processor, and the target When the crawling program of the corpus data is executed by the processor, it can implement any step in the method for crawling the target corpus data as described above.
  • the present application further provides a computer-readable storage medium, where the computer-readable storage medium includes a crawling program of target corpus data, and the crawling program of the target corpus data is executed by a processor , Can implement any step in the method of crawling target corpus data as described above.
  • The crawling method, electronic device, and computer-readable storage medium for target corpus data proposed in this application, after receiving a crawl request for target information, first determine the crawling rules required to crawl the target corpus and invoke those rules to crawl the first title page URL list, the first list page URL list, and the first content page URL list in order. They then crawl the second list page URL list corresponding to the first title page URL list to generate the third list page URL list, crawl the second content page URL list corresponding to the third list page URL list, and generate the third content page URL list to obtain the content page data. Finally, the target information crawling rule is used to crawl out the target information. Determining highly universal crawling rules improves the crawling efficiency of the target corpus data, and obtaining more comprehensive content page data makes the crawled target corpus data more accurate.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for crawling target corpus data.
  • FIG. 2 is a schematic diagram of a preferred embodiment of an electronic device of the present application.
  • FIG. 3 is a schematic diagram of program modules of a crawling program of target corpus data in FIG. 2 of the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for crawling target corpus data. The method may be performed by a device, and the device may be implemented by software and/or hardware.
  • the method for crawling target corpus data includes steps S1-S6:
  • S1. Receive a crawl request for target corpus data from a user, the request carrying an initial corpus, and determine the specified information crawling rule and the target information crawling rule corresponding to the crawl request for the target corpus data;
  • the electronic device serves as a server to establish a communication connection with a user terminal, receives a crawl request sent by the user terminal, and performs corresponding processing according to the crawl request.
  • webpage data of an informational website is taken as an example, but it is not limited to webpage data of an informational website.
  • The above crawl request includes an initial corpus, a target corpus data type, a crawl path, and the like. That is, when submitting the crawl request, the user specifies the initial corpus and the target information type in the target corpus data, that is, the content to be crawled, and then sends the crawl request to the electronic device through the user terminal. The electronic device then forwards the crawl request to a preset client.
  • The initial corpus refers to a predetermined list of URLs (Uniform Resource Locators) of web pages to be crawled, for example, a list of informational webpage URLs provided by the user when submitting the crawl request.
  • a user wants to obtain travel information from Ctrip.com: title, author, and text.
  • The user must first provide a URL list of travel notes on Ctrip.com, obtain the content pages of the travel notes according to the URL list, and then obtain the title, author, and text information from those content pages.
  • The preset client is a terminal used by the crawler engineer. After the crawl request is received, the preset general crawler framework is invoked, and the corresponding rules are obtained according to the category (for example, the information category) of the required content in the crawl request.
  • The crawler engineer configures parameters that need manual adjustment, for example, CPU resource allocation and the storage path of crawled data, which are not detailed here. The parameter configuration is then saved to the configuration file in the preset path.
  • the configuration file is an XML file.
  • the configuration file further includes a rule base, and the rule base includes a specified information capture rule and a target information capture rule.
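The patent does not disclose the configuration file's schema. As a purely hypothetical sketch (all element names below are invented for illustration), an XML file combining the manually tuned parameters with the rule base might look like:

```xml
<crawler-config>
  <!-- manually adjusted parameters -->
  <cpu-cores>4</cpu-cores>
  <storage-path>/data/corpus</storage-path>
  <!-- rule base: specified information and target information crawling rules -->
  <rule-base>
    <rule name="list_page">href="(/list/\d+\.html)"</rule>
    <rule name="title">(.*?)</rule>
  </rule-base>
</crawler-config>
```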
  • The above crawling rules are related to the specific content of the crawl request. If the target information to be crawled by the user is the title, author, and text, the crawling rules to be used in the process of crawling the target information need to be determined in advance, including the specified information crawling rules and the target information crawling rules. The specified information crawling rules include a title page crawling rule, a list page crawling rule, and a content page crawling rule; the target information crawling rules include a title crawling rule, an author crawling rule, and a body crawling rule.
  • All crawling rules are implemented through regular expressions; that is, the rule base includes a title page regular expression, a list page regular expression, a content page regular expression, a title regular expression, an author regular expression, and a body regular expression.
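As a hedged illustration of such a regex rule base (the patterns and sample HTML below are invented for demonstration and are not the patent's actual expressions), the rules could be stored as name-to-pattern pairs and applied uniformly:

```python
import re

# Hypothetical rule base: rule name -> regular expression.
# The patterns are illustrative, not the patent's actual expressions.
rule_base = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "author": r'<span class="author">(.*?)</span>',
    "content_page_url": r'href="(/travel/\d+\.html)"',
}

sample_html = (
    '<h1>Three Days in Hangzhou</h1>'
    '<span class="author">Li Wei</span>'
    '<a href="/travel/1001.html">read</a>'
)

def apply_rule(rule_name, html):
    """Apply one crawling rule to webpage source code and return all matches."""
    return re.findall(rule_base[rule_name], html)

print(apply_rule("title", sample_html))
print(apply_rule("content_page_url", sample_html))
```

The same `apply_rule` helper serves both the specified information rules (page URL lists) and the target information rules (title, author, body), which is what makes a single rule base convenient.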
  • the target information capture rule is obtained through the following steps:
  • Annotate the target information of the specified webpage data to generate a target information data set, and determine the mapping relationship between the target information data set and the initial crawling rule;
  • a5. When the matching rate is greater than or equal to a preset threshold, use the initial crawling rule as the target information crawling rule; when the matching rate is less than the preset threshold, receive an adjustment instruction for the initial crawling rule and return to step a3.
  • a regular expression list of the website template is provided in advance for the reference of the crawler engineer.
  • The initial regular expression corresponding to the target information, that is, the initial crawling rule, is determined, and the initial crawling rule of the target information is saved to the specified path.
  • For different website templates, the regular expressions corresponding to the same target content also differ. Therefore, the crawler engineer needs to adjust them according to the crawl request.
  • the initial crawling rules include: initial headline crawling rules, initial body crawling rules, and initial author crawling rules.
  • the specified webpage data may be webpage data whose webpage type is the same as that of the URL in the initial corpus, or webpage data corresponding to a preset proportion of URLs in the URL list in the initial corpus.
  • If the specified webpage data is the webpage data corresponding to a preset proportion of URLs in the URL list of the initial corpus, a preset proportion (for example, 10%) of the URLs is randomly extracted from the URL list of the initial corpus, and the corresponding webpage data is crawled and used as the designated webpage data.
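The random sampling step can be sketched as follows; this is a minimal illustration assuming a plain Python list of URLs (the function name, example URLs, and fixed seed are illustrative, not from the patent):

```python
import random

def sample_designated_urls(url_list, proportion=0.10, seed=0):
    """Randomly extract a preset proportion (default 10%) of URLs from the
    initial corpus; their webpage data serves as the designated webpage data."""
    k = max(1, round(len(url_list) * proportion))  # take at least one URL
    # Fixed seed only so the illustration is reproducible.
    return random.Random(seed).sample(url_list, k)

urls = ["https://example.com/travel/%d.html" % i for i in range(100)]
designated = sample_designated_urls(urls)
print(len(designated))  # 10
```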
  • the target information is a title
  • The title in each piece of webpage data is extracted from the designated webpage data to generate a target title data set [X1, X2, X3, ..., Xi, ..., Xn], where Xi represents the actual title of the i-th piece of webpage data in the designated webpage data, and the data set corresponds to the initial title crawling rule.
  • the target information can also be the text and author of the specified web page data.
  • The real-time title, that is, the title information extracted using the initial crawling rule, may or may not be the actual title information.
  • step a4 can be refined into the following steps:
  • If the extracted real-time title information matches the annotated title Xi, it is determined that the target title information of the i-th piece of webpage data is the same as the real-time title information; otherwise, it is determined that the target title information of the i-th piece of webpage data is different from the real-time title information.
  • The matching rate is M/T, where M is the number of designated webpages whose target information is the same as the real-time information, and T is the total number of designated webpages.
  • A threshold, for example 80%, is set in advance.
  • If the matching rate is greater than or equal to 80%, it is judged that the corresponding initial crawling rule is sufficiently universal. Therefore, this initial crawling rule is used as the target crawling rule and saved.
  • If the matching rate is less than 80%, it is judged that the corresponding initial crawling rule is not universal enough; a prompt based on the initial crawling rule is generated and fed back to the preset terminal so that the crawler engineer can adjust the initial crawling rule.
  • The adjustment instruction includes the adjusted initial crawling rule; steps a3 to a5 are then repeated.
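The matching-rate check described above can be sketched as below; a minimal illustration assuming the annotated and extracted titles are parallel Python lists (the function names and sample data are invented):

```python
def matching_rate(target_titles, realtime_titles):
    """Compute M/T: M is the number of designated webpages whose annotated
    (target) title equals the title extracted by the initial rule; T is the
    total number of designated webpages."""
    assert len(target_titles) == len(realtime_titles)
    matches = sum(t == r for t, r in zip(target_titles, realtime_titles))
    return matches / len(target_titles)

def rule_is_universal(target_titles, realtime_titles, threshold=0.80):
    """True if the initial rule passes the preset threshold and can be kept
    as the target crawling rule; False if the engineer must adjust it."""
    return matching_rate(target_titles, realtime_titles) >= threshold

annotated = ["A", "B", "C", "D", "E"]
extracted = ["A", "B", "C", "D", "X"]  # one mismatch -> rate 0.8
print(rule_is_universal(annotated, extracted))  # True (0.8 >= 0.8)
```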
  • the specified information grabbing rules in the rule base can also be obtained through the above steps, which will not be described here.
  • The first title page URL list is fetched using the title page regular expression;
  • the first list page URL list is fetched using the list page regular expression;
  • the first content page URL list is fetched using the content page regular expression.
  • The list page webpage source code is obtained, the corresponding second content page URL list is grabbed from that source code, and it is compared and merged with the first content page URL list to obtain a more comprehensive third content page URL list.
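The compare-and-merge step can be sketched as an order-preserving, de-duplicating union of the two URL lists; this is a generic illustration, not the patent's implementation:

```python
def merge_url_lists(first_list, second_list):
    """Merge the first and second content page URL lists into a more
    comprehensive third list, removing duplicates while keeping order."""
    seen = set()
    merged = []
    for url in first_list + second_list:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged

first = ["https://example.com/p/1", "https://example.com/p/2"]
second = ["https://example.com/p/2", "https://example.com/p/3"]
print(merge_url_lists(first, second))
```

The same merge applies equally to the list page URL lists in the earlier step that produces the third list page URL list.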
  • a crawler program is called to crawl the data corresponding to each content page URL in the third content page URL list obtained through the above steps to generate first corpus data.
  • the specified information capture rule further includes a target information denoising rule
  • step S6 further includes:
  • a preset target information denoising rule is invoked to perform denoising processing on the target information to determine target corpus data.
  • The target information denoising rules include a title denoising rule, an author denoising rule, and a text denoising rule, that is, a title denoising regular expression, an author denoising regular expression, and a text denoising regular expression.
  • Use the title denoising regular expression to denoise the target title information, use the author denoising regular expression to denoise the target author information, and use the body denoising regular expression to denoise the target body information.
  • the cleaned corpus data is then saved as target corpus data in the target corpus, and the target corpus is fed back to the user.
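A hedged sketch of regex-based denoising as described above; the three patterns below are invented examples of the kind of noise such rules might strip, not the patent's actual expressions:

```python
import re

# Hypothetical denoising rules: each regular expression matches boilerplate
# to remove from the captured target information (patterns are invented).
denoise_rules = {
    "title": r"\s*[-_|].*$",           # trailing site name, e.g. " - Ctrip"
    "author": r"^(Author|By)[::]\s*",  # leading "Author:" label
    "body": r"<[^>]+>",                # leftover HTML tags
}

def denoise(field, text):
    """Remove noise from one piece of target information."""
    return re.sub(denoise_rules[field], "", text).strip()

print(denoise("title", "Three Days in Hangzhou - Ctrip"))
print(denoise("author", "Author: Li Wei"))
print(denoise("body", "<p>It was sunny.</p>"))
```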
  • In the method for crawling target corpus data, after a crawl request for target information is received, the crawling rules required to crawl the target corpus are first determined and invoked to sequentially crawl the first title page URL list, the first list page URL list, and the first content page URL list. The second list page URL list corresponding to the first title page URL list is then crawled to generate the third list page URL list, and the second content page URL list corresponding to the third list page URL list is crawled to generate the third content page URL list and obtain the content page data. Finally, the target information crawling rules are used to crawl out the target information.
  • Determining highly universal crawling rules improves the crawling efficiency of the target corpus data, and obtaining more comprehensive content page data makes the crawled target corpus data more accurate.
  • FIG. 2 is a schematic diagram of a preferred embodiment of the electronic device 1 of the present application.
  • the electronic device 1 may be a terminal device having a data processing function, such as a smart phone, a tablet computer, a portable computer, a desktop computer, or the like.
  • the electronic device 1 includes a memory 11, a processor 12, and a network interface 13.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk of the electronic device 1.
  • The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, a flash card, etc. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1.
  • the memory 11 can be used not only to store application software installed in the electronic device 1 and various types of data, such as a crawling program 10 of target corpus data, etc., but also to temporarily store data that has been or will be output.
  • The processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments, and is configured to run program code stored in the memory 11 or to process data, such as the crawling program 10 of the target corpus data.
  • the network interface 13 may optionally include a standard wired interface, a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
  • FIG. 2 only shows the electronic device 1 with components 11-13. Those skilled in the art can understand that the structure shown in FIG. 2 does not constitute a limitation on the electronic device 1, which may include fewer or more components, or combine some components, or arrange components differently.
  • the electronic device 1 may further include a user interface.
  • the user interface may include a display, an input unit such as a keyboard, and the optional user interface may further include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-controlled liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like.
  • the display may also be referred to as a display screen or a display unit, for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • a crawling program 10 of target corpus data is stored in a memory 11 as a computer storage medium, and the processor 12 executes a crawling program of target corpus data stored in the memory 11.
  • the following steps are implemented:
  • A1. Receive a crawl request for target corpus data from a user, the request carrying an initial corpus, and determine the specified information crawling rule and the target information crawling rule corresponding to the crawl request for the target corpus data;
  • the electronic device serves as a server to establish a communication connection with the user terminal, receives a crawl request sent by the user terminal, and performs corresponding processing according to the crawl request.
  • webpage data of an informational website is taken as an example, but it is not limited to webpage data of an informational website.
  • The above crawl request includes an initial corpus, a target corpus data type, a crawl path, and the like. That is, when submitting the crawl request, the user specifies the initial corpus and the target information type in the target corpus data, that is, the content to be crawled, and then sends the crawl request to the electronic device through the user terminal. The electronic device then forwards the crawl request to a preset client.
  • the initial corpus refers to a predetermined list of web page URLs to be crawled, for example, an informational web page URL list provided by the user when submitting a crawl request. Through the URL list of web pages in the initial corpus, you can get the title page, list page, and content page of each web page.
  • a user wants to obtain travel information from Ctrip.com: title, author, and text.
  • The user must first provide a URL list of travel notes on Ctrip.com, obtain the content pages of the travel notes according to the URL list, and then obtain the title, author, and text information from those content pages.
  • The preset client is a terminal used by the crawler engineer. After the crawl request is received, the preset general crawler framework is invoked, and the corresponding rules are obtained according to the category (for example, the information category) of the required content in the crawl request.
  • The crawler engineer configures parameters that need manual adjustment, for example, CPU resource allocation and the storage path of crawled data, which are not detailed here. The parameter configuration is then saved to the configuration file in the preset path.
  • the configuration file is an XML file.
  • the configuration file further includes a rule base, and the rule base includes a specified information capture rule and a target information capture rule.
  • The above crawling rules are related to the specific content of the crawl request. If the target information to be crawled by the user is the title, author, and text, the crawling rules to be used in the process of crawling the target information need to be determined in advance.
  • All crawling rules are implemented through regular expressions; that is, the rule base includes a title page regular expression, a list page regular expression, a content page regular expression, a title regular expression, an author regular expression, and a body regular expression.
  • the target information capture rule is obtained through the following steps:
  • Annotate the target information of the specified webpage data to generate a target information data set, and determine the mapping relationship between the target information data set and the initial crawling rule;
  • a5. When the matching rate is greater than or equal to a preset threshold, use the initial crawling rule as the target information crawling rule; when the matching rate is less than the preset threshold, receive an adjustment instruction for the initial crawling rule and return to step a3.
  • a regular expression list of the website template is provided in advance for the reference of the crawler engineer.
  • The initial regular expression corresponding to the target information, that is, the initial crawling rule, is determined, and the initial crawling rule of the target information is saved to the specified path.
  • For different website templates, the regular expressions corresponding to the same target content also differ. Therefore, the crawler engineer needs to adjust them according to the crawl request.
  • the initial crawling rules include: initial headline crawling rules, initial body crawling rules, and initial author crawling rules.
  • the specified webpage data may be webpage data whose webpage type is the same as that of the URL in the initial corpus, or webpage data corresponding to a preset proportion of URLs in the URL list in the initial corpus.
  • If the specified webpage data is the webpage data corresponding to a preset proportion of URLs in the URL list of the initial corpus, a preset proportion (for example, 10%) of the URLs is randomly extracted from the URL list of the initial corpus, and the corresponding webpage data is crawled and used as the designated webpage data.
  • the target information is a title
  • The title in each piece of webpage data is extracted from the designated webpage data to generate a target title data set [X1, X2, X3, ..., Xi, ..., Xn], where Xi represents the actual title of the i-th piece of webpage data in the designated webpage data, and the data set corresponds to the initial title crawling rule.
  • the target information can also be the text and author of the specified web page data.
  • The real-time title, that is, the title information extracted using the initial crawling rule, may or may not be the actual title information.
  • step a4 can be refined into the following steps:
  • If the extracted real-time title information matches the annotated title Xi, it is determined that the target title information of the i-th piece of webpage data is the same as the real-time title information; otherwise, it is determined that the target title information of the i-th piece of webpage data is different from the real-time title information.
  • The matching rate is M/T, where M is the number of designated webpages whose target information is the same as the real-time information, and T is the total number of designated webpages.
  • A threshold, for example 80%, is set in advance.
  • If the matching rate is greater than or equal to 80%, it is judged that the corresponding initial crawling rule is sufficiently universal. Therefore, this initial crawling rule is used as the target crawling rule and saved.
  • If the matching rate is less than 80%, it is judged that the corresponding initial crawling rule is not universal enough; a prompt based on the initial crawling rule is generated and fed back to the preset terminal so that the crawler engineer can adjust the initial crawling rule.
  • The adjustment instruction includes the adjusted initial crawling rule; steps a3 to a5 are then repeated.
  • the specified information grabbing rules in the rule base can also be obtained through the above steps, which will not be described here.
  • The first title page URL list is fetched using the title page regular expression;
  • the first list page URL list is fetched using the list page regular expression;
  • the first content page URL list is fetched using the content page regular expression.
  • A5 Combine the first content page URL list and the second content page URL list to generate a third content page URL list, grab content page data corresponding to the third content page URL list, and generate first corpus data;
  • The list page webpage source code is obtained, the corresponding second content page URL list is grabbed from that source code, and it is compared and merged with the first content page URL list to obtain a more comprehensive third content page URL list.
  • a crawler program is invoked to crawl data corresponding to each content page URL in the third content page URL list obtained through the above steps to generate first corpus data.
  • A6 Invoking the target information capture rule, capturing target information from the first corpus data, determining target corpus data, and sending the target corpus data to a user.
  • the specified information capture rule further includes a target information denoising rule
  • step A6 further includes:
  • a preset target information denoising rule is invoked to perform denoising processing on the target information to determine target corpus data.
  • The target information denoising rules include a title denoising rule, an author denoising rule, and a text denoising rule, that is, a title denoising regular expression, an author denoising regular expression, and a text denoising regular expression.
  • Use the title denoising regular expression to denoise the target title information, use the author denoising regular expression to denoise the target author information, and use the body denoising regular expression to denoise the target body information.
  • the cleaned corpus data is then saved as target corpus data in the target corpus, and the target corpus is fed back to the user.
  • After receiving a crawl request for target information, the electronic device 1 first determines the crawling rules required to crawl the target corpus and invokes them to crawl the first title page URL list, the first list page URL list, and the first content page URL list. It then crawls the second list page URL list corresponding to the first title page URL list, generates the third list page URL list, crawls the second content page URL list corresponding to the third list page URL list, and generates the third content page URL list to obtain the content page data. Finally, the target information crawling rules are used to crawl out the target information.
  • Determining highly universal crawling rules improves the crawling efficiency of the target corpus data.
  • Obtaining more comprehensive content page data makes the crawled target corpus data more accurate.
  • The crawling program 10 of the target corpus data may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, by the processor 12) to complete the present application.
  • the modules referred to in the present application refer to a series of computer program instruction segments capable of performing specific functions.
  • Referring to FIG. 3, it is a schematic block diagram of the crawling program 10 of target corpus data in FIG. 2.
  • The crawling program 10 of target corpus data may be divided into a receiving module 110, a first crawling module 120, a second crawling module 130, a third crawling module 140, a fourth crawling module 150, and a sending module 160.
  • The functions or operation steps implemented by the modules 110-160 are similar to those described above and are not detailed here. For example:
  • The receiving module 110 is configured to receive a crawl request for target corpus data from a user, the request carrying an initial corpus, and determine the specified information crawling rule and the target information crawling rule corresponding to the crawl request for the target corpus data;
  • The first crawling module 120 is configured to read the title page crawling rule, the list page crawling rule, and the content page crawling rule in the specified information crawling rules, and respectively crawl the first title page URL list, the first list page URL list, and the first content page URL list corresponding to the URL (Uniform Resource Locator) list in the initial corpus;
  • The second crawling module 130 is configured to access the title pages in the first title page URL list, invoke the list page crawling rule, and crawl the second list page URL list corresponding to the first title page URL list;
  • The third crawling module 140 is configured to combine the first list page URL list and the second list page URL list to generate a third list page URL list, invoke the content page crawling rule, and crawl the second content page URL list corresponding to the third list page URL list;
  • The fourth crawling module 150 is configured to combine the first content page URL list and the second content page URL list to generate a third content page URL list, and crawl content page data corresponding to the third content page URL list to generate the first corpus data;
  • the sending module 160 is configured to invoke the target information capture rule, capture target information from the first corpus data, determine target corpus data, and send the target corpus data to a user.
  • An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium includes a crawling program 10 of target corpus data, and when the crawling program 10 of the target corpus data is executed by a processor, the following steps are implemented:
  • A1. Receive a crawl request for target corpus data from a user, the request carrying an initial corpus, and determine the specified information crawling rule and the target information crawling rule corresponding to the crawl request for the target corpus data;
  • A5 Combine the first content page URL list and the second content page URL list to generate a third content page URL list, grab content page data corresponding to the third content page URL list, and generate first corpus data;
  • A6. Invoke the target information capture rule, capture target information from the first corpus data, determine the target corpus data, and send the target corpus data to the user.
  • the specific implementation of the computer-readable storage medium of the present application is substantially the same as that of the above-mentioned target corpus data crawling method, and details are not repeated here.
  • the methods in the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is preferable.
  • based on this understanding, the technical solution of this application, in essence or the part that contributes to the existing technology, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application.
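The four-stage pipeline described by modules 110-160 (title page URLs, list page URLs, content page URLs, then target-information filtering) can be sketched as a small crawler skeleton. The following is a minimal, self-contained illustration only: the in-memory `SITE` map, the `fetch_links` helper, and the rule callables are hypothetical stand-ins, since the application does not prescribe concrete rule formats or network code.

```python
# Minimal in-memory sketch of the crawl pipeline (modules 110-160 / steps
# A1-A6). SITE, fetch_links, and the rule callables are hypothetical
# stand-ins for the scraping rules named in the application.

# Hypothetical "site": each non-content URL maps to the links it contains,
# and each content page URL maps to its page text.
SITE = {
    "seed": ["title/1", "title/2"],   # initial corpus -> title page URLs
    "title/1": ["list/a"],            # title pages -> list page URLs
    "title/2": ["list/b"],
    "list/a": ["content/x"],          # list pages -> content page URLs
    "list/b": ["content/y"],
    "content/x": "corpus text with TARGET term",
    "content/y": "plain corpus text",
}

def fetch_links(urls):
    """Stand-in for a scraping rule: collect the links found on each page."""
    found = []
    for url in urls:
        value = SITE.get(url, [])
        if isinstance(value, list):
            found.extend(value)
    return found

def dedupe(urls):
    """Merge step: drop duplicate URLs while preserving crawl order."""
    return list(dict.fromkeys(urls))

def crawl_target_corpus(seed_urls, target_rule):
    # First title page URL list, scraped from the initial corpus.
    title_urls = dedupe(fetch_links(seed_urls))
    # List page URLs scraped from the title pages, merged and deduplicated
    # (the "third list page URL list" of the application).
    list_urls = dedupe(fetch_links(title_urls))
    # Content page URLs from the merged list pages ("third content page URL list").
    content_urls = dedupe(fetch_links(list_urls))
    # Capture the content page data to form the first corpus data.
    corpus = [SITE[u] for u in content_urls if isinstance(SITE.get(u), str)]
    # Apply the target information capture rule to keep matching documents.
    return [doc for doc in corpus if target_rule(doc)]

print(crawl_target_corpus(["seed"], lambda text: "TARGET" in text))
# -> ['corpus text with TARGET term']
```

The merge-and-deduplicate step mirrors the "combine ... to generate a third ... URL list" language above; a production crawler would replace `fetch_links` with real HTTP fetching and rule-driven link extraction.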

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a method for crawling target corpus data. The method comprises: after receiving a crawl request for target information, first determining the crawl rules required for crawling a target corpus, and invoking the crawl rules to sequentially crawl a first title page URL list, a first list page URL list, and a first content page URL list from an initial corpus; then crawling a second list page URL list corresponding to the first title page URL list, generating a third list page URL list, and generating a third content page URL list so as to acquire the content page data; and finally, using a target information crawl rule to crawl the target information so as to generate the target corpus data. The invention also relates to an electronic device and a computer storage medium. The above method improves the efficiency and accuracy of crawling target corpus data.
PCT/CN2018/107489 2018-08-03 2018-09-26 Target corpus data crawling method and apparatus, and storage medium WO2020024403A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810876287.2A CN109190062B (zh) 2018-08-03 2018-08-03 Target corpus data crawling method, apparatus, and storage medium
CN201810876287.2 2018-08-03

Publications (1)

Publication Number Publication Date
WO2020024403A1 (fr) 2020-02-06

Family

ID=64920024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/107489 WO2020024403A1 (fr) 2018-08-03 2018-09-26 Target corpus data crawling method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN109190062B (fr)
WO (1) WO2020024403A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918486B (zh) * 2019-01-24 2024-03-19 Ping An Technology (Shenzhen) Co., Ltd. Corpus construction method and apparatus for intelligent customer service, computer device, and storage medium
CN112818212B (zh) * 2020-04-23 2023-10-13 Tencent Technology (Shenzhen) Co., Ltd. Corpus data collection method and apparatus, computer device, and storage medium
CN112052320B (zh) * 2020-09-01 2023-09-29 Tencent Technology (Shenzhen) Co., Ltd. Information processing method and apparatus, and computer-readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020043A (zh) * 2012-11-16 2013-04-03 Harbin Institute of Technology Distributed collection system for web-oriented bilingual parallel corpus resources
CN103793509A (zh) * 2014-01-27 2014-05-14 Beijing Qihoo Technology Co., Ltd. Image group crawling method and apparatus
CN107885777A (zh) * 2017-10-11 2018-04-06 Beijing Zhihui Xingguang Information Technology Co., Ltd. Control method and system for crawling web page data based on collaborative crawlers

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120042529A (ko) * 2010-10-25 2012-05-03 Samsung Electronics Co., Ltd. Web page crawling method and apparatus
CN107704515A (zh) * 2017-09-01 2018-02-16 Anhui Jiandao Technology Co., Ltd. Data crawling method based on an Internet data crawling system
CN108334585A (zh) * 2018-01-29 2018-07-27 Hubei Chutian Cloud Co., Ltd. Web crawler method, apparatus, and electronic device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361362A (zh) * 2023-05-30 2023-06-30 Jiangxi Dingyi Technology Development Co., Ltd. User information mining method and system based on web page content recognition
CN116361362B (zh) * 2023-05-30 2023-08-11 Jiangxi Dingyi Technology Development Co., Ltd. User information mining method and system based on web page content recognition

Also Published As

Publication number Publication date
CN109190062B (zh) 2023-04-07
CN109190062A (zh) 2019-01-11

Similar Documents

Publication Publication Date Title
WO2020024403A1 (fr) Target corpus data crawling method and apparatus, and storage medium
CN106991154B (zh) 网页渲染方法、装置、终端及服务器
WO2021008030A1 (fr) Procédé et dispositif de configuration d'un formulaire web et support de stockage lisible par ordinateur
US8095534B1 (en) Selection and sharing of verified search results
CN109145078B (zh) 对本机应用的应用页面建索引
CA2684822C (fr) Procede de conversion de donnees base sur un document de conception technique
WO2019041521A1 (fr) Appareil et procédé d'extraction de mot-clé d'utilisateur et support de mémoire lisible par ordinateur
WO2017071189A1 (fr) Dispositif, appareil et procédé d'accès à une page web et support de stockage informatique non volatil
US9934206B2 (en) Method and apparatus for extracting web page content
RU2595524C2 (ru) Устройство и способ обработки содержимого веб-ресурса в браузере
TWI584149B (zh) Web page access request response method and device
TWI683225B (zh) 腳本生成方法與裝置
US10073826B2 (en) Providing action associated with event detected within communication
WO2020015170A1 (fr) Appareil et procédé d'invocation d'interface, et support d'informations lisible par ordinateur
WO2015010566A1 (fr) Procédé pour rechercher de manière précise des informations complètes
TW201800962A (zh) 網頁文件發送方法、網頁渲染方法及裝置、網頁渲染系統
WO2019205374A1 (fr) Procédé d'apprentissage de modèle en ligne, serveur, et support de stockage
US8290928B1 (en) Generating sitemap where last modified time is not available to a network crawler
RU2653302C2 (ru) Система для обеспечения потока операций бизнес-процесса
TWI539302B (zh) 用於網路服務的延後資源當地語系化連結
WO2023155712A1 (fr) Procédé et appareil de génération de page, procédé et appareil d'affichage de page, ainsi que dispositif électronique et support de stockage
TW201610713A (zh) 在文件中識別且呈現相關報告實物
US20130081010A1 (en) Template and server content download using protocol handlers
WO2012151752A1 (fr) Annotation de résultats de recherche avec des images
US10061796B2 (en) Native application content verification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18928429

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 25.03.2021.)

122 Ep: pct application non-entry in european phase

Ref document number: 18928429

Country of ref document: EP

Kind code of ref document: A1