CN106649322A - Method and device for crawling keyword category information from electronic business websites - Google Patents


Info

Publication number
CN106649322A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510719610.1A
Other languages
Chinese (zh)
Inventor
郭秦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510719610.1A
Publication of CN106649322A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method and device for crawling keyword category information from electronic business websites. It relates to the technical field of the Internet and mainly aims at improving the efficiency of crawling keyword category information from electronic business websites. According to the main technical scheme, a search URL of the electronic business website is constructed from information about the website and a keyword whose category information is to be crawled; the constructed search URL is accessed to obtain the page information of the webpage corresponding to the URL; and the page information is parsed to extract the information in the page that describes the keyword category on the electronic business website, thereby obtaining the keyword category information of the website. The method and device are mainly used for crawling keyword category information from electronic business websites.

Description

Method and device for crawling keyword category information from e-commerce websites
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for crawling keyword category information from e-commerce websites.
Background
Keyword category information is highly important. For an e-commerce website in particular, correctly identifying the category to which a user's search keyword belongs is of great significance both to the website itself and to search engine marketing. Here, "category" is used in the e-commerce sense: commodities are divided into classes according to their attributes, and multi-level categories can be formed along different dimensions.
Web crawlers are a common and widely used technology on the Internet. Many companies and individuals use web crawlers to collect information from the World Wide Web in batches and on a large scale. A general-purpose web crawler typically works as follows: it maintains a list of URLs (Uniform Resource Locators); an initial URL is first added to the list; the crawler then traverses each URL in the list, fetches the page corresponding to that URL, extracts the URLs contained in the page, and adds them to the URL list.
At present, keyword category information is usually crawled from e-commerce websites with such a general-purpose web crawler. Because an e-commerce website carries a wide variety of commodity information and different commodities correspond to different pages, obtaining the category information of the commodities corresponding to different keywords requires repeatedly extracting URLs from newly crawled pages, maintaining them in the URL list, and only then fetching the corresponding pages. As a result, crawling keyword category information from e-commerce websites is inefficient.
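The general-purpose crawl loop described in the background can be sketched roughly as follows. This is only an illustration of the prior-art approach, not code from the patent: the function name `generic_crawl`, the `fetch` callable, the `max_pages` limit and the simple `href` regular expression are all assumptions made for the example.

```python
import re
from collections import deque

# Naive link pattern, an assumption for this sketch; real pages need a proper HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"')


def generic_crawl(seed_url, fetch, max_pages=100):
    """Crawl from a seed URL by repeatedly fetching pages and harvesting their links.

    `fetch` is any callable that maps a URL to the page's HTML text.
    """
    url_list = deque([seed_url])   # the maintained URL list
    seen = {seed_url}              # avoids queueing the same URL twice
    pages = {}
    while url_list and len(pages) < max_pages:
        url = url_list.popleft()
        html = fetch(url)                      # fetch the page for this URL
        pages[url] = html
        for link in HREF_RE.findall(html):     # extract URLs from the page
            if link not in seen:
                seen.add(link)
                url_list.append(link)          # update the URL list
    return pages
```

Every page of interest is reached only after some other page that links to it has been fetched and scanned, which is the overhead the invention sets out to avoid.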
Summary of the invention
In view of this, the present invention provides a method and device for crawling keyword category information from e-commerce websites, the main purpose of which is to improve the efficiency of crawling such information.
To achieve the above purpose, the present invention provides the following technical solutions:
In one aspect, the present invention provides a method for crawling keyword category information from e-commerce websites, comprising:
constructing a search Uniform Resource Locator (URL) of the e-commerce website according to e-commerce website information and a keyword whose category information is to be crawled;
accessing the constructed search URL of the e-commerce website to obtain the page information of the webpage corresponding to the URL;
parsing the page information of the webpage and extracting the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
In another aspect, the present invention provides a device for crawling keyword category information from e-commerce websites, comprising:
a construction unit, configured to construct a search Uniform Resource Locator (URL) of the e-commerce website according to e-commerce website information and a keyword whose category information is to be crawled;
an access unit, configured to access the constructed search URL of the e-commerce website and obtain the page information of the webpage corresponding to the URL;
a parsing unit, configured to parse the page information of the webpage and extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
In the method and device provided by the present invention, the URLs of the webpages from which keyword category information is crawled are not extracted from known webpages but are constructed from the e-commerce website information and the keywords whose category information is to be crawled. Compared with the prior art, this eliminates the steps of extracting URLs from known webpages, storing them in a URL list, and only then crawling the corresponding webpages. It thereby improves the efficiency of crawling webpages to a certain extent, and in turn the efficiency of crawling keyword category information from e-commerce websites.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for crawling keyword category information from e-commerce websites provided by an embodiment of the present invention;
Fig. 2 shows a block diagram of a device for crawling keyword category information from e-commerce websites provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a method for crawling keyword category information from e-commerce websites. As shown in Fig. 1, the method includes:
101. Construct a search URL of the e-commerce website according to the e-commerce website information and the keyword whose category information is to be crawled.
It should be noted that accessing the URL corresponding to a keyword whose category information is to be crawled returns the same page as entering that keyword into the search function of the e-commerce website. In general, the search URL of an e-commerce website has a format like http://search.XXX.com/SearchKeyword=YYY, where XXX is the domain name of the e-commerce website and YYY is the specific keyword whose category information is to be crawled.
Based on this principle, the e-commerce website information in the embodiment of the present invention may be, but is not limited to, the domain name of the e-commerce website. Constructing the search URL of the e-commerce website may then consist of building, from the domain name information of the e-commerce website and the keyword whose category information is to be crawled, a search URL in the format shown above. For each input keyword, the YYY part of the URL is replaced to construct the corresponding search URL.
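Step 101 amounts to filling a URL template. The following is a minimal sketch that follows the format given in the text literally; the names `SEARCH_URL_TEMPLATE`, `build_search_url` and `build_search_urls` are hypothetical, and percent-encoding the keyword with `urllib.parse.quote` is an assumption added here, since the source does not say how non-ASCII or special characters in keywords are handled. A real site's search URL may well differ from this pattern.

```python
from urllib.parse import quote

# Template taken from the format described in the text:
# XXX is the site's domain name, YYY is the keyword.
SEARCH_URL_TEMPLATE = "http://search.{domain}.com/SearchKeyword={keyword}"


def build_search_url(domain, keyword):
    """Replace the XXX and YYY placeholders to get the search URL for one keyword."""
    # Percent-encoding is an assumption of this sketch, not something the source specifies.
    return SEARCH_URL_TEMPLATE.format(domain=domain, keyword=quote(keyword, safe=""))


def build_search_urls(domain, keywords):
    """Construct one search URL per input keyword, keyed by keyword."""
    return {keyword: build_search_url(domain, keyword) for keyword in keywords}
```

For example, `build_search_url("example", "usb cable")` yields `http://search.example.com/SearchKeyword=usb%20cable`. No page has to be fetched to discover the URL, which is the point of the construction step.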
102. Access the constructed search URL of the e-commerce website to obtain the page information of the webpage corresponding to the URL.
Further, to speed up access to the constructed search URLs, the URLs may be accessed in batches. For example, the constructed search URLs may be accessed in batches through a network library available for the programming language (such as the requests library in Python). Specifically, a multithreading approach may be used, in which multiple threads concurrently access the constructed search URLs in batches to obtain the page information of the corresponding webpages. Other batch access methods may of course be used; the embodiment of the present invention does not limit this in a specific implementation.
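Concurrent batch access with multiple threads might look like the sketch below. The source names the requests library; this example swaps in the standard library's `urllib.request` so that it runs without third-party packages, and the names `fetch_page` and `fetch_all`, the thread count and the timeout are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its body as text (standard-library stand-in for requests)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Access a batch of search URLs concurrently and return {url: page text}.

    `fetch` can be replaced by any callable taking a URL and returning text.
    Note that an exception raised for any single URL propagates to the caller.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, urls))   # results come back in input order
    return dict(zip(urls, pages))
```

Because every search URL is known in advance, the whole batch can be handed to the thread pool at once. With requests installed, a callable such as `lambda url: requests.get(url, timeout=10).text` could be passed as `fetch` instead.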
It should be noted that the page information obtained for the webpage corresponding to the URL may be in the form of HTML (HyperText Markup Language) code; the embodiment of the present invention does not specifically limit this. However, to facilitate subsequent parsing of the page information, the embodiment of the present invention preferably uses page information in the form of HTML code.
103. Parse the page information of the webpage and extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
It should be noted that the way in which the page information is parsed to extract the information describing the keyword category differs according to the format of the page information obtained.
For example, when the page information is in the form of HTML code, the HTML code is parsed directly to extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information. Specifically, the lxml package in Python may be used to extract this information according to the page's CSS information (CSS, Cascading Style Sheets, is a computer language for expressing the style of documents such as HTML or XML, the latter being a subset of the Standard Generalized Markup Language).
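The parsing step can be illustrated with a small extractor that collects the text of elements carrying a given CSS class. The source names the lxml package; this sketch deliberately swaps in the standard library's `html.parser` so that it runs without extra dependencies. The class name `category` is purely hypothetical, since the markup that actually holds the category information depends on the site being crawled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CategoryExtractor(HTMLParser):
    """Collect the text of every element whose class attribute contains `css_class`."""

    def __init__(self, css_class="category"):
        super().__init__()
        self.css_class = css_class
        self.categories = []
        self._depth = 0      # nesting depth inside a matching element
        self._buffer = []    # text fragments of the current matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.css_class in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:             # the matching element has just closed
            text = "".join(self._buffer).strip()
            if text:
                self.categories.append(text)
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_categories(html, css_class="category"):
    """Return the category strings found in one search-result page."""
    parser = CategoryExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.categories
```

With lxml, the same selection would more likely be written by parsing the page with `lxml.html.fromstring` and querying the resulting tree with a CSS or XPath selector; the hand-written parser above only stands in for that step.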
In the embodiment of the present invention, the URLs of the webpages from which keyword category information is crawled are not extracted from known webpages but are constructed from the e-commerce website information and the keywords whose category information is to be crawled. Compared with the prior art, this eliminates the steps of extracting URLs from known webpages, storing them in a URL list, and only then crawling the corresponding webpages. It thereby improves the efficiency of crawling webpages to a certain extent, and in turn the efficiency of crawling keyword category information from e-commerce websites.
Moreover, in the embodiment of the present invention the constructed search URLs can be accessed in batches, which further improves the efficiency of crawling webpages and in turn the efficiency of crawling keyword category information from e-commerce websites.
Based on the above method embodiment, an embodiment of the present invention also provides a device for crawling keyword category information from e-commerce websites. As shown in Fig. 2, the device includes:
Construction unit 21, configured to construct a search URL of the e-commerce website according to the e-commerce website information and the keyword whose category information is to be crawled. Accessing the URL corresponding to a keyword whose category information is to be crawled returns the same page as entering that keyword into the search function of the e-commerce website. In general, the search URL of an e-commerce website has a format like:
http://search.XXX.com/SearchKeyword=YYY, where XXX is the domain name of the e-commerce website and YYY is the specific keyword whose category information is to be crawled.
Based on this principle, the e-commerce website information in the embodiment of the present invention may be, but is not limited to, the domain name of the e-commerce website. Constructing the search URL of the e-commerce website may then consist of building, from the domain name information of the e-commerce website and the keyword whose category information is to be crawled, a search URL in the format shown above. For each input keyword, the YYY part of the URL is replaced to construct the corresponding search URL.
Access unit 22, configured to access the constructed search URL of the e-commerce website and obtain the page information of the webpage corresponding to the URL. Further, to speed up access to the constructed search URLs, the URLs may be accessed in batches. For example, the constructed search URLs may be accessed in batches through a network library available for the programming language (such as the requests library in Python). Specifically, a multithreading approach may be used, in which multiple threads concurrently access the constructed search URLs in batches to obtain the page information of the corresponding webpages. Other batch access methods may of course be used; the embodiment of the present invention does not limit this in a specific implementation.
It should be noted that the page information obtained for the webpage corresponding to the URL may be in the form of HTML code; the embodiment of the present invention does not specifically limit this. However, to facilitate subsequent parsing of the page information, the embodiment of the present invention preferably uses page information in the form of HTML code.
Parsing unit 23, configured to parse the page information of the webpage and extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information. It should be noted that the way in which the page information is parsed to extract this information differs according to the format of the page information obtained.
For example, when the page information is in the form of HTML code, the HTML code is parsed directly to extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information. Specifically, the lxml package in Python may be used to extract this information according to the page's CSS information.
In the embodiment of the present invention, the URLs of the webpages from which keyword category information is crawled are not extracted from known webpages but are constructed from the e-commerce website information and the keywords whose category information is to be crawled. Compared with the prior art, this eliminates the steps of extracting URLs from known webpages, storing them in a URL list, and only then crawling the corresponding webpages. It thereby improves the efficiency of crawling webpages to a certain extent, and in turn the efficiency of crawling keyword category information from e-commerce websites.
Moreover, in the embodiment of the present invention the constructed search URLs can be accessed in batches, which further improves the efficiency of crawling webpages and in turn the efficiency of crawling keyword category information from e-commerce websites.
The device for crawling keyword category information from e-commerce websites includes a processor and a memory. The construction unit, access unit, parsing unit and so on described above are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
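One way to picture the three program units working together is a small class that holds each unit as a replaceable callable. This is only a structural sketch: the class name `CategoryCrawler` and its field names are invented for illustration, and the patent does not prescribe any particular code organization.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class CategoryCrawler:
    """Hypothetical composition of the construction, access and parsing units."""

    construct: Callable[[str, str], str]   # (site info, keyword) -> search URL
    access: Callable[[str], str]           # search URL -> page information
    parse: Callable[[str], List[str]]      # page information -> category information

    def crawl(self, site_info: str, keywords: Iterable[str]) -> Dict[str, List[str]]:
        """Run construct -> access -> parse for each keyword."""
        result = {}
        for keyword in keywords:
            url = self.construct(site_info, keyword)
            page = self.access(url)
            result[keyword] = self.parse(page)
        return result
```

Keeping the units as separate callables mirrors the description: any one of them, for instance the access step, can be replaced by a batch or multithreaded variant without touching the other two.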
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the efficiency of crawling keyword category information from e-commerce websites is improved by adjusting kernel parameters.
The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: constructing a search Uniform Resource Locator (URL) of the e-commerce website according to e-commerce website information and a keyword whose category information is to be crawled; accessing the constructed search URL of the e-commerce website to obtain the page information of the webpage corresponding to the URL; and parsing the page information of the webpage and extracting the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. The application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
The present application is described with reference to flow charts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the application shall fall within the scope of the claims of the application.

Claims (10)

1. A method for crawling keyword category information from e-commerce websites, characterized by comprising:
constructing a search Uniform Resource Locator (URL) of the e-commerce website according to e-commerce website information and a keyword whose category information is to be crawled;
accessing the constructed search URL of the e-commerce website to obtain the page information of the webpage corresponding to the URL;
parsing the page information of the webpage and extracting the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
2. The method according to claim 1, characterized in that the e-commerce website information includes the domain name of the e-commerce website, and constructing the search URL of the e-commerce website according to the e-commerce website information and the keyword whose category information is to be crawled comprises:
constructing, according to the domain name information of the e-commerce website and the keyword whose category information is to be crawled, a search URL of the e-commerce website in the following format:
http://search.XXX.com/SearchKeyword=YYY
where XXX is the domain name of the e-commerce website and YYY is the specific keyword whose category information is to be crawled.
3. The method according to claim 1 or 2, characterized in that accessing the constructed search URL of the e-commerce website to obtain the page information of the webpage corresponding to the URL comprises:
accessing the constructed search URLs of the e-commerce website in batches to obtain the page information of the webpages corresponding to the URLs.
4. The method according to claim 3, characterized in that accessing the constructed search URLs of the e-commerce website in batches to obtain the page information of the webpages corresponding to the URLs comprises:
concurrently accessing the constructed search URLs of the e-commerce website in batches by means of multiple threads to obtain the page information of the webpages corresponding to the URLs.
5. The method according to claim 4, characterized in that the page information is page information in the form of HyperText Markup Language (HTML) code.
6. The method according to claim 5, characterized in that parsing the page information of the webpage and extracting the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information, comprises:
parsing the HTML code directly and extracting the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
7. A device for crawling keyword category information from e-commerce websites, characterized by comprising:
a construction unit, configured to construct a search Uniform Resource Locator (URL) of the e-commerce website according to e-commerce website information and a keyword whose category information is to be crawled;
an access unit, configured to access the constructed search URL of the e-commerce website and obtain the page information of the webpage corresponding to the URL;
a parsing unit, configured to parse the page information of the webpage and extract the information in the page that describes the keyword category on the e-commerce website, thereby obtaining the e-commerce website keyword category information.
8. The device according to claim 7, characterized in that the e-commerce website information includes the domain name of the e-commerce website, and the construction unit is specifically configured to:
construct, according to the domain name information of the e-commerce website and the keyword whose category information is to be crawled, a search URL of the e-commerce website in the following format:
http://search.XXX.com/SearchKeyword=YYY
where XXX is the domain name of the e-commerce website and YYY is the specific keyword whose category information is to be crawled.
9. The device according to claim 7 or 8, characterized in that the access unit is configured to access the constructed search URLs of the e-commerce website in batches to obtain the page information of the webpages corresponding to the URLs.
10. The device according to claim 9, characterized in that accessing the constructed search URLs of the e-commerce website in batches to obtain the page information of the webpages corresponding to the URLs comprises:
concurrently accessing the constructed search URLs of the e-commerce website in batches by means of multiple threads to obtain the page information of the webpages corresponding to the URLs.
CN201510719610.1A 2015-10-29 2015-10-29 Method and device for crawling keyword category information from electronic business websites Pending CN106649322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510719610.1A CN106649322A (en) 2015-10-29 2015-10-29 Method and device for crawling keyword category information from electronic business websites

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510719610.1A CN106649322A (en) 2015-10-29 2015-10-29 Method and device for crawling keyword category information from electronic business websites

Publications (1)

Publication Number Publication Date
CN106649322A true CN106649322A (en) 2017-05-10

Family

ID=58830257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510719610.1A Pending CN106649322A (en) 2015-10-29 2015-10-29 Method and device for crawling keyword category information from electronic business websites

Country Status (1)

Country Link
CN (1) CN106649322A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555176A (en) * 2018-03-30 2019-12-10 佛山市优特美邦电子商务有限公司 E-commerce platform constructed by adopting internet commodity data analysis and collection method
CN111368174A (en) * 2020-03-09 2020-07-03 北京九州云动科技有限公司 Searching method and device supporting multi-provider platform commodity URL or commodity password

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012113658A (en) * 2010-11-26 2012-06-14 Ntt Docomo Inc Data prefetch system, and device, method, and program therefor
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN102982174A (en) * 2012-12-17 2013-03-20 北京奇虎科技有限公司 Method and device for performing web search in browser
CN103927400A (en) * 2014-05-07 2014-07-16 重庆邮电大学 Web site product detailed information classification crawling and product information base establishing method
CN104881501A (en) * 2015-06-19 2015-09-02 四川大学 Automatic Internet information obtaining and pushing method

Similar Documents

Publication Publication Date Title
US11675969B2 (en) Dynamic native content insertion
CN107808000B (en) System and method for collecting and extracting data of dark net
US8239387B2 (en) Structural clustering and template identification for electronic documents
US7725466B2 (en) High accuracy document information-element vector encoding server
CN102831252B (en) A kind of method for upgrading index data base and device, searching method and system
Gowda et al. Clustering web pages based on structure and style similarity (application paper)
JP6203374B2 (en) Web page style address integration
US11580177B2 (en) Identifying information using referenced text
US8205153B2 (en) Information extraction combining spatial and textual layout cues
Szeredi et al. The semantic web explained: The technology and mathematics behind web 3.0
CN1909522A (en) Method for acquiring front-page keyword and its application system
CN104331438B (en) To novel web page contents selectivity abstracting method and device
CN102314494B (en) Method and equipment for processing webpage contents
CN106547749B (en) Webpage data acquisition method and device
CN103744845A (en) Method and system for WEB platform data caching
US20220292160A1 (en) Automated system and method for creating structured data objects for a media-based electronic document
CN110020068B (en) Method and device for configuring page crawling rules
CN106649322A (en) Method and device for crawling keyword category information from electronic business websites
CN108121712A (en) A kind of keyword storage method and device
US9195940B2 (en) Jabba-type override for correcting or improving output of a model
Li et al. Practical study of subclasses of regular expressions in DTD and XML schema
CN110110182A (en) A kind of collecting method and system suitable for crawling in batches
Fugazza et al. Describing geospatial assets in the Web of Data: A metadata management scenario
CN104021143A (en) Method and device for recording webpage access behavior
US9530094B2 (en) Jabba-type contextual tagger

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20170510