CN107895009A - A distributed internet data acquisition method and system - Google Patents
- Publication number: CN107895009A
- Application number: CN201711105738.4A
- Authority: CN (China)
- Prior art keywords: data, plug, unit, url, data acquisition
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
- G06F16/951—Indexing; Web crawling techniques
Abstract
The present invention is a distributed internet data acquisition method. A user's request to create a data acquisition task is received and the task is created; the task created by the user is distributed to multiple crawler threads, and those crawler threads are started. After each crawler thread receives the data acquisition task and is started, it obtains URLs from the to-be-crawled URL queue, downloads web pages from the website specified by the task, and executes the data processing plug-ins specified by the task to extract data. The data processing plug-ins specified by the task are plug-ins with different functions, selected according to the type of the specified website. The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the to-be-crawled queue. The method enables crawling large volumes of data of multiple types from complex websites.
Description
Technical field
The present invention relates to the technical field of internet data acquisition, and more particularly to a distributed internet data acquisition method and system.
Background
With the rapid development of networks, the internet has become a carrier of vast amounts of information, including public opinion, social events, policy responses, industry information, and talent market information. This information is the data foundation for big-data public opinion analysis systems and macroeconomic analysis systems. How to extract and use it efficiently has become a major challenge. The web crawler is a very important part of a data analysis system: it collects web pages and gathers information from the internet, and this page information is used to build indexes that support search and analysis. The crawler determines whether the content of the whole data analysis system is rich and whether its information is timely, so its performance directly affects the effectiveness of data analysis.
As shown in Fig. 1, a typical crawler runs roughly as follows:
(1) The scheduler (Scheduler) takes a link (URL) out of the queue of links (URLs) to be downloaded.
(2) The scheduler starts the acquisition module (Spiders).
(3) The acquisition module passes the URL to the downloader (Downloader), which downloads the resource.
(4) The target data is extracted as a target object (Item) and handed to the item pipeline (Item Pipeline) for further processing, such as storing it in a database or a file.
If what is parsed is a link (URL), the link is inserted into the to-be-crawled queue.
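The flow above can be illustrated with a minimal single-threaded sketch. It is not taken from the patent: the `crawl` function, the in-memory `PAGES` site, and the regular expressions are assumptions made only for illustration, and the downloader is a stub rather than a real HTTP client.

```python
import re
from collections import deque

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "a": '<title>A</title><a href="b">b</a><a href="c">c</a>',
    "b": '<title>B</title><a href="a">a</a>',
    "c": "<title>C</title>",
}

def fake_download(url):
    """Stub downloader: returns the stored page body, or an empty string."""
    return PAGES.get(url, "")

def crawl(seed_urls, download, max_pages=100):
    """Scheduler -> downloader -> extraction -> pipeline, in one loop."""
    todo = deque(seed_urls)      # queue of URLs to be downloaded
    seen = set(seed_urls)        # URLs already queued, to avoid loops
    items = []                   # stands in for the item pipeline
    while todo and len(items) < max_pages:
        url = todo.popleft()                 # (1) scheduler takes one URL
        html = download(url)                 # (3) downloader fetches it
        title = re.search(r"<title>(.*?)</title>", html, re.S)
        items.append({"url": url,            # (4) extracted item
                      "title": title.group(1) if title else None})
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:             # parsed links go back to the queue
                seen.add(link)
                todo.append(link)
    return items
```

Calling `crawl(["a"], fake_download)` visits the three stub pages in breadth-first order.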
At present, one internet data acquisition method is acquisition based on workflow configuration. This method defines data collection rules by simulating the process of a person clicking in a browser. For example: first click a section and its pagination to obtain the list of data to collect, then enter the specific information page, and finally extract the required fields.
This method uses a browser kernel: it simulates a browser to download web pages and execute JS code. Although it can collect all data that is visible, information that is not displayed on the page cannot be collected. Its biggest problem is that data can only be collected after the browser has parsed the JS code and rendered the data, which is slow (3 to 10 seconds per page) and consumes a lot of bandwidth. It often cannot meet demand when the volume of data to be collected is large.
Another internet data acquisition method provides a management page for configuring collection rules. The user enters the collection rules (section list, data extraction information, task execution information, etc.), the acquisition task is published locally or to a server, and the server collects data according to the configured collection rules. After acquisition completes, the user can download the data file, or the server can push the data to the user.
Configuring collection rules in this way can handle most simple websites. For more complex websites, cases such as required login, POST requests, special Header information, or data that requires a secondary AJAX request are cumbersome to handle. The method cannot handle situations that require frequent switching of dynamic IP proxies, nor data that can only be recognized after special processing, such as data parsed from images or from audio.
Summary of the invention
In view of the deficiencies of the above methods, the present invention proposes a template-based, plug-in-based, distributed internet data acquisition method. According to the specific circumstances of each website, the method flexibly configures a suitable acquisition module and suitable acquisition plug-ins, so that a single unified method can collect data from the wide variety of websites on the internet.
The present invention provides a template-based, plug-in-based, distributed internet data acquisition method and system. The system mainly comprises acquisition module customization, acquisition plug-in customization, and distributed data acquisition. Acquisition module customization covers task information and rule information. Acquisition plug-in customization provides customized plug-ins for site-specific collection, such as a data download plug-in, a proxy IP selection plug-in, a document parsing plug-in, an image data parsing plug-in, and an audio parsing plug-in. Distributed data acquisition coordinates crawler resources within an acquisition cluster and starts a certain number of crawlers, as required by the acquisition task, to crawl data.
The present invention is a distributed internet data acquisition method, comprising: receiving a user's request to create a data acquisition task and creating the task, distributing the task created by the user to multiple crawler threads, and starting the multiple crawler threads;
after each crawler thread receives the data acquisition task and is started, obtaining URLs from the to-be-crawled URL queue, downloading web pages from the website specified by the task, and executing the data processing plug-ins specified by the task to extract data, wherein the data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website;
storing the extracted target data in a specified database for subsequent processing, and pushing the extracted URLs to be crawled to the to-be-crawled queue.
Preferably, before an extracted URL to be crawled is pushed to the to-be-crawled queue, a deduplication check is performed on the URL to determine whether data has already been collected from it. If so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.
Preferably, the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
the login plug-in is used for websites that require login before data can be extracted;
the data parsing plug-in extracts data according to the extraction expression specified for each data field;
the metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page;
the paging parsing plug-in is used for websites where the paging information of a web page must be obtained;
the URL parsing plug-in is used to extract URLs.
Preferably, the data processing plug-ins further include an image recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
Preferably, the information to be configured for the login plug-in includes a user list, the IDs of the html elements in the website login page into which the login user name and password are filled, and the ID of the html element for the verification code; the user list contains login user names and passwords.
The present invention is a distributed internet data acquisition system, comprising:
a control center, which receives a user's request to create a data acquisition task and creates the task, distributes the task created by the user to multiple crawler threads, and starts the multiple crawler threads;
an acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains URLs from the to-be-crawled URL queue, downloads web pages from the website specified by the task, executes the data processing plug-ins specified by the task to extract data, and parses the extracted data, wherein the data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website;
a data center, for storing the extracted target data in a specified database for subsequent processing;
a URL push center, for pushing the extracted URLs to be crawled to the to-be-crawled queue.
Preferably, the system further includes a deduplication center which, before an extracted URL to be crawled is pushed to the to-be-crawled queue, performs a deduplication check on the URL to determine whether data has already been collected from it. If so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.
Preferably, the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
the login plug-in is used for websites that require login before data can be extracted;
the data parsing plug-in extracts data according to the extraction expression specified for each data field;
the metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page;
the paging parsing plug-in is used for websites where the paging information of a web page must be obtained;
the URL parsing plug-in is used to extract URLs.
Preferably, the data processing plug-ins further include an image recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
Preferably, the information to be configured for the login plug-in includes a user list, the IDs of the html elements in the website login page into which the login user name and password are filled, and the ID of the html element for the verification code; the user list contains login user names and passwords.
The beneficial technical effects of the present invention are: (1) The distributed internet data acquisition method enables crawling large volumes of data from complex websites. (2) Based on the data processing plug-in mechanism, plug-ins with the corresponding functions are selected from the plug-in list according to the needs of different types of websites, which enables crawling large volumes of data of multiple types from complex websites. (3) By providing a deduplication center, the invention deduplicates the URLs produced by the distributed crawlers and avoids repeated crawling. (4) The invention proposes a message-queue-based method for coordinating the to-be-crawled URL queue produced by the distributed crawlers.
Brief description of the drawings
Fig. 1: Schematic diagram of the operating flow of a prior-art crawler;
Fig. 2: Schematic diagram of the distributed internet data acquisition method of Embodiment 1;
Fig. 3: Schematic diagram of the distributed internet data acquisition system of Embodiment 2;
Fig. 4: Schematic diagram of acquisition task management in Embodiment 3;
Fig. 5: Acquisition task configuration page of Embodiment 3;
Fig. 6: URL rule configuration page of Embodiment 3;
Fig. 7: Page of Embodiment 3 for specifying the collection list of the website;
Fig. 8: Page of Embodiment 3 for specifying the scope of URL link extraction;
Fig. 9: Page of Embodiment 3 for specifying the metadata extraction expressions;
Fig. 10: Page of Embodiment 3 for specifying the expressions of the detailed data extraction fields;
Fig. 11: Page of Embodiment 3 for specifying the acquisition servers that collect the website;
Fig. 12: Source file management page of Embodiment 3;
Fig. 13: The control center of Embodiment 3 controlling the acquisition center: start, stop, view data, view seeds, and monitor task running status;
Fig. 14: Specific data collected by the task of Embodiment 3;
Fig. 15: Section information to be collected by the task of Embodiment 3.
Detailed description
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention and are not intended to limit its scope.
Embodiment 1
As shown in Fig. 2, the present invention is a distributed internet data acquisition method comprising the following steps:
S101: Receive a user's request to create a data acquisition task and create the task, distribute the task created by the user to multiple crawler threads, and start the multiple crawler threads.
First, the user's request to create a data acquisition task is received and data acquisition tasks are managed. Management includes creating, editing, and deleting acquisition tasks, and specifying the task name, the URL deduplication rule, the interval between two consecutive URL crawls, the number of worker threads, the number of retries, and so on. Acquisition tasks are managed by category in a tree structure.
The acquisition task is defined according to the user's instructions. The definition specifies the list of sections to collect, the IP proxy usage strategy, the web page download mode (both browser-kernel-based download and HttpClient-based download are supported), the data processing plug-ins, the URL extraction scope, the regular expression for list-page URLs, the regular expression for content-page URLs, the metadata information, the XPATH/CSS/REGEX expressions for extracting detailed data items, and so on.
After the task is created, it is distributed to the acquisition crawlers, which are started and begin collecting data. When the data acquisition task ends, data collection stops. Meanwhile, the running status of the crawlers, the CPU, network, memory, and disk status of the acquisition servers, and the data collection status are monitored. Early warnings and alarms are raised on whether the collection rules can still collect data and whether the crawlers are running normally.
S102: After each crawler thread receives the data acquisition task and is started, it obtains URLs from the to-be-crawled URL queue, downloads web pages from the website specified by the task, and executes the data processing plug-ins specified by the task to extract data. The data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website.
After the acquisition task is created, the control center publishes it to the acquisition center, which extracts data according to the task sent by the control center. Data collection by the acquisition crawlers is the important part of the distributed internet data acquisition method in which data download and data extraction take place. During distributed internet data acquisition, tens to hundreds of crawler threads form a cluster, and one acquisition task is processed cooperatively by one or more crawler threads. When an acquisition task starts, the crawler threads participating in it receive the start instruction at the same time. Each crawler actively obtains URLs to be crawled from the to-be-crawled queue, downloads web pages from the specified website, and executes all data processing plug-ins specified by the task, such as the login plug-in, data parsing plug-in, metadata parsing plug-in, paging parsing plug-in, and URL parsing plug-in; special plug-ins include the image recognition plug-in, voice recognition plug-in, and QR code recognition plug-in. The invention provides a plug-in list that contains these data processing plug-ins. When collecting data, the acquisition crawler selects from the plug-in list the plug-ins whose functions match the needs of the type of website, which enables crawling large volumes of data of multiple types from complex websites.
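One possible way to organize such a plug-in list is a registry keyed by plug-in name, from which each task picks the plug-ins it needs. The sketch below is an assumption for illustration: the registry, the decorator, and the three toy plug-ins are hypothetical and much simpler than real login or parsing plug-ins.

```python
import re

PLUGIN_REGISTRY = {}   # hypothetical plug-in list: name -> callable

def register(name):
    """Decorator that adds a plug-in function to the plug-in list."""
    def decorator(func):
        PLUGIN_REGISTRY[name] = func
        return func
    return decorator

@register("login")
def login_plugin(context):
    context["logged_in"] = True        # a real plug-in would submit credentials
    return context

@register("data")
def data_plugin(context):
    match = re.search(r"<title>(.*?)</title>", context["html"], re.S)
    context["fields"] = {"title": match.group(1) if match else None}
    return context

@register("url")
def url_plugin(context):
    context["links"] = re.findall(r'href="([^"]+)"', context["html"])
    return context

def run_plugins(plugin_names, html):
    """Run only the plug-ins that the task specifies, in order."""
    context = {"html": html}
    for name in plugin_names:
        context = PLUGIN_REGISTRY[name](context)
    return context
```

A task for a site that needs no login would simply pass `["data", "url"]`, while a task for a protected site would put `"login"` first.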
The login plug-in is used for websites that require login before data can be extracted. For such websites the system provides a uniformly configured login plug-in. The information to be configured for the plug-in includes a user list (containing login user names and passwords), the IDs of the html elements in the website login page into which the login user name and password are filled, and the ID of the html element for the verification code. The system selects a login user name and password from the user list and attempts to log in to the website. If a verification code appears, the system sends the verification code image to a decoding server to obtain the code, fills it into the login page, and performs a simulated login. After a successful login the cookie information is saved so that data collection can continue. Each thread in the acquisition system logs in with one user's information, so multiple accounts can log in and collect at the same time. The login plug-in supports switching accounts by duration, for example logging in with a different account every half hour, and also supports switching accounts after a certain number of web pages have been collected.
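The two account-switching policies, by elapsed time and by number of pages collected, can be sketched as follows. The class `AccountRotator` and its parameters are hypothetical, and the sketch only tracks which account is current; an actual login plug-in would also log in again and store the new cookies at each switch.

```python
import itertools
import time

class AccountRotator:
    """Hypothetical helper that cycles through (user, password) pairs."""

    def __init__(self, accounts, max_seconds=1800, max_pages=None,
                 clock=time.monotonic):
        self._cycle = itertools.cycle(accounts)
        self.max_seconds = max_seconds   # switch after this many seconds
        self.max_pages = max_pages       # or after this many pages, if set
        self._clock = clock              # injectable for testing
        self._switch()

    def _switch(self):
        self.current = next(self._cycle)
        self._since = self._clock()
        self._pages = 0

    def account_for_next_page(self):
        """Return the account to use, switching first if a limit is reached."""
        expired = self._clock() - self._since >= self.max_seconds
        exhausted = self.max_pages is not None and self._pages >= self.max_pages
        if expired or exhausted:
            self._switch()               # a real crawler would log in again here
        self._pages += 1
        return self.current
```

With `max_seconds=1800` the rotator matches the half-hour example in the text; setting `max_pages` adds the page-count policy on top.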
The data parsing plug-in extracts data according to the extraction expression specified for each data field. The main function of this plug-in is data extraction: each data field specifies an extraction expression, and the supported expression types are XPath expressions, CSS expressions, and regular expressions. The system extracts data from the html source of the web page according to these expressions.
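Per-field extraction can be sketched with the Python standard library alone. This sketch makes several assumptions: the function name and rule format are hypothetical, XPath support is limited to the subset implemented by `xml.etree.ElementTree`, the page must be well-formed XML for the XPath branch to work, and CSS expressions are left out because they would need a third-party parser.

```python
import re
import xml.etree.ElementTree as ET

def extract_fields(html, rules):
    """rules maps a field name to (kind, expression); kind is 'regex' or 'xpath'."""
    root = None
    record = {}
    for field_name, (kind, expression) in rules.items():
        if kind == "regex":
            match = re.search(expression, html, re.S)
            record[field_name] = match.group(1) if match else None
        elif kind == "xpath":
            if root is None:
                root = ET.fromstring(html)   # requires well-formed markup
            record[field_name] = root.findtext(expression)
        else:
            raise ValueError(f"unsupported expression type: {kind}")
    return record

# Hypothetical page and rules, for illustration only.
SAMPLE_HTML = ('<html><body><h1 id="title">Tender notice</h1>'
               '<span class="date">2017-11-10</span></body></html>')
SAMPLE_RULES = {
    "title": ("xpath", ".//h1[@id='title']"),
    "date": ("regex", r'class="date">([\d-]+)<'),
}
```

Real-world html is rarely well-formed, so a production plug-in would more likely use a tolerant html parser for the XPath and CSS cases.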
The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page. This plug-in mainly addresses staged data extraction: part of the information (called metadata) is extracted on the list page, and the metadata is saved to the to-be-crawled queue together with the URL of the detail page. Later, when the detail page is collected, this part of the data is taken from the metadata instead of being extracted from the detail page. A typical example is collecting e-commerce data: the price and sales volume of an item are hard to obtain on the item detail page but easy to obtain in the item list. The price and sales volume (the metadata) are therefore collected first while the item list is collected, and when the item detail page is collected they are obtained from the metadata.
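The staged extraction can be sketched by letting each queue entry carry its list-page metadata alongside the detail-page URL. The function names and the dictionary layout below are assumptions for illustration, and the list items are taken as already extracted.

```python
from collections import deque

def enqueue_list_items(todo, list_items):
    """Queue each detail URL together with the metadata found on the list page."""
    for item in list_items:
        meta = {k: v for k, v in item.items() if k != "detail_url"}
        todo.append({"url": item["detail_url"], "meta": meta})

def build_record(entry, detail_fields):
    """Merge list-page metadata with fields extracted from the detail page."""
    record = dict(entry["meta"])     # e.g. price and sales volume
    record.update(detail_fields)     # e.g. title and description
    record["url"] = entry["url"]
    return record
```

For example, a list row with a price and sales volume is queued once, and the final record combines those values with whatever the detail page yields.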
The paging parsing plug-in is used for websites where the paging information of a web page must be obtained. In the Web 2.0 era, paging information such as next page, last page, and current page number often cannot be obtained directly from the page source. Instead it is fetched from the website's server via AJAX and generated with javascript. The system provides a customization interface through which data collection staff generate the paging URLs according to the paging javascript logic of the website. The URL parsing plug-in is used to extract URLs.
S103: Store the extracted target data in a specified database for subsequent processing, and push the extracted URLs to be crawled to the to-be-crawled queue.
The extracted data is stored in the specified database after deduplication, denoising, and labeling, for subsequent processing. The database uses a Hadoop distributed cluster to store the basic data; frequently used data is stored in an ES index library for fast retrieval.
The extracted URLs to be crawled are pushed to the to-be-crawled queue, and after a crawler finishes processing its data it obtains further URLs from that queue and continues collecting. The to-be-crawled queue is what supports distributed acquisition: each acquisition crawler obtains the URLs it will crawl from the queue, and the URL lists that the crawlers parse out are all sent to the queue for unified scheduling. The data acquisition logs and the health status of the acquisition crawlers are reported to the control center in real time.
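The shared-queue arrangement can be sketched with threads inside one process. This is a simplification and an assumption: the patent describes a distributed system coordinated through a message queue, whereas the sketch uses Python's in-process `queue.Queue`, and the names `run_crawlers` and `extract_links` are hypothetical.

```python
import queue
import threading

def run_crawlers(seed_urls, extract_links, worker_count=4):
    """Several crawler threads pull from, and push to, one shared URL queue."""
    todo = queue.Queue()
    seen = set(seed_urls)
    lock = threading.Lock()          # protects 'seen' and 'crawled'
    crawled = []
    for url in seed_urls:
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            if url is None:          # sentinel: no more work
                todo.task_done()
                return
            for link in extract_links(url):
                with lock:
                    if link in seen:
                        continue     # already queued by some crawler
                    seen.add(link)
                todo.put(link)       # newly parsed URL goes back to the queue
            with lock:
                crawled.append(url)
            todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    todo.join()                      # wait until every queued URL is processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return crawled

# Hypothetical link graph standing in for downloaded pages.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
```

The order in which URLs are crawled depends on thread scheduling, so only the set of crawled URLs is deterministic.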
S104: Before an extracted URL to be crawled is pushed to the to-be-crawled queue, perform a deduplication check on the URL to determine whether data has already been collected from it. If so, discard the URL; if not, push the URL to the to-be-crawled queue.
The same URL link of a website appears on multiple pages. The system deduplicates using a UUID generated as the MD5 of the URL address plus a key (the key can be defined according to business needs), which ensures that each URL address is collected only once. There are two modes of URL deduplication. Mode one deduplicates through a MongoDB database: each UUID is written to the database, and a failed write means the record already exists, while a successful write means it did not. The advantage of this mode is that it supports crawlers without a quantity limit; the drawback is that it consumes disk I/O. Mode two deduplicates with a Bloom filter. A Bloom filter is a highly space-efficient probabilistic data structure that represents a set compactly with a bit array and can test whether an element belongs to the set. This efficiency has a cost: when testing whether an element belongs to a set, the filter may wrongly report that an element outside the set belongs to it (a false positive). Bloom filters are therefore unsuitable for "zero error" applications. In applications that can tolerate a low error rate, a Bloom filter trades a very small number of errors for a large saving in storage space.
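The fingerprint and the Bloom filter mode can be sketched as follows. The sketch is not the patent's implementation: the bit-array size, the number of hash functions, and the use of salted SHA-256 digests to derive bit positions are all assumptions, and the function and class names are hypothetical.

```python
import hashlib

def url_fingerprint(url, key=""):
    """MD5 of the URL address plus a business-defined key."""
    return hashlib.md5((url + key).encode("utf-8")).hexdigest()

class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 20, hash_count=5):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def should_crawl(url, seen, key=""):
    """Return True the first time a URL is offered, False afterwards."""
    fingerprint = url_fingerprint(url, key)
    if fingerprint in seen:
        return False                 # duplicate (or, rarely, a false positive)
    seen.add(fingerprint)
    return True
```

Because `should_crawl` only needs `in` and `add`, a plain `set` or a database-backed store can be substituted for the filter, which mirrors the choice between the two modes described above.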
Embodiment 2:
The present invention is a distributed internet data acquisition system, comprising:
A control center, which receives a user's request to create a data acquisition task and creates the task, distributes the task created by the user to multiple crawler threads, and starts the multiple crawler threads;
An acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains URLs from the to-be-crawled URL queue, downloads web pages from the website specified by the task, executes the data processing plug-ins specified by the task to extract data, and parses the extracted data. The data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website.
A data center, for storing the extracted target data in a specified database for subsequent processing. The database uses a Hadoop distributed cluster to store the basic data; frequently used data is stored in an ES index library for fast retrieval. The acquisition system collects data and sends it to the specified database (a Kafka message queue), and a dedicated storage program in the data center is responsible for the warehousing, storage, and query operations.
A URL push center, for pushing the extracted URLs to be crawled to the to-be-crawled queue.
A deduplication center which, before an extracted URL to be crawled is pushed to the to-be-crawled queue, performs a deduplication check on the URL to determine whether data has already been collected from it. If so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue. The deduplication center of the present invention is self-managing: each acquisition task corresponds to one deduplication queue, which is created automatically when the acquisition task is created and deleted automatically when the acquisition task is deleted.
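The per-task lifecycle of the deduplication queues can be sketched with an in-memory store. The class `DedupCenter` and its method names are hypothetical, and a plain `set` per task stands in for whatever backing store (database or Bloom filter) a deployment would use.

```python
class DedupCenter:
    """Hypothetical deduplication center with one store per acquisition task."""

    def __init__(self):
        self._seen_by_task = {}

    def create_task(self, task_id):
        """Called when an acquisition task is created."""
        self._seen_by_task[task_id] = set()

    def delete_task(self, task_id):
        """Called when an acquisition task is deleted; frees its store."""
        self._seen_by_task.pop(task_id, None)

    def is_new(self, task_id, url):
        """Return True only the first time a URL is seen for this task."""
        seen = self._seen_by_task[task_id]
        if url in seen:
            return False
        seen.add(url)
        return True
```

Keeping the stores separate means the same URL can be collected by two different tasks, while each task still collects it only once.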
The data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, a URL parsing plug-in, an image recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
The login plug-in is used for websites that require login before data can be extracted. The information to be configured for the login plug-in includes a user list, the IDs of the html elements in the website login page into which the login user name and password are filled, and the ID of the html element for the verification code; the user list contains login user names and passwords. The data parsing plug-in extracts data according to the extraction expression specified for each data field. The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page. The paging parsing plug-in is used for websites where the paging information of a web page must be obtained. The URL parsing plug-in is used to extract URLs.
Embodiment 3:
This embodiment takes collecting bidding data from the "Shandong Provincial Government website" as an example. After the control center receives the user's instruction, it starts creating a data acquisition task. Because hundreds of acquisition tasks may run in the computer system, acquisition tasks are managed by category in a tree structure for ease of management. As shown in Fig. 4, the left side is the category tree of acquisition tasks and the right side is the list of acquisition tasks under the selected category. The specific configuration of an acquisition task can be set on the page or imported directly from an xml configuration file. The task name list shown in Fig. 4 contains "Shandong Provincial Government Procurement Network", which is the name specified for the acquisition task; that is, the website specified by this acquisition task is the Shandong government procurement network.
Fig. 5 shows the operation rule configuration page for a data acquisition task. The task is configured on this page, chiefly its task ID, task name, number of threads, request timeout, number of retries, proxy pool, and data type. The data type is the category of the collected data, including news, forum, real estate, e-commerce, and employment, among others.
Fig. 6 shows the URL rule configuration page. As shown in Fig. 6, the task specifies the URL regular expressions that identify content pages, the URL regular expressions for list pages, and the URL blacklist regular expressions. These URL rules can also be supplied by a java class that implements a specified interface, in which case the java class generates the specific regular expressions.
The parser list contains the data processing plug-ins specified for the task. The distributed internet data acquisition system provides multiple data processing plug-ins with different functions, including the login plug-in, data parsing plug-in, metadata plug-in, pagination parsing plug-in, URL parsing plug-in, image recognition plug-in, voice recognition plug-in, and QR code recognition plug-in, so that plug-ins with different functions can be used according to the type of website being collected. During data extraction, the collection center executes the specified data processing plug-ins according to the acquisition task sent by the control center and extracts the data.
In this embodiment, based on the characteristics of the Shandong Government Procurement Network and of the data to be collected, the data processing plug-ins specified for this acquisition task are: LoginParser, the login plug-in; DataParser, the data parsing plug-in; MetaParser, the metadata parsing plug-in; and LinkParser, the URL parsing plug-in. When a crawler performs data extraction, it executes these specified data processing plug-ins to extract the data.
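One way to picture "executing the specified plug-ins" is a registry that maps the names in the task configuration to callables. The sketch below is an assumption about structure, not the patent's code: the `register` decorator, the `REGISTRY` dictionary, and the plug-in signature are hypothetical, and only stand-ins for DataParser and LinkParser are shown (LoginParser and MetaParser are omitted).

```python
import re
from typing import Callable, Dict, List

# A plug-in receives the downloaded page and the record built so far,
# and returns the updated record.
Plugin = Callable[[str, dict], dict]
REGISTRY: Dict[str, Plugin] = {}


def register(name: str) -> Callable[[Plugin], Plugin]:
    """Register a plug-in under the name used in the task configuration."""
    def decorator(func: Plugin) -> Plugin:
        REGISTRY[name] = func
        return func
    return decorator


@register("DataParser")
def data_parser(page: str, record: dict) -> dict:
    # Stand-in extraction: take the <title> text as the "title" field.
    match = re.search(r"<title>(.*?)</title>", page, re.S)
    record["title"] = match.group(1).strip() if match else None
    return record


@register("LinkParser")
def link_parser(page: str, record: dict) -> dict:
    # Stand-in URL extraction: collect every double-quoted href value.
    record["links"] = re.findall(r'href="([^"]+)"', page)
    return record


def run_plugins(names: List[str], page: str) -> dict:
    """Run the plug-ins named by the task, in order, on one page."""
    record: dict = {}
    for name in names:
        record = REGISTRY[name](page, record)
    return record
```

Because the task only stores plug-in names, a different type of website can be handled by listing a different set of names, without changing the crawler loop.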
When creating an acquisition task, the control center also specifies the URL seeds, the URL extraction scope, the metadata extraction scope, the detailed data extraction, the servers, the source file, and so on. As shown in Fig. 7, the collection list for the website is specified. As shown in Fig. 8, the scope for extracting URL links is specified. As shown in Fig. 9, the metadata extraction expressions are specified; in this embodiment the metadata extracted are the title, the publication time, and the detail page URL. As shown in Fig. 10, the expressions for the detailed data extraction fields are specified; in this embodiment, extraction expressions are specified for the content, text category, subcategory, province, and picture address. As shown in Fig. 11, the acquisition servers that collect the website are specified; there may be one or several. Each crawler server runs multiple crawler threads, which realizes distributed data extraction. As shown in Fig. 12, an xml acquisition task configuration file can be uploaded as the source file; the system then parses the configuration from the xml document, so no configuration on the page is needed, which is quicker and more convenient for experienced data acquisition personnel.
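Importing a task from an xml file could look roughly like the following sketch. The source does not give the schema, so every element and attribute name here (`task`, `name`, `threads`, `timeout`, `retries`, `seeds/url`, `plugins/plugin`) is an assumption, as are the default values.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class TaskConfig:
    task_id: str
    name: str
    threads: int
    timeout_seconds: int
    retries: int
    seeds: List[str]
    plugins: List[str]


def load_task(xml_text: str) -> TaskConfig:
    """Parse a task definition; element names are hypothetical."""
    root = ET.fromstring(xml_text.strip())
    return TaskConfig(
        task_id=root.get("id", ""),
        name=root.findtext("name", default=""),
        threads=int(root.findtext("threads", default="1")),
        timeout_seconds=int(root.findtext("timeout", default="30")),
        retries=int(root.findtext("retries", default="0")),
        seeds=[e.text.strip() for e in root.findall("seeds/url") if e.text],
        plugins=[e.text.strip() for e in root.findall("plugins/plugin") if e.text],
    )


EXAMPLE = """
<task id="t-001">
  <name>Shandong government procurement</name>
  <threads>4</threads>
  <timeout>20</timeout>
  <retries>3</retries>
  <seeds><url>http://example.com/list/index.html</url></seeds>
  <plugins>
    <plugin>LoginParser</plugin>
    <plugin>DataParser</plugin>
    <plugin>MetaParser</plugin>
    <plugin>LinkParser</plugin>
  </plugins>
</task>
"""
```

Calling `load_task(EXAMPLE)` returns a `TaskConfig` carrying the same settings that would otherwise be entered on the configuration pages.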
After the acquisition task is created, the control center publishes it to the collection center, which extracts data according to the acquisition task sent by the control center. As shown in Fig. 13, the control center controls starting and stopping the collection center, viewing data, viewing seeds, and monitoring the running status of the task. After the collection center is started, the system notifies the selected acquisition servers via socket to take part in data acquisition for the task. After it is stopped, the system notifies the selected acquisition servers via socket to stop data acquisition for the task. As shown in Fig. 14, the control center can view the specific data collected by the task. As shown in Fig. 15, the information that the task is to collect, i.e. the URL links, can be viewed.
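The source states only that start and stop notifications travel "via socket"; it does not define a message format. As a hedged illustration, the sketch below assumes one JSON object per line, with hypothetical fields `action` and `task_id`, and uses a local `socket.socketpair()` in place of a real network connection.

```python
import json
import socket


def send_command(conn: socket.socket, action: str, task_id: str) -> None:
    """Control-center side: send one newline-terminated JSON command."""
    message = json.dumps({"action": action, "task_id": task_id}) + "\n"
    conn.sendall(message.encode("utf-8"))


class CommandReader:
    """Acquisition-server side: read commands one line at a time."""

    def __init__(self, conn: socket.socket) -> None:
        self._lines = conn.makefile("r", encoding="utf-8")

    def next_command(self) -> dict:
        line = self._lines.readline()
        if not line:
            raise ConnectionError("connection closed by the control center")
        return json.loads(line)


def demo() -> list:
    # A connected pair of sockets stands in for control center and server.
    control, server = socket.socketpair()
    try:
        reader = CommandReader(server)
        send_command(control, "start", "t-001")
        send_command(control, "stop", "t-001")
        return [reader.next_command(), reader.next_command()]
    finally:
        control.close()
        server.close()
```

Framing each command as one line keeps the reader simple: it never has to guess where one notification ends and the next begins.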
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. A distributed internet data acquisition method, characterized by:
receiving a user's request to create a data acquisition task and creating the data acquisition task, distributing the data acquisition task created by the user to a plurality of crawler threads, and starting the plurality of crawler threads;
after each crawler thread has received the data acquisition task and been started, obtaining a URL from the queue of URLs to be crawled, downloading the web page from the website specified by the data acquisition task, and executing the data processing plug-ins specified by the data acquisition task to extract data; wherein the data processing plug-ins specified by the data acquisition task are data processing plug-ins with different functions selected according to the type of the specified website;
storing the extracted target data in a specified database for subsequent processing, and pushing the extracted URLs to be crawled to the queue to be crawled.
2. The distributed internet data acquisition method according to claim 1, characterized in that: before an extracted URL to be crawled is pushed to the queue to be crawled, a deduplication check is performed on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the queue to be crawled.
3. The distributed internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a pagination parsing plug-in, and a URL parsing plug-in;
the login plug-in is used for websites that require a login before data can be extracted;
the data parsing plug-in extracts data according to the extraction expression specified for each data field;
the metadata plug-in is used for websites where data must be extracted from both the list page and the detail page;
the pagination parsing plug-in is used for websites whose pagination information must be obtained;
the URL parsing plug-in extracts URLs.
4. The distributed internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins further include an image recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
5. The distributed internet data acquisition method according to claim 3, characterized in that: the information to be configured for the login plug-in includes a user list, the IDs of the HTML elements on the website's login page into which the login user name and password are entered, and the ID of the HTML element for the verification code; the user list includes login user names and passwords.
6. A distributed internet data acquisition system, characterized by comprising:
a control center, which receives a user's request to create a data acquisition task and creates the data acquisition task, distributes the data acquisition task created by the user to a plurality of crawler threads, and starts the plurality of crawler threads;
a collection center, which, after each crawler thread has received the data acquisition task and been started, obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the data acquisition task, executes the data processing plug-ins specified by the data acquisition task to extract data, and parses the extracted data; wherein the data processing plug-ins specified by the data acquisition task are data processing plug-ins with different functions selected according to the type of the specified website;
a data center, which stores the extracted target data in a specified database for subsequent processing;
a URL push center, which pushes the extracted URLs to be crawled to the queue to be crawled.
7. The distributed internet data acquisition system according to claim 6, characterized by further comprising a deduplication center, which, before an extracted URL to be crawled is pushed to the queue to be crawled, performs a deduplication check on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the queue to be crawled.
8. The distributed internet data acquisition system according to claim 6, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a pagination parsing plug-in, and a URL parsing plug-in;
the login plug-in is used for websites that require a login before data can be extracted;
the data parsing plug-in extracts data according to the extraction expression specified for each data field;
the metadata plug-in is used for websites where data must be extracted from both the list page and the detail page;
the pagination parsing plug-in is used for websites whose pagination information must be obtained;
the URL parsing plug-in extracts URLs.
9. The distributed internet data acquisition system according to claim 6, characterized in that: the data processing plug-ins further include an image recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
10. The distributed internet data acquisition system according to claim 8, characterized in that: the information to be configured for the login plug-in includes a user list, the IDs of the HTML elements on the website's login page into which the login user name and password are entered, and the ID of the HTML element for the verification code; the user list includes login user names and passwords.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711105738.4A CN107895009B (en) | 2017-11-10 | 2017-11-10 | Distributed internet data acquisition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711105738.4A CN107895009B (en) | 2017-11-10 | 2017-11-10 | Distributed internet data acquisition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107895009A (en) | 2018-04-10 |
CN107895009B (en) | 2021-09-03 |
Family
ID=61804941
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711105738.4A Active CN107895009B (en) | 2017-11-10 | 2017-11-10 | Distributed internet data acquisition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107895009B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102685220A (en) * | 2012-04-28 | 2012-09-19 | 苏州阔地网络科技有限公司 | Method and system for data interaction based on WEB page |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
US20170270206A1 (en) * | 2014-11-05 | 2017-09-21 | Facebook, Inc. | Social-Based Optimization of Web Crawling for Online Social Networks |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109088908A (en) * | 2018-06-06 | 2018-12-25 | 武汉酷犬数据科技有限公司 | A kind of the distributed general collecting method and system of network-oriented |
CN109285046A (en) * | 2018-08-10 | 2019-01-29 | 浙江工业大学 | A kind of electric business big data acquisition system based on business plug-in unit |
CN109145233A (en) * | 2018-08-27 | 2019-01-04 | 山东浪潮商用系统有限公司 | internet information acquisition system |
CN109445870A (en) * | 2018-09-28 | 2019-03-08 | 浙江乾冠信息安全研究院有限公司 | A kind of data processing method, electronic equipment and storage medium |
CN109542903A (en) * | 2018-11-14 | 2019-03-29 | 珠海格力智能装备有限公司 | Data processing method and device |
CN110334139A (en) * | 2018-12-18 | 2019-10-15 | 济南百航信息技术有限公司 | A method of third party system data are docked by simulated operation |
CN109753421B (en) * | 2018-12-29 | 2022-06-10 | 广东益萃网络科技有限公司 | Service system optimization method and device, computer equipment and storage medium |
CN109753421A (en) * | 2018-12-29 | 2019-05-14 | 广东益萃网络科技有限公司 | Optimization method, device, computer equipment and the storage medium of service system |
CN109783714A (en) * | 2019-01-08 | 2019-05-21 | 上海因致信息科技有限公司 | Interface data acquisition methods and system |
CN109948079A (en) * | 2019-03-11 | 2019-06-28 | 湖南衍金征信数据服务有限公司 | A kind of method that distributed capture discloses page data |
CN109933706A (en) * | 2019-03-29 | 2019-06-25 | 北京达佳互联信息技术有限公司 | A kind of data capture method, device, electronic equipment and storage medium |
CN110147475A (en) * | 2019-03-29 | 2019-08-20 | 汇通达网络股份有限公司 | A kind of network data acquisition system of distributed deployment |
CN110147475B (en) * | 2019-03-29 | 2023-07-21 | 汇通达网络股份有限公司 | Distributed deployment network data acquisition system |
CN110147362A (en) * | 2019-04-04 | 2019-08-20 | 中电科大数据研究院有限公司 | One kind is based on the acquisition of event driven DOC DATA and processing system and its method |
CN110175862A (en) * | 2019-04-16 | 2019-08-27 | 苏宁易购集团股份有限公司 | Commodity price calculation method and system are sold for electric business platform |
CN110188258A (en) * | 2019-04-19 | 2019-08-30 | 平安科技(深圳)有限公司 | The method and device of external data is obtained using crawler |
CN110134854A (en) * | 2019-05-28 | 2019-08-16 | 江苏快页信息技术有限公司 | A kind of crawler acquisition method based on user's incentive mechanism |
CN110347899A (en) * | 2019-07-04 | 2019-10-18 | 北京熵简科技有限公司 | Distributed interconnection data collection system and method based on event-based model |
CN110347899B (en) * | 2019-07-04 | 2021-06-22 | 北京熵简科技有限公司 | Distributed internet data acquisition system and method based on event-driven model |
CN110457312A (en) * | 2019-07-05 | 2019-11-15 | 中国平安财产保险股份有限公司 | Acquisition method, device, equipment and the readable storage medium storing program for executing of diversiform data |
CN110457312B (en) * | 2019-07-05 | 2023-07-07 | 中国平安财产保险股份有限公司 | Method, device, equipment and readable storage medium for collecting multi-type data |
CN110365810A (en) * | 2019-07-23 | 2019-10-22 | 中南民族大学 | Domain name caching method, device, equipment and storage medium based on web crawlers |
CN110737647A (en) * | 2019-08-20 | 2020-01-31 | 广州宏数科技有限公司 | Internet big data cleaning method |
CN110737647B (en) * | 2019-08-20 | 2023-07-25 | 广州宏数科技有限公司 | Internet big data cleaning method |
CN110737813A (en) * | 2019-09-26 | 2020-01-31 | 苏州浪潮智能科技有限公司 | method, equipment and medium for improving efficiency of reptile |
CN110737813B (en) * | 2019-09-26 | 2022-07-29 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for improving efficiency of reptiles |
WO2021088350A1 (en) * | 2019-11-07 | 2021-05-14 | 南京莱斯网信技术研究院有限公司 | Script-based web service paging data acquisition system |
CN111428107B (en) * | 2020-03-23 | 2023-09-01 | 新华智云科技有限公司 | Multi-center comprehensive web crawler system |
CN111428107A (en) * | 2020-03-23 | 2020-07-17 | 新华智云科技有限公司 | Multi-center comprehensive web crawler system |
CN111488508A (en) * | 2020-04-10 | 2020-08-04 | 长春博立电子科技有限公司 | Internet information acquisition system and method supporting multi-protocol distributed high concurrency |
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111597421B (en) * | 2020-04-30 | 2022-08-30 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111680203A (en) * | 2020-05-07 | 2020-09-18 | 支付宝(杭州)信息技术有限公司 | Data acquisition method and device and electronic equipment |
CN111680203B (en) * | 2020-05-07 | 2023-04-18 | 支付宝(杭州)信息技术有限公司 | Data acquisition method and device and electronic equipment |
CN111722980A (en) * | 2020-06-11 | 2020-09-29 | 咪咕文化科技有限公司 | Data acquisition system and method |
CN111722980B (en) * | 2020-06-11 | 2023-10-20 | 咪咕文化科技有限公司 | Data acquisition system and method |
CN111881335A (en) * | 2020-07-28 | 2020-11-03 | 芯薇(上海)智能科技有限公司 | Crawler technology-based multitasking system and method |
CN111953766A (en) * | 2020-08-07 | 2020-11-17 | 福建省天奕网络科技有限公司 | Method and system for collecting network data |
CN112100061A (en) * | 2020-08-28 | 2020-12-18 | 广州探迹科技有限公司 | Visual crawler code compiling and debugging method |
CN111949852A (en) * | 2020-08-31 | 2020-11-17 | 东华理工大学 | Macroscopic economy analysis method and system based on internet big data |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN112579862B (en) * | 2020-12-22 | 2022-06-14 | 福建江夏学院 | Xpath automatic extraction method based on MD5 value comparison |
CN112579862A (en) * | 2020-12-22 | 2021-03-30 | 福建江夏学院 | Xpath automatic extraction method based on MD5 value comparison |
CN112650570A (en) * | 2020-12-29 | 2021-04-13 | 百果园技术(新加坡)有限公司 | Dynamically expandable distributed crawler system, data processing method and device |
CN112597373A (en) * | 2020-12-29 | 2021-04-02 | 科技谷(厦门)信息技术有限公司 | Data acquisition method based on distributed crawler engine |
CN112597373B (en) * | 2020-12-29 | 2023-09-15 | 科技谷(厦门)信息技术有限公司 | Data acquisition method based on distributed crawler engine |
Also Published As
Publication number | Publication date |
---|---|
CN107895009B (en) | 2021-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107895009A (en) | One kind is based on distributed internet data acquisition method and system | |
CN101222349B (en) | Method and system for collecting web user action and performance data | |
US9519561B2 (en) | Method and system for configuration-controlled instrumentation of application programs | |
CN107071009A (en) | A kind of distributed big data crawler system of load balancing | |
CN107025296B (en) | Based on science service information intelligent grasping system method of data capture | |
CN106897357A (en) | A kind of method for crawling the network information for band checking distributed intelligence | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
CN102184184B (en) | Method for acquiring webpage dynamic information | |
CN105912587A (en) | Data acquisition method and system | |
CN110020062B (en) | Customizable web crawler method and system | |
CN109933701A (en) | A kind of microblog data acquisition methods based on more strategy fusions | |
Bollen et al. | An architecture for the aggregation and analysis of scholarly usage data | |
CN106603296A (en) | Log processing method and device | |
WO2014055579A1 (en) | Pagination of data based on recorded url requests | |
CN109729044A (en) | A kind of general internet data acquisition is counter to climb system and method | |
CN106209863B (en) | A kind of web portal security monitoring method based on whole station scanning | |
CN110851681A (en) | Crawler processing method and device, server and computer readable storage medium | |
CN105721578A (en) | User behavior data collection method and system | |
JP6286559B2 (en) | Method and device for adding sign icons in interactive applications | |
CN106897313B (en) | Mass user service preference evaluation method and device | |
CN107480189A (en) | A kind of various dimensions real-time analyzer and method | |
CN105245394A (en) | Method and equipment for analyzing network access log based on layered approach | |
Intapong et al. | Collecting data of SNS user behavior to detect symptoms of excessive usage: Development of data collection application | |
CN108205548A (en) | A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition | |
Sah | A New Architecture for Managing Enterprise Log Data. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 2023-10-12. Address after: Building 44, Building 241, Shuaifuyuan Community, Tongzhou District, Beijing, 101100. Patentee after: Zhang Wei. Address before: Floor 9, Block D, Beifa Building, No. 15, Xueyuan South Road, Haidian District, Beijing 100080. Patentee before: BEIJING GUOXIN HONGSHU TECHNOLOGY CO.,LTD. |