CN103279537A - Method and device for acquiring web page data - Google Patents

Publication number
CN103279537A
Authority
CN
China
Prior art keywords
data
web
web page
node
joint element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102173918A
Other languages
Chinese (zh)
Inventor
张怡明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI SOFTVAN TECHNOLOGIES Co Ltd
Original Assignee
SHANGHAI SOFTVAN TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI SOFTVAN TECHNOLOGIES Co Ltd
Priority to CN2013102173918A
Publication of CN103279537A

Abstract

The invention discloses a method and a device for acquiring web page data. The method includes: decomposing the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each comprising a node identifier; receiving a data acquisition command from the user, the command comprising data acquisition subcommands that correspond to the respective node identifiers; executing, according to each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to acquire the node data of each web page node element; and composing the acquired node data of the web page node elements into the web page data of the first target web page. Compared with the prior art, in which web page data can only be retrieved and acquired manually, the user only needs to input instructions: the data is acquired without manual retrieval or downloading, which saves considerable labor and material cost.

Description

A web page data acquisition method and device
Technical field
The application relates to the field of computer application technology, and in particular to a web page data acquisition method and device.
Background
With the rapid development of network technology, the internet has become a platform that carries a vast amount of information, and it is also the information carrier with the highest transmission efficiency.
At present, web page data is generally acquired by manually retrieving and downloading, from the internet, the web page data that meets the user's needs. This approach consumes considerable labor and material cost.
Summary of the invention
The technical problem to be solved by the application is to provide a web page data acquisition method and device, so as to solve the prior-art problem that manually retrieving and acquiring the web page data that meets the user's needs consumes considerable labor and material cost.
The application provides a web page data acquisition method, comprising:
Decomposing the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each web page node element comprising a node identifier;
Receiving a data acquisition command from the user, the data acquisition command comprising data acquisition subcommands that correspond to the respective node identifiers;
Executing, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to acquire the node data of each web page node element;
Composing the acquired node data of the web page node elements into the web page data of the first target web page.
In the above method, preferably, the data acquisition command comprises a data capture subcommand and/or a click operation subcommand.
In the above method, preferably, after the web page data of the first target web page is composed, the method further comprises:
Recording, for the web page node elements in the first target web page, the data acquisition behavior of each element, the data acquisition order, and the time interval between the data acquisitions of each two adjacent elements, to generate a data acquisition flow;
Determining at least one second target web page, the web page framework of which is identical to the web page framework of the first target web page;
Executing, with the data acquisition flow, the data acquisition behavior on the node data content corresponding to the web page node elements in each second target web page, to acquire the web page data of each second target web page.
In the above method, preferably, decomposing the web page framework of the first target web page according to the received user retrieval request to generate at least one web page node element comprises:
Reading the source code corresponding to the web page framework of the first target web page;
Decomposing the web page framework according to the source code to generate at least one web page node element, each corresponding to a retrieval key value in the received user retrieval request and each comprising a node identifier.
In the above method, preferably, after the acquired node data of the web page node elements is composed into the web page data of the first target web page, the method further comprises:
Determining a web page to be edited, the web page node elements of which correspond one to one with at least one web page node element in the first target web page;
Editing, according to the web page data of the first target web page, the node data content of each web page node element in the web page to be edited.
The application also provides a web page data acquisition device, comprising:
A web page decomposition unit, configured to decompose the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each web page node element comprising a node identifier;
A command receiving unit, configured to receive a data acquisition command from the user, the data acquisition command comprising data acquisition subcommands that correspond to the respective node identifiers;
A behavior execution unit, configured to execute, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to acquire the node data of each web page node element;
A data combination unit, configured to compose the acquired node data of the web page node elements into the web page data of the first target web page.
In the above device, preferably, the data acquisition command comprises a data capture subcommand and/or a click operation subcommand.
In the above device, preferably, the device further comprises:
A flow recording unit, configured to record, for the web page node elements in the first target web page, the data acquisition behavior of each element, the data acquisition order, and the time interval between the data acquisitions of each two adjacent elements, to generate a data acquisition flow;
A web page determination unit, configured to determine, in batches, at least one second target web page, the web page framework of which is identical to the web page framework of the first target web page;
A batch data acquisition unit, configured to execute, with the data acquisition flow, the data acquisition behavior on the node data content corresponding to each web page node element in each second target web page, to acquire the web page data of each second target web page.
In the above device, preferably, the web page decomposition unit comprises:
A code reading subunit, configured to read the source code corresponding to the web page framework of the first target web page;
A web page decomposition subunit, configured to decompose the web page framework according to the source code to generate at least one web page node element, each corresponding to a retrieval key value in the received user retrieval request and each comprising a node identifier.
In the above device, preferably, the device further comprises:
A web page determination unit, configured to determine a web page to be edited, the web page node elements of which correspond one to one with at least one web page node element in the first target web page;
A web page editing unit, configured to edit, according to the web page data of the first target web page, the node data content of each web page node element in the web page to be edited.
As can be seen from the above scheme, the web page data acquisition method and device provided by the application decompose the target web page according to the user's needs to obtain the node elements that meet those needs, and then execute the data acquisition behavior on each node element according to the user's data acquisition command to obtain the node data of each node element. The web page data in the target web page that meets the user's needs is thereby acquired. Compared with the prior-art scheme of retrieving and acquiring web page data manually, the user only needs to input instructions; no manual retrieval or downloading is required, which saves considerable labor and material cost.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of embodiment one of the web page data acquisition method provided by the application;
Fig. 2 is an application example diagram of embodiment one of the application;
Fig. 3 is a partial flowchart of embodiment two of the web page data acquisition method provided by the application;
Fig. 4 is a partial flowchart of embodiment three of the web page data acquisition method provided by the application;
Fig. 5 is a partial flowchart of embodiment four of the web page data acquisition method provided by the application;
Fig. 6 is an application example diagram of embodiment four of the application;
Fig. 7 is a structural diagram of embodiment five of the web page data acquisition device provided by the application;
Fig. 8 is a partial structural diagram of embodiment five of the application;
Fig. 9 is a partial structural diagram of embodiment six of the web page data acquisition device provided by the application;
Fig. 10 is a partial structural diagram of embodiment seven of the web page data acquisition device provided by the application;
Fig. 11 is a structural diagram of embodiment eight of the web page data acquisition device provided by the application;
Fig. 12 is a partial structural diagram of embodiment eight of the application.
Detailed description of the embodiments
The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the application without creative effort fall within the scope of protection of the application.
Referring to Fig. 1, which shows a flowchart of embodiment one of the web page data acquisition method provided by the application, the method may comprise the following steps:
Step 101: decompose the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element.
Each web page node element comprises a node identifier. The node identifier indicates a unique characteristic value of the web page node element within the web page framework and may be a position identifier. The user retrieval request indicates the characteristics of the web page data the user needs to acquire; for example, the data the user needs may be travel information or entertainment data.
It should be noted that the web page framework refers to the layout structure of the web page as rendered in the browser. For example, the web page in Fig. 2 contains a title column, a list column and a text column, and the layout structure formed by these three columns can be understood as the web page framework of that page.
Decomposing the framework of any web page generates a plurality of node elements. Taking the first target web page in Fig. 2 as an example, decomposing its web page framework yields three web page node elements: a title node, a list node and a text node.
In this embodiment, suppose the user retrieval request indicates that the user needs to acquire data about travel information. The web page node elements generated in this embodiment can then comprise the title node and the text node, each of which is related to travel.
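A minimal sketch of step 101 may make the decomposition concrete. It assumes the page of Fig. 2 has already been reduced to three (kind, content) columns, uses a position-style path as the node identifier, and treats a case-insensitive substring match as the test for "related to travel". The names NodeElement, decompose and page_layout, and the sample texts, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class NodeElement:
    node_id: str   # node identifier; a position-style path is assumed here
    kind: str      # "title", "list" or "text"
    content: str   # node data content


def decompose(layout, keyword):
    """Turn each column of the layout into a node element, then keep the
    elements whose content mentions the retrieval keyword."""
    nodes = [
        NodeElement(node_id=f"/page/{index}", kind=kind, content=content)
        for index, (kind, content) in enumerate(layout)
    ]
    return [node for node in nodes if keyword.lower() in node.content.lower()]


# Hypothetical page with the three columns of Fig. 2.
page_layout = [
    ("title", "Travel guide: three days in Hangzhou"),
    ("list", "Sports | Finance | Weather"),
    ("text", "This travel itinerary starts at West Lake."),
]

selected = decompose(page_layout, "travel")
for node in selected:
    print(node.node_id, node.kind)
```

With this sample input only the title node and the text node are kept, which mirrors the travel example in the text; the list node is dropped because its content does not mention the keyword.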
Step 102: receive a data acquisition command from the user, the data acquisition command comprising data acquisition subcommands that correspond to the respective node identifiers.
Each data acquisition subcommand indicates the data acquisition behavior the user wants performed on the node data content of the web page node element it belongs to. For example, a data acquisition subcommand may instruct that the node data content of the web page node element with the corresponding node identifier be clicked or captured.
Step 103: execute, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to acquire the node data of each web page node element.
The data acquisition command comprises a data capture subcommand and/or a click operation subcommand.
In this embodiment, the data acquisition command can comprise a data capture subcommand, that is, the data acquisition behavior comprises a data capture behavior. In this case, step 103 can be implemented as follows:
Execute, with each data capture subcommand, data capture on the node data content corresponding to its node identifier, to acquire the node data of each web page node element.
For example, in the target web page shown in Fig. 2, the data capture subcommand is executed on the title node and the text node to obtain the node data of each.
It should be noted that, besides text nodes such as those shown in Fig. 2, web page node elements also include nodes such as buttons or sub-links. For these nodes the data acquisition subcommand is a click operation subcommand, that is, a subcommand to click the button or sub-link. In this case, step 103 can be implemented as follows:
Perform a click operation on the button or sub-link with the click operation subcommand;
Acquire the data that appears after the button is clicked, or the web page data of the web page that the sub-link points to.
The above scheme can be understood as follows: after the user's command is received, each data acquisition subcommand depends on the type of its web page node element. According to that type, for example text node or button node, the corresponding operation, such as data capture or a click operation, is performed on the node data content of the element. The node data of each web page node element is thereby acquired, and the node data of all the elements together forms the web page data of the first target web page.
Step 104: compose the acquired node data of the web page node elements into the web page data of the first target web page.
Step 104 can be implemented as follows:
Compose the acquired node data of the web page node elements, in the form of a list, into the web page data of the first target web page.
As can be seen from the above scheme, embodiment one of the web page data acquisition method provided by the application decomposes the target web page according to the user's needs to obtain the node elements that meet those needs, and then executes the data acquisition behavior on each node element according to the user's data acquisition command to obtain the node data of each node element. The web page data in the target web page that meets the user's needs is thereby acquired. Compared with the prior-art scheme of retrieving and acquiring web page data manually, the user only needs to input instructions; no manual retrieval or downloading is required, which saves considerable labor and material cost.
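Steps 102 to 104 can be sketched as a small dispatch table, assuming a data acquisition command is represented as a mapping from node identifier to subcommand name. The click is only simulated by reading a stored value; a real tool would have to drive a browser or follow the link. All names here (capture, click, SUBCOMMANDS, acquire, after_click) and the sample contents are hypothetical illustrations, not part of the application.

```python
def capture(node):
    """Data capture subcommand: read the node data content directly."""
    return node["content"]


def click(node):
    """Click operation subcommand: return what appears after the click.
    The click is simulated; a real tool would drive a browser here."""
    return node["after_click"]


SUBCOMMANDS = {"capture": capture, "click": click}


def acquire(nodes, command):
    """Run each subcommand on the node it names and compose a list."""
    web_page_data = []
    for node_id, subcommand in command.items():
        handler = SUBCOMMANDS[subcommand]
        web_page_data.append((node_id, handler(nodes[node_id])))
    return web_page_data


# Hypothetical node elements: two text-like nodes and one sub-link node.
nodes = {
    "title": {"content": "Travel guide"},
    "text": {"content": "Day one: West Lake."},
    "more": {"content": "Read more", "after_click": "Day two: tea fields."},
}
command = {"title": "capture", "text": "capture", "more": "click"}

web_page_data = acquire(nodes, command)
for node_id, data in web_page_data:
    print(node_id, "->", data)
```

Keeping the subcommands in a table keyed by name means a new node type only needs one more handler, which matches the text's point that the subcommand depends on the type of the node element.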
Referring to Fig. 3, which shows a partial flowchart of embodiment two of the web page data acquisition method provided by the application, after step 104 the method further comprises the following steps:
Step S105: record, for the web page node elements in the first target web page, the data acquisition behavior of each element, the data acquisition order, and the time interval between the data acquisitions of each two adjacent elements, to generate a data acquisition flow.
For example, in the target web page shown in Fig. 2, the data acquisition behaviors of the title node and the text node, the data acquisition order, and the time interval between the data acquisitions of the adjacent title node and text node are recorded, and the data acquisition flow of the first target web page is generated along a time axis. For instance: first perform the data acquisition operation on the title node; then, after a first time length, perform the data acquisition operation on the text node.
Step S106: determine at least one second target web page, the web page framework of which is identical to the web page framework of the first target web page.
That the two web page frameworks are identical means that, for the same user retrieval request, the web page node elements generated by decomposing the framework of the second target web page correspond one to one with the web page node elements of the first target web page.
It should be noted that determining at least one second target web page in step S106 can be understood as determining a batch task, that is, determining a task set containing the web page data acquisition tasks of a plurality of second target web pages.
Step S107: execute, with the data acquisition flow, the data acquisition behavior on the node data content corresponding to the web page node elements of each second target web page, to acquire the web page data of each second target web page.
Step S107 can be understood as performing data acquisition on each second target web page by following the execution flow of steps 101 to 104 above, to obtain the web page data of each second target web page.
In practice, after step S105 the generated data acquisition flow can also be stored as a configuration option. When web page data later needs to be acquired from a target web page with the same web page framework, the stored configuration option is called directly to acquire the data of that page.
As can be seen from the above scheme, embodiment two of the web page data acquisition method provided by the application, after acquiring the web page data of the first target web page, records the acquisition process of that data to generate a data acquisition flow, and uses the flow to acquire, for the same user need, the web page data of a plurality of second target web pages that share the same web page framework, thereby completing a batch task. Compared with the prior-art scheme of retrieving and acquiring web page data manually, the web page data of the target web pages can be acquired quickly and effectively, and the savings in labor and material cost are especially clear for batch tasks.
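The recording and batch replay of embodiment two can be sketched as follows, assuming each acquisition is logged as a (timestamp, node identifier, behavior) event and that pages with the same framework expose the same node identifiers. Storing the flow as JSON stands in for the "configuration option" mentioned above. The function names, the event format and the sample pages are hypothetical, and only the capture behavior is handled.

```python
import json
import time


def record_flow(events):
    """Build a flow from (timestamp, node_id, behavior) events, listed in
    the order they happened. Each step stores the wait before it."""
    flow = []
    previous = None
    for timestamp, node_id, behavior in events:
        wait = 0.0 if previous is None else timestamp - previous
        flow.append({"node_id": node_id, "behavior": behavior, "wait": wait})
        previous = timestamp
    return flow


def replay(flow, page, sleep=time.sleep):
    """Apply the recorded flow to a page with the same framework."""
    data = []
    for step in flow:
        sleep(step["wait"])  # keep the recorded interval between steps
        if step["behavior"] != "capture":
            raise ValueError("only 'capture' is handled in this sketch")
        data.append((step["node_id"], page[step["node_id"]]))
    return data


# Title captured first, text captured 2.5 seconds later (assumed timings).
flow = record_flow([(10.0, "title", "capture"), (12.5, "text", "capture")])
stored = json.dumps(flow)  # keep the flow as a configuration option

# Hypothetical second target web pages sharing the same framework.
second_pages = [
    {"title": "Beijing in a weekend", "text": "Start at the Forbidden City."},
    {"title": "Xi'an on foot", "text": "Walk the city wall first."},
]
# sleep is replaced by a no-op so the batch runs instantly in this example.
batch = [replay(json.loads(stored), page, sleep=lambda seconds: None)
         for page in second_pages]
```

Passing the sleep function as a parameter is a design choice of this sketch: it keeps the recorded intervals available for a real run while letting the batch be exercised without waiting.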
In a specific implementation of the application, decomposing the web page framework of the first target web page according to the received user retrieval request to generate at least one web page node element can be realized by the following steps. Refer to Fig. 4, which is a partial flowchart of embodiment three of the web page data acquisition method provided by the application:
Step S401: read the source code corresponding to the web page framework of the first target web page.
Before the web page framework of the first target web page is decomposed, the framework needs to be converted into a readable form, that is, its source code is read.
Step S402: decompose the web page framework according to the source code to generate at least one web page node element, each corresponding to a retrieval key value in the received user retrieval request.
Each web page node element comprises a node identifier.
For example, the retrieval key value comprises keywords such as "travel", "sightseeing" or "round-trip flight".
It should be noted that, after the source code is read, it is parsed to identify the code that represents web page node elements in the web page framework, and the corresponding web page node elements are generated in turn. Each web page node element corresponds to a retrieval key value in the user retrieval request. In other words, the source code of the web page framework of the first target web page is used to decompose the framework and generate at least one web page node element corresponding to the retrieval key value, and each web page node element comprises a node identifier that is specific to that element.
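Steps S401 and S402 can be illustrated with Python's standard html.parser module. The sketch assumes the source code is HTML, builds each node identifier from the path of open tags plus a running counter, and treats a case-insensitive substring match as "corresponding to the retrieval key value". The class NodeCollector, the function decompose_source and the sample markup are hypothetical; the application does not specify a parser or an identifier format.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collect (node_id, tag, text) for every non-empty text run."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.nodes = []    # collected node elements
        self.counter = 0   # makes node identifiers unique

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.counter += 1
            node_id = "/".join(self.stack) + f"#{self.counter}"
            self.nodes.append((node_id, self.stack[-1], text))


def decompose_source(source, keywords):
    """Parse the source code and keep nodes that mention any keyword."""
    collector = NodeCollector()
    collector.feed(source)
    collector.close()
    return [node for node in collector.nodes
            if any(key.lower() in node[2].lower() for key in keywords)]


# Hypothetical source code of a first target web page.
SOURCE = (
    "<html><body><h1>Travel news</h1><ul><li>Stocks</li></ul>"
    "<p>Cheap round-trip flights for your travel plans.</p></body></html>"
)
matches = decompose_source(SOURCE, ["travel"])
for node_id, tag, text in matches:
    print(node_id, tag, text)
```

This sketch does not handle void tags such as br or img, which would stay on the tag stack; a fuller version would need to skip them.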
In prior-art practice, after the data of a web page has been acquired, it has to be edited manually into a specific publishing site. For example, after entertainment headlines are acquired at certain moments, they have to be edited into particular web pages, which consumes labor and material resources. To save the effort of editing web pages, refer to Fig. 5, which is a partial flowchart of embodiment four of the web page data acquisition method provided by the application. After step 104, the method further comprises:
Step S108: determine a web page to be edited, the web page node elements of which correspond one to one with at least one web page node element in the first target web page.
This one-to-one correspondence means that every web page node element generated by decomposing the web page to be edited has a counterpart among the web page node elements of the first target web page, namely one of the node elements of the first target web page that correspond to the user's needs. In other words, the web page to be edited is decomposed into the node elements that need editing and that correspond to the retrieval key value in the user retrieval request. For example, the web page node elements decomposed from the first target web page shown in Fig. 2 comprise a title node and a text node; the web page node elements of the web page to be edited shown in Fig. 6 are a title node to be edited and a text node to be edited, which correspond to that title node and that text node respectively.
Step S109: edit, according to the web page data of the first target web page, the node data content of each web page node element in the web page to be edited.
Step S109 can be realized in the following manner:
For each web page node element of the web page to be edited, determine the corresponding web page node element in the first target web page as its target node element;
Edit the node data content of each target node element into the corresponding web page node element of the web page to be edited.
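The two sub-steps of step S109 can be sketched as a lookup followed by a copy, assuming both pages are represented as mappings from node identifier to node data content and that the one-to-one correspondence is given explicitly. The names edit_page and correspondence and the sample data are hypothetical.

```python
def edit_page(page_to_edit, first_page_data, correspondence):
    """Return a copy of page_to_edit in which every node listed in
    correspondence holds the data of its target node element."""
    edited = dict(page_to_edit)
    for edit_id, target_id in correspondence.items():
        # Sub-step 1: target_id names the target node element.
        # Sub-step 2: its node data content is written into the edit node.
        edited[edit_id] = first_page_data[target_id]
    return edited


# Hypothetical web page data acquired from the first target web page.
first_page_data = {
    "title": "Festival opens tonight",
    "text": "Doors open at 7 pm.",
}
# Hypothetical web page to be edited, with empty nodes as in Fig. 6.
page_to_edit = {"edit_title": "", "edit_text": ""}
correspondence = {"edit_title": "title", "edit_text": "text"}

edited = edit_page(page_to_edit, first_page_data, correspondence)
print(edited)
```

The function returns a new mapping rather than changing page_to_edit in place, so the unedited page remains available if the result has to be discarded.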
Referring to Fig. 7, which shows a structural diagram of embodiment five of the web page data acquisition device provided by the application, the device comprises:
A web page decomposition unit 701, configured to decompose the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each web page node element comprising a node identifier.
The user retrieval request indicates the characteristics of the web page data the user needs to acquire; for example, the data the user needs may be travel information or entertainment data. The node identifier indicates a unique characteristic value of the web page node element within the web page framework and may be a position identifier.
It should be noted that the web page framework refers to the layout structure of the web page as rendered in the browser. For example, the web page in Fig. 2 contains a title column, a list column and a text column, and the layout structure formed by these three columns can be understood as the web page framework of that page.
Decomposing the framework of any web page generates a plurality of node elements. Taking the first target web page in Fig. 2 as an example, decomposing its web page framework yields three web page node elements: a title node, a list node and a text node.
In this embodiment, suppose the user retrieval request indicates that the user needs to capture data about travel information. The web page node elements generated in this embodiment can then comprise the title node and the text node, each of which is related to travel.
A command receiving unit 702, configured to receive a data acquisition command from the user, the data acquisition command comprising data acquisition subcommands that correspond to the respective node identifiers.
Each data acquisition subcommand indicates the acquisition behavior the user wants performed on the node data content of the web page node element it belongs to. For example, a data acquisition subcommand may instruct that the node data content of the web page node element with the corresponding node identifier be clicked or captured.
It should be noted that the command receiving unit 702 is connected to the web page decomposition unit 701.
A behavior execution unit 703, configured to execute, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to acquire the node data of each web page node element.
It should be noted that the behavior execution unit 703 is connected to the command receiving unit 702.
The data acquisition command comprises a data capture subcommand and/or a click operation subcommand.
With reference to figure 8, be the part-structure synoptic diagram of the embodiment of the present application five, described behavior performance element 703 comprises:
First carries out subelement 731, be used for described data grasp subcommand to its separately the node data content of node identification correspondence carry out data owner and grasp, to obtain the node data of each described web page joint element.
Second carries out subelement 732, is used for described clicking operation subcommand described button or sublink being carried out clicking operation, with the web data of the webpage that obtains data that the clicked back of described button occurs or described sublink correspondence.
In the embodiment of the present application, described data are obtained and can be comprised in the order that data grasp subcommand, that is: described data acquisition lines grasps behavior for comprising data.For example, in the target web shown in Fig. 2, described title node and text node are carried out data extracting subcommand, obtain described title node and described text node node data separately.
Need to prove, described web page joint element is except text node as shown in Figure 2, comprise also that as nodes such as button or sublinks described data are obtained subcommand and are specially the clicking operation subcommand at this moment, can be understood as: to the clicking operation subcommand of described button or sublink.
Wherein, can be understood as in the such scheme: after receiving user operation commands, it is relevant with the type of its corresponding web page joint element that each described data is obtained subcommand, type such as text node or button node according to the web page joint element, node data content to this element is carried out corresponding operation behavior, as data extracting or clicking operation etc., and then get access to the node data of each web page joint element correspondence, formed the web data of described first target web by the node data of each described web page joint element.
The data combination unit 704 is configured to combine the acquired node data of each web page node element into the web page data of the first target web page.
The data combination unit 704 may be implemented as follows:
The acquired node data of each web page node element is combined, in the form of a list, into the web page data of the first target web page.
It should be noted that the data combination unit 704 is connected to the behavior execution unit 703.
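As a minimal sketch of the list-form combination, assuming the node data arrives as a mapping from node identifier to content and that the desired order of identifiers is known (the function name `compose_web_data` and both parameters are hypothetical):

```python
from typing import Dict, List, Tuple


def compose_web_data(node_data: Dict[str, str], order: List[str]) -> List[Tuple[str, str]]:
    """Combine node data into a list of (node identifier, node data) pairs.

    Identifiers listed in `order` but missing from `node_data` are skipped.
    """
    return [(node_id, node_data[node_id]) for node_id in order if node_id in node_data]
```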
As can be seen from the above scheme, embodiment five of the web page data acquisition device provided by this application decomposes the target web page according to the user's request to obtain the node elements that meet that request, and then performs the data acquisition behavior on each node element according to the user's data acquisition command to obtain its node data. The web page data in the target web page that meets the user's needs is thereby obtained. Compared with the prior-art scheme of retrieving and acquiring web page data manually, the user only needs to input instructions; no manual retrieval or downloading is required, which saves considerable labor and material cost.
Referring to Fig. 9, which shows a partial structural schematic diagram of embodiment six of the web page data acquisition device provided by this application, the device further includes:
The flow recording unit 705 is configured to record, for each web page node element in the first target web page, the data acquisition behavior, the data acquisition order, and the time interval between the data acquisitions of two consecutively processed web page node elements, so as to generate a data acquisition flow.
For example, in the target web page shown in Fig. 2, the data acquisition behavior of the title node and the text node, the data acquisition order, and the time interval between the data acquisitions of the adjacent title node and text node are recorded, and the data acquisition flow of the first target web page is generated along a time axis. For instance: the data acquisition operation is first performed on the title node and then, after an interval of a first time length, on the text node.
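One way to picture the recorded flow is as an ordered list of steps, each holding the node identifier, the behavior, and the wait since the previous step. The sketch below assumes timestamps in seconds are supplied by the caller; `FlowStep`, `FlowRecorder` and their fields are hypothetical names, not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlowStep:
    node_id: str        # which web page node element was processed
    behavior: str       # e.g. "capture" or "click"
    wait_before: float  # seconds since the previous step (0.0 for the first step)


class FlowRecorder:
    """Records behavior, order and time intervals along a time axis."""

    def __init__(self) -> None:
        self.steps: List[FlowStep] = []
        self._last_time: Optional[float] = None

    def record(self, node_id: str, behavior: str, timestamp: float) -> None:
        # The interval is measured between consecutive acquisitions.
        wait = 0.0 if self._last_time is None else timestamp - self._last_time
        self.steps.append(FlowStep(node_id, behavior, wait))
        self._last_time = timestamp
```

The order of acquisition is implicit in the list order, so only the behavior and the interval need to be stored per step.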
The batch web page determining unit 706 is configured to determine at least one second target web page, the web page framework of which is identical to that of the first target web page.
Here, the web page framework of the second target web page being identical to that of the first target web page means that, for the same user retrieval request, the web page node elements generated by decomposing the web page framework of the second target web page correspond one-to-one with the web page node elements of the first target web page.
It should be noted that the determination of at least one second target web page by the batch web page determining unit 706 can be understood as the process of determining a batch task, that is, determining a task set containing the web page data acquisition tasks of a plurality of second target web pages.
The batch data acquisition unit 707 is configured to perform, according to the data acquisition flow, the data acquisition behavior on the node data content corresponding to each web page node element in each second target web page, so as to obtain the web page data of each second target web page.
It should be noted that the batch data acquisition unit 707 is connected to the flow recording unit 705 and to the batch web page determining unit 706.
The function performed by the batch data acquisition unit 707 can be understood as follows: the data acquisition processing described above for the web page decomposition unit 701, the command receiving unit 702, the behavior execution unit 703 and the data combination unit 704 is applied to each second target web page to obtain the web page data of each second target web page.
In practical applications of this application, after the flow recording unit 705 generates the data acquisition flow, the flow may be stored as a configuration option, so that when web page data later needs to be acquired from a target web page with the same web page framework, the stored configuration option is called directly to acquire the data of that target web page.
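Storing the flow as a configuration option and replaying it over a batch of pages might look like the following sketch. It assumes, purely for illustration, that the flow is serialized as JSON, that each second target web page is already decomposed into a mapping from node identifier to content, and that only the capture behavior is replayed; `save_flow` and `replay_flow` are hypothetical names.

```python
import json
import time
from typing import Callable, Dict, List


def save_flow(steps: List[dict]) -> str:
    """Serialize the recorded flow so it can be stored as a configuration option."""
    return json.dumps(steps)


def replay_flow(
    config: str,
    pages: List[Dict[str, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Apply the stored flow to every page that shares the same framework."""
    steps = json.loads(config)
    results = []
    for page in pages:
        data = {}
        for step in steps:
            sleep(step["wait_before"])  # honor the recorded time interval
            data[step["node_id"]] = page[step["node_id"]]  # capture behavior only
        results.append(data)
    return results
```

Passing a no-op `sleep` (for example `lambda s: None`) makes the replay immediate, which is convenient when testing.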
As can be seen from the above scheme, in embodiment six of the web page data acquisition device provided by this application, after the web page data of the first target web page is acquired, the acquisition process is recorded to generate a data acquisition flow. That flow is then used to acquire, for the same user request, the web page data of a plurality of second target web pages having the same web page framework, thereby completing the batch task. Compared with the prior-art scheme of retrieving and acquiring web page data manually, the web page data of the target web pages can be acquired quickly and effectively, and the saving in labor and material cost is especially pronounced when completing batch tasks.
Referring to Fig. 10, which is a partial structural schematic diagram of embodiment seven of the web page data acquisition device provided by this application, in a specific implementation of the application the web page decomposition unit 701 includes:
The code reading subunit 711 is configured to read the source code corresponding to the web page framework of the first target web page.
Before the web page framework of the first target web page is decomposed, the framework needs to be converted into a readable form, that is, its source code needs to be read.
The web page decomposition subunit 712 is configured to decompose the web page framework according to the source code, so as to generate at least one web page node element, each corresponding to a retrieval key value in the received user retrieval request and each including a node identifier.
For example, the retrieval key values include keywords such as travel, sightseeing or round-trip flights.
The web page decomposition subunit 712 is connected to the code reading subunit 711.
It should be noted that, after the source code is read, it is parsed to identify the code that represents web page node elements in the web page framework, and the corresponding web page node elements are generated in turn, each corresponding to a retrieval key value in the user retrieval request. In other words, the source code of the web page framework of the first target web page is used to decompose the framework and generate at least one web page node element corresponding to a retrieval key value, and each web page node element includes a node identifier unique to that element.
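The decomposition step can be sketched with Python's standard `html.parser`. The application does not say how a node element is matched to a retrieval key value, so this example assumes a simple case-insensitive substring match on the node's text and assigns sequential identifiers; `NodeCollector`, `decompose` and the `node-N` identifier scheme are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class NodeCollector(HTMLParser):
    """Collects each non-empty text run together with its enclosing tag."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self.nodes: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.nodes.append({"tag": self._stack[-1], "content": text})


def decompose(source: str, keys: List[str]) -> List[Dict[str, str]]:
    """Parse the source code and keep the nodes that match a retrieval key value."""
    collector = NodeCollector()
    collector.feed(source)
    collector.close()
    matched = [
        n for n in collector.nodes
        if any(k.lower() in n["content"].lower() for k in keys)
    ]
    # Assumed scheme: a sequential identifier unique to each node element.
    return [dict(n, node_id=f"node-{i}") for i, n in enumerate(matched, start=1)]
```

Other matching rules (tag names, attributes, position in the framework) would fit the same structure; substring matching is used here only because it is the simplest to show.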
In practical prior-art applications, after the data of a web page is acquired, the data must be manually edited into a specific publishing website. For example, after the entertainment headlines are acquired at each given moment, they must be edited into particular web pages, which consumes labor and material resources. To save the labor and material resources spent on editing web pages, refer to Fig. 11, which is a structural schematic diagram of embodiment eight of the web page data acquisition device provided by this application. The device further includes:
The web page determining unit 708 is configured to determine a web page to be edited, each web page node element of which corresponds one-to-one with at least one web page node element in the first target web page.
Here, the one-to-one correspondence between each web page node element of the web page to be edited and at least one web page node element in the first target web page means that all the web page node elements generated by decomposing the web page to be edited are included among the web page node elements of the first target web page. Each web page node element of the web page to be edited refers to a node element in the first target web page that corresponds to the user's request; that is, the web page to be edited is decomposed into the node elements that need editing and that correspond to the retrieval key values in the user retrieval request. For example, the web page node elements decomposed from the first target web page shown in Fig. 2 include a title node and a text node, and the web page node elements of the web page to be edited shown in Fig. 6 are a title node to be edited and a text node to be edited, which correspond to that title node and text node respectively.
The web page editing unit 709 is configured to edit, according to the web page data of the first target web page, the node data content of each web page node element in the web page to be edited.
It should be noted that the web page editing unit 709 is connected to the web page determining unit 708 and to the data combination unit 704.
It should be noted that, when the web page editing unit 709 edits the node data content of each web page node element in the web page to be edited according to the web page data of the first target web page, it may do so as follows:
The web page node elements in the first target web page that correspond to the web page node elements of the web page to be edited are determined as target node elements;
The node data content of each target node element is edited into its corresponding web page node element in the web page to be edited.
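The two editing steps above amount to looking up each target node element and copying its content into the matching node of the web page to be edited. A minimal sketch, assuming both pages are represented as mappings from node identifier to content and that the correspondence is given explicitly (`edit_page` and `mapping` are hypothetical):

```python
from typing import Dict


def edit_page(
    page_to_edit: Dict[str, str],
    target_data: Dict[str, str],
    mapping: Dict[str, str],
) -> Dict[str, str]:
    """Fill each node of the page to be edited from its target node element.

    `mapping` maps a node identifier in the page to be edited to the
    identifier of the corresponding node in the first target web page.
    """
    edited = dict(page_to_edit)  # leave the caller's page untouched
    for edit_id, target_id in mapping.items():
        edited[edit_id] = target_data[target_id]
    return edited
```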
In addition, referring to Fig. 12, which is a partial structural schematic diagram of an embodiment of this application, the web page editing unit 709 may also be connected to the batch data acquisition unit 707 and, according to the web page data of each second target web page, edit the content of the web page node elements in the web page to be edited that correspond to web page node elements in the second target web page.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The web page data acquisition method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims (10)

1. A web page data acquisition method, characterized by comprising:
Decomposing the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each web page node element including a node identifier;
Receiving a data acquisition command of the user, the data acquisition command including data acquisition subcommands respectively corresponding to the node identifiers;
Performing, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to obtain the node data of each web page node element;
Combining the acquired node data of each web page node element into the web page data of the first target web page.
2. The method according to claim 1, characterized in that the data acquisition command includes a data capture subcommand and/or a click operation subcommand.
3. The method according to claim 1 or 2, characterized in that, after the web page data of the first target web page is formed, the method further comprises:
Recording the data acquisition behavior of each web page node element in the first target web page, the data acquisition order, and the time interval between the data acquisitions of two consecutively processed web page node elements, to generate a data acquisition flow;
Determining at least one second target web page, the web page framework of the second target web page being identical to the web page framework of the first target web page;
Performing, according to the data acquisition flow, the data acquisition behavior on the node data content corresponding to the web page node elements in each second target web page, to obtain the web page data of each second target web page.
4. The method according to claim 1 or 2, characterized in that decomposing the web page framework of the first target web page according to the received user retrieval request to generate at least one web page node element comprises:
Reading the source code corresponding to the web page framework of the first target web page;
Decomposing the web page framework according to the source code to generate at least one web page node element respectively corresponding to a retrieval key value in the received user retrieval request, each web page node element including a node identifier.
5. The method according to claim 1 or 2, characterized in that, after the acquired node data of each web page node element is combined into the web page data of the first target web page, the method further comprises:
Determining a web page to be edited, each web page node element of the web page to be edited corresponding one-to-one with at least one web page node element in the first target web page;
Editing, according to the web page data of the first target web page, the node data content of each web page node element in the web page to be edited.
6. A web page data acquisition device, characterized by comprising:
A web page decomposition unit, configured to decompose the web page framework of a first target web page according to a received user retrieval request to generate at least one web page node element, each web page node element including a node identifier;
A command receiving unit, configured to receive a data acquisition command of the user, the data acquisition command including data acquisition subcommands respectively corresponding to the node identifiers;
A behavior execution unit, configured to perform, with each data acquisition subcommand, a data acquisition behavior on the node data content corresponding to its node identifier, to obtain the node data of each web page node element;
A data combination unit, configured to combine the acquired node data of each web page node element into the web page data of the first target web page.
7. The device according to claim 6, characterized in that the data acquisition command includes a data capture subcommand and/or a click operation subcommand.
8. The device according to claim 6 or 7, characterized in that the device further comprises:
A flow recording unit, configured to record the data acquisition behavior of each web page node element in the first target web page, the data acquisition order, and the time interval between the data acquisitions of two consecutively processed web page node elements, to generate an operation flow;
A batch web page determining unit, configured to determine at least one second target web page, the web page framework of the second target web page being identical to the web page framework of the first target web page;
A batch data acquisition unit, configured to perform, according to the data acquisition flow, the data acquisition behavior on the node data content corresponding to each web page node element in each second target web page, to obtain the web page data of each second target web page.
9. The device according to claim 6 or 7, characterized in that the web page decomposition unit comprises:
A code reading subunit, configured to read the source code corresponding to the web page framework of the first target web page;
A web page decomposition subunit, configured to decompose the web page framework according to the source code to generate at least one web page node element respectively corresponding to a retrieval key value in the received user retrieval request, each web page node element including a node identifier.
10. The device according to claim 6 or 7, characterized in that the device further comprises:
A web page determining unit, configured to determine a web page to be edited, each web page node element of the web page to be edited corresponding one-to-one with at least one web page node element in the first target web page;
A web page editing unit, configured to edit, according to the web page data of the first target web page, the node data content of each web page node element in the third target web page.
CN2013102173918A 2013-05-31 2013-05-31 Method and device for acquiring web page data Pending CN103279537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102173918A CN103279537A (en) 2013-05-31 2013-05-31 Method and device for acquiring web page data

Publications (1)

Publication Number Publication Date
CN103279537A true CN103279537A (en) 2013-09-04

Family

ID=49062056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102173918A Pending CN103279537A (en) 2013-05-31 2013-05-31 Method and device for acquiring web page data

Country Status (1)

Country Link
CN (1) CN103279537A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206664A (en) * 2007-12-17 2008-06-25 张尧森 Method for interception and incorporation of web page information unit
US20110035345A1 (en) * 2009-08-10 2011-02-10 Yahoo! Inc. Automatic classification of segmented portions of web pages
WO2011130868A1 (en) * 2010-04-19 2011-10-27 Hewlett-Packard Development Company, L. P. Segmenting a web page into coherent functional blocks
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
WO2013063734A1 (en) * 2011-10-31 2013-05-10 Hewlett-Packard Development Company, L.P. Determining document structure similarity using discrete wavelet transformation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王璟琦: ""基于内容单元的网页解析与内容提取"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
苗苗: ""基于页面分块的网页内容提取的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109313662A (en) * 2016-06-20 2019-02-05 微软技术许可有限责任公司 To the destructing and presentation to webpage in the machine application experience
CN109313662B (en) * 2016-06-20 2022-02-01 微软技术许可有限责任公司 Deconstruction and presentation of web pages into a native application experience

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130904
