CN104281575A - Webpage data obtaining method and template engine - Google Patents

Publication number
CN104281575A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310273053.6A
Other languages
Chinese (zh)
Inventor
张宝玉
马向晖
郭铁志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI MUSE INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI MUSE INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Application filed by SHANGHAI MUSE INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201310273053.6A
Publication of CN104281575A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a webpage data obtaining method and a template engine. The position identifier of each webpage element in the webpage template of a target webpage is recorded when the target webpage is generated. The method comprises: in response to a data acquisition request for the target webpage, locating each data element in the target webpage according to the position identifier of each webpage element; extracting the data content of each data element; and obtaining the current webpage data of the target webpage from the data contents. With this method and template engine, a developer does not need to perform DOM programming to obtain the webpage data of the target webpage. After the target webpage has been output, the current webpage data can be extracted on the user's request by referring to the pre-recorded position identifiers of the webpage elements, so no additional operation procedure is needed, complicated programming is avoided, and webpage data is obtained more efficiently.

Description

Webpage data obtaining method and template engine
Technical field
The present application relates to the field of computer application technology, and in particular to a webpage data obtaining method and a template engine.
Background
With the development of the Internet, the webpage template engine has gradually become the main tool for performing webpage generation tasks on the front end. A webpage template engine that generates webpages on the front end works as follows: the webpage data is first determined, the webpage data is then inserted into a preset webpage template, and the webpage template is run to output the finished webpage, which completes the webpage generation task.
In practice, there is often a need to extract the webpage data back out of a webpage after it has been generated. For example, for a list webpage output by a webpage template engine, the user may need to validate the text data that was supplied before the list webpage was generated.
However, for webpages generated by existing webpage template engines, the webpage data in the target webpage is usually obtained by traditional DOM programming. This adds steps to the acquisition flow, makes the coding cumbersome, and reduces the efficiency of obtaining webpage data.
Summary of the invention
The technical problem to be solved by the present application is to provide a webpage data obtaining method and a template engine, so as to solve the technical problem that prior-art schemes, which obtain webpage data by DOM programming, lengthen the acquisition flow and reduce the efficiency of obtaining webpage data.
The present application provides a webpage data obtaining method, in which the position identifier of each webpage element in the webpage template of a target webpage is recorded when the target webpage is generated. The method comprises:
in response to a data acquisition request for the target webpage, locating each data element in the target webpage according to the position identifier of each webpage element;
extracting the data content of each data element;
obtaining the current webpage data of the target webpage according to the data contents.
In the above method, preferably, locating each data element in the target webpage according to the position identifier of each webpage element comprises:
according to the position identifier of each webpage element, locating each data element in the target webpage by using a CSS (Cascading Style Sheets) selector tool (CSS Selector).
In the above method, preferably, recording the position identifier of each webpage element in the webpage template of the target webpage when the target webpage is generated comprises:
obtaining the offset value of each webpage element in its webpage template when the target webpage is generated;
determining the position identifier of each webpage element at the time the target webpage is generated, according to the offset value and the position identifier of each webpage element in the webpage template before the target webpage is generated.
In the above method, preferably, the data element comprises at least one row element, and each row element comprises at least one column element;
correspondingly, locating each data element in the target webpage according to the position identifier of each webpage element comprises:
locating each webpage element in the target webpage according to its position identifier;
locating the row elements in each webpage element;
locating the column elements in each row element.
In the above method, preferably, obtaining the current webpage data of the target webpage according to the data contents comprises:
filtering each data content according to the data filtering rule in the data acquisition request;
performing data format conversion on each filtered data content, the resulting data contents forming the current webpage data of the target webpage;
wherein the data format of the current webpage data of the target webpage is the same as the data format of the original webpage data, the original webpage data being the initial data that was inserted into the webpage template to generate the target webpage.
The present application also provides a template engine, comprising:
an identifier recording unit, configured to record the position identifier of each webpage element in the webpage template of a target webpage when the target webpage is generated;
an element locating unit, configured to locate, in response to a data acquisition request for the target webpage, each data element in the target webpage according to the position identifier of each webpage element;
a content extracting unit, configured to extract the data content of each data element;
a data obtaining unit, configured to obtain the current webpage data of the target webpage according to the data contents.
In the above template engine, preferably:
the element locating unit is specifically configured to locate, according to the position identifier of each webpage element, each data element in the target webpage by using a CSS (Cascading Style Sheets) selector tool (CSS Selector).
In the above template engine, preferably, the identifier recording unit comprises:
an offset value obtaining subunit, configured to obtain the offset value of each webpage element in its webpage template when the target webpage is generated;
an identifier determining subunit, configured to determine the position identifier of each webpage element at the time the target webpage is generated, according to the offset value and the position identifier of each webpage element in the webpage template before the target webpage is generated.
In the above template engine, preferably, the data element comprises at least one row element, and each row element comprises at least one column element;
correspondingly, the element locating unit comprises:
a webpage element locating subunit, configured to locate each webpage element in the target webpage according to its position identifier;
a row element locating subunit, configured to locate the row elements in each webpage element;
a column element locating subunit, configured to locate the column elements in each row element.
In the above template engine, preferably, the data obtaining unit comprises:
a data filtering subunit, configured to filter each data content according to the data filtering rule in the data acquisition request;
a format conversion subunit, configured to perform data format conversion on each filtered data content, the resulting data contents forming the current webpage data of the target webpage;
wherein the data format of the current webpage data of the target webpage is the same as the data format of the original webpage data, the original webpage data being the initial data that was inserted into the webpage template to generate the target webpage.
As can be seen from the above scheme, the webpage data obtaining method and template engine provided by the present application record in advance, when the target webpage is generated, the position identifier of each webpage element in the webpage template of the target webpage. After the target webpage is generated, when its webpage data needs to be obtained, each data element in the target webpage is first located according to the position identifier of each webpage element, the data content of each data element is extracted, and the current webpage data of the target webpage is then obtained from the data contents. To obtain the webpage data of the target webpage, the developer does not need to perform DOM programming. After the template engine has output the target webpage, the current webpage data can be extracted on the user's request simply by referring to the pre-recorded position identifiers of the webpage elements. No additional operation procedure is needed, cumbersome programming is avoided, and webpage data is obtained more efficiently.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of embodiment one of a webpage data obtaining method provided by the present application;
Fig. 2 is an example diagram for embodiment one of the present application;
Fig. 3 is another example diagram for embodiment one of the present application;
Fig. 4 is a partial flowchart of embodiment two of a webpage data obtaining method provided by the present application;
Fig. 5 is a partial flowchart of embodiment three of a webpage data obtaining method provided by the present application;
Fig. 6 is another partial flowchart of embodiment three of the present application;
Fig. 7 is an example diagram for embodiment three of the present application;
Fig. 8 is a partial flowchart of embodiment four of a webpage data obtaining method provided by the present application;
Fig. 9 is a structural diagram of embodiment five of a template engine provided by the present application;
Fig. 10 is a partial structural diagram of embodiment six of a template engine provided by the present application;
Fig. 11 is a partial structural diagram of embodiment seven of a template engine provided by the present application;
Fig. 12 is a partial structural diagram of embodiment eight of a template engine provided by the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Referring to Fig. 1, which is a flowchart of embodiment one of a webpage data obtaining method provided by the present application, the method can be applied to a template engine that uses its webpage template to generate a target webpage from prepared webpage data. The method can comprise the following steps:
Step 101: record the position identifier of each webpage element in the webpage template of the target webpage when the target webpage is generated.
It should be noted that the webpage template in the template engine comprises at least one webpage element. Each webpage element usually controls the layout, font, color, background, and other effects of the target webpage to be generated by means of CSS (Cascading Style Sheets), and the target webpage is then output by running the template engine.
After the template engine generates the target webpage, the position of each webpage element in the template engine changes to some extent. The reason is that, once the target webpage is generated, data has been inserted into each webpage element inside the webpage template, so each webpage element acquires some offset value. The webpage template in the template engine therefore differs before and after the target webpage is output, and what this step records is the position identifier of each webpage element in the webpage template of the target webpage after the target webpage is generated.
Step 101 is performed before this embodiment obtains the current webpage data of the target webpage. That is, by the time the current webpage data of the target webpage needs to be obtained, the position identifier of each webpage element in the webpage template of the target webpage has already been recorded and can be used during the acquisition.
Step 102: in response to a data acquisition request for the target webpage, locate each data element in the target webpage according to the position identifier of each webpage element.
The data acquisition request indicates that an operator such as a user needs to obtain the current webpage data of the generated target webpage. It can be triggered by the user, prompting the template engine to perform step 102 and the subsequent steps.
It should be noted that, when the target webpage is generated, data is inserted into each webpage element of the webpage template. After the target webpage is generated, the data unit inside each webpage element of the webpage template is the data element corresponding to that webpage element. Each data element contains its own data content.
Step 103: extract the data content of each data element.
In each data element, the data content can exist in row-column form or in column-row form.
For example, in row-column form, each data element comprises at least one row element and each row element comprises at least one column element. A row element can be understood as a data iteration item. Each column element stores part of the data content of the data element it belongs to, and the data contents of all column elements in all row elements together form the data content of the data element, as shown in Fig. 2. For instance, a data element may contain multiple data groups, each made up of data items of the same type; the data groups then act as row elements and the data items as column elements.
As another example, in column-row form, each data element comprises at least one column element and each column element comprises at least one row element. Each row element stores data content of the data element it belongs to, and the data contents of all row elements in all column elements together form the data content of the data element, as shown in Fig. 3.
It should be noted that the data content of each data element extracted in step 103 can be data in JSON (JavaScript Object Notation) format.
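As an illustration only, the two layouts described above can be sketched as plain Python structures serialized to JSON. The field names (`name`, `phone`), the sample values, and the helper `rows_to_columns` are hypothetical and do not come from the application; the sketch assumes every row carries the same column names.

```python
import json


def rows_to_columns(rows):
    """Convert row-column form (a list of rows) to column-row form (a dict of columns)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Each column collects the values of that key across all rows.
            columns.setdefault(key, []).append(value)
    return columns


# Row-column form: each row element is a data group, each column element a data item.
contacts_rows = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Bob", "phone": "555-0101"},
]

# Column-row form: each column element holds one row entry per data group.
contacts_columns = rows_to_columns(contacts_rows)

# Either form can be carried as JSON-formatted data content.
rows_json = json.dumps({"contacts": contacts_rows})
columns_json = json.dumps({"contacts": contacts_columns})
```

Both forms carry the same data content; which one is convenient depends on how the data element is laid out in the page.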
Step 104: obtain the current webpage data of the target webpage according to the data contents.
In step 104, the data contents can be combined directly to obtain the current webpage data of the target webpage. For example, the data contents are merged in order of storage size or in the order of their corresponding position identifiers to obtain the current webpage data of the target webpage.
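A minimal sketch of combining data contents in position-identifier order follows. It assumes, purely for illustration, that each position identifier is a `(start, end)` pair and that data contents are keyed by an element name; the function name `combine_by_position` and the sample values are hypothetical.

```python
def combine_by_position(contents, positions):
    """Merge data contents into one mapping, ordered by the start of each position identifier."""
    ordered_names = sorted(contents, key=lambda name: positions[name][0])
    # Python dicts preserve insertion order, so the result follows page order.
    return {name: contents[name] for name in ordered_names}


contents = {"footer": "bye", "title": "Hello", "body": "text"}
positions = {"title": (3, 8), "body": (20, 24), "footer": (40, 43)}
current_page_data = combine_by_position(contents, positions)
```

Ordering by storage size, the other option mentioned above, would only change the sort key.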
It should be noted that the process in which the template engine inserts the prepared original webpage data into the webpage template and generates the target webpage is called forward engineering. Forward engineering performs the task of inserting user input data (the original webpage data), in text or another format, into the webpage template and outputting the templated webpage. Correspondingly, in the present application, the process in which the template engine obtains the current webpage data from the target webpage is called reverse engineering.
As can be seen from the above scheme, embodiment one of the webpage data obtaining method provided by the present application records in advance, when the target webpage is generated, the position identifier of each webpage element in the webpage template of the target webpage. After the target webpage is generated, when its webpage data needs to be obtained, each data element in the target webpage is first located according to the position identifier of each webpage element, the data content of each data element is extracted, and the current webpage data of the target webpage is then obtained from the data contents. To obtain the webpage data of the target webpage, the developer does not need to perform DOM programming. After the template engine has output the target webpage, the current webpage data can be extracted on the user's request simply by referring to the pre-recorded position identifiers of the webpage elements. No additional operation procedure is needed, cumbersome programming is avoided, and webpage data is obtained more efficiently.
In a practical implementation of this embodiment, when each data element is located in step 102, a CSS (Cascading Style Sheets) selector tool (CSS Selector) can be used to locate each data element in the target webpage.
It should be noted that the CSS Selector tool is suitable for obtaining webpage data in browsers that support CSS. When the browser hosting the target webpage does not support CSS, the selector tool of a framework such as ExtJS, jQuery, or Dojo can be used to locate each data element in the target webpage.
Referring to Fig. 4, which is a flowchart of step 101 in embodiment two of a webpage data obtaining method provided by the present application, step 101 can be implemented by the following steps:
Step 401: obtain the offset value of each webpage element in its webpage template when the target webpage is generated.
As described above, after the target webpage is generated, data has been inserted into each webpage element inside the webpage template, so each webpage element acquires some offset value. In step 401, the offset value of each webpage element after data insertion is therefore calculated.
Step 402: determine the position identifier of each webpage element at the time the target webpage is generated, according to the offset value and the position identifier of each webpage element in the webpage template before the target webpage is generated.
The position identifier of each webpage element in the webpage template before the target webpage is generated can be obtained by using a regular expression, before the webpage template outputs its result, to calculate the start point and end point at which data is inserted into each webpage element; these points serve as its position identifier. In step 402, the position identifier of each webpage element at the time the target webpage is generated is then calculated from those start and end points and from the offset value of each webpage element in the webpage template after the target webpage is generated.
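Steps 401 and 402 can be sketched as follows. This is only one possible reading, under stated assumptions: the `{{name}}` placeholder syntax, the function name `render_and_record`, and the use of character offsets as position identifiers are all hypothetical, since the application does not specify a template syntax or an identifier format.

```python
import re

# Hypothetical placeholder syntax; the application does not define one.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_and_record(template, data):
    """Fill the template and record where each inserted value lands in the generated page."""
    parts = []
    positions = {}
    cursor = 0  # read position in the template
    offset = 0  # accumulated shift caused by data inserted so far (step 401)
    for match in PLACEHOLDER.finditer(template):
        name = match.group(1)
        value = data[name]
        parts.append(template[cursor:match.start()])
        parts.append(value)
        # Step 402: position in the page = start point in the template + offset.
        start = match.start() + offset
        positions[name] = (start, start + len(value))
        # The inserted value replaces the placeholder, which shifts later elements.
        offset += len(value) - (match.end() - match.start())
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), positions


page, positions = render_and_record(
    "<p>{{name}}</p><p>{{city}}</p>",
    {"name": "Alice", "city": "Shanghai"},
)
# Slicing the generated page at a recorded position returns the inserted data.
recovered = {name: page[start:end] for name, (start, end) in positions.items()}
```

Recording the positions while rendering is what later lets the data be read back without walking the page structure.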
Referring to Fig. 5, which is a flowchart of step 102 in embodiment three of a webpage data obtaining method provided by the present application: in this embodiment, the data content of the data element exists in row-column form, that is, the data element comprises at least one row element and each row element comprises at least one column element. Correspondingly, step 102 can be implemented by the following steps:
Step 501: in response to a data acquisition request for the target webpage, locate each webpage element in the target webpage according to the position identifier of each webpage element.
As described above, the browser of the target webpage in the present application supports CSS; therefore, when step 102 is performed, the CSS Selector tool can be used to locate each data element in the target webpage.
Step 501 indicates that, to locate each data element in the target webpage, each webpage element of the webpage template in the target webpage is located first.
Step 502: locate the row elements in each webpage element.
It should be noted that the data element in each webpage element exists in row-column form, that is, in the form shown in Fig. 2, where the data element comprises at least one row element. After each webpage element has been located, each row element in the webpage element is located.
Step 503: locate the column elements in each row element.
As described above, the data contents of all column elements in all row elements form the data content of the corresponding webpage element. Therefore, after step 502 locates the row elements in each webpage element, step 503 locates the column elements in each row element, which completes the location of each data element in the target webpage.
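The three locating steps can be sketched in Python as shown below. Note the substitution: the Python standard library has no CSS selector engine, so this sketch uses the limited XPath support of `xml.etree.ElementTree` as a stand-in. The roughly equivalent CSS selectors would be `#contacts`, `#contacts tr`, and `td`. The page markup, the use of `id` attributes as position identifiers, and the function name `locate` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; a real page may need an HTML parser instead.
PAGE = """<html><body>
<table id="contacts">
  <tr><td>Alice</td><td>555-0100</td></tr>
  <tr><td>Bob</td><td>555-0101</td></tr>
</table>
<div id="ads"><p>Buy now</p></div>
</body></html>"""


def locate(page_root, identifier):
    """Locate a webpage element, then its row elements, then the column elements of each row."""
    # Step 501: locate the webpage element by its position identifier (an id here).
    element = page_root.find(f".//*[@id='{identifier}']")
    if element is None:
        return None, []
    # Step 502: locate the row elements inside the webpage element.
    rows = element.findall(".//tr")
    # Step 503: locate the column elements inside each row element.
    return element, [row.findall("td") for row in rows]


root = ET.fromstring(PAGE)
element, grid = locate(root, "contacts")
```

Here `grid` is a list of rows, each a list of located column elements, ready for the extraction step.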
Based on the above implementation of step 102, and referring to Fig. 6, which is a flowchart of step 103 in embodiment three of the present application, step 103 can be implemented by the following steps:
Step 601: reference each data element.
It should be noted that, in this embodiment, step 103 can use the HTML DOM to extract the data content from the located target webpage. First, the HTML DOM is used to reference the located data elements in the target webpage.
Step 602: iteratively reference each row element in the data element.
As described above, a row element in the data element can be a data group made up of multiple data items of the same form and can be understood as a data iteration item. Therefore, when the data element exists in row-column form, the row elements are referenced by iterating over each row element in the data element.
Step 603: extract the data content of each column element in each row element to form the data content of the data element it belongs to.
The data content of a column element can exist as a text node. Therefore, after step 602 iteratively references a row element, step 603 can directly extract the data content of the text node in each column element of that row element to form the data content of the data element it belongs to. Fig. 7 shows the logical relationship between data elements, row elements, and column elements in this embodiment.
It should be noted that the data content of each data element extracted in step 103 can exist as data in JSON format.
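Steps 601 to 603 might look like the following sketch, which again uses `xml.etree.ElementTree` in place of a browser's HTML DOM. The table markup, the column names passed in, and the function name `extract_data_element` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def extract_data_element(element, column_names):
    """Iterate the row elements of a data element and collect the text of each column element."""
    rows = []
    for row in element.iter("tr"):  # step 602: iterate over each row element
        cells = row.findall("td")
        # Step 603: read the text-node content of every column element in the row.
        rows.append({
            name: "".join(cell.itertext()).strip()
            for name, cell in zip(column_names, cells)
        })
    return rows


# Step 601: obtain a reference to the located data element (parsed from a sample here).
table = ET.fromstring(
    "<table id='contacts'>"
    "<tr><td>Alice</td><td>555-0100</td></tr>"
    "<tr><td>Bob</td><td>555-0101</td></tr>"
    "</table>"
)
content = extract_data_element(table, ["name", "phone"])
content_json = json.dumps(content)  # data content carried as JSON
```

The column names are supplied by the caller because the sample markup has no header row; a template engine would presumably know them from the template.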
Referring to Fig. 8, which is a flowchart of step 104 in embodiment four of a webpage data obtaining method provided by the present application, step 104 can be implemented by the following steps:
Step 801: filter each data content according to the data filtering rule in the data acquisition request.
The data filtering rule can be submitted to the template engine together with the data acquisition request when the user triggers it. The user sets the data filtering rule in advance; the rule can indicate the data identifiers the user wants to retain or the data identifiers to be filtered out and discarded. For example, the user may need to filter out and discard advertisement data, news data, and the like, retaining only the user's input data.
Step 802: perform data format conversion on each filtered data content, the resulting data contents forming the current webpage data of the target webpage.
The data format of the current webpage data of the target webpage is the same as the data format of the original webpage data, the original webpage data being the initial data that was inserted into the webpage template to generate the target webpage.
It should be noted that the data contents extracted in step 103 usually exist in forms such as Boolean, integer, or pointer types. Before these data contents are fed back to the user, they need to be converted into data of the same type as the original webpage data, such as text strings. The converted data contents are then combined to obtain the current webpage data of the target webpage.
The original webpage data is the initial webpage data prepared by the user before the target webpage is generated. Therefore, after the template engine performs forward engineering on the original webpage data and then performs reverse engineering, the current webpage data it obtains has the same data format as the original webpage data.
For example, the template engine performs forward engineering on the user's input data, advertisement data, news data, and so on, and outputs a webpage. When the user needs to validate the data they entered, the template engine must perform reverse engineering to retrieve it. The template engine first locates each data element in the target webpage according to the recorded position identifiers of the webpage elements and then extracts the data content of each data element. It then filters out the advertisement data, news data, and the like, and performs data format conversion. The resulting data content, in the same format as the original user input, forms the webpage data, which is the current webpage data of the target webpage.
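Steps 801 and 802 can be sketched as a filter followed by a conversion. The sketch assumes that the filtering rule is a set of data identifiers to discard and that the original webpage data was plain text, so every retained value is converted to a string. The names `filter_and_convert` and `to_text` and the sample identifiers are hypothetical.

```python
def to_text(value):
    """Convert an extracted value to the text form assumed for the original webpage data."""
    if isinstance(value, bool):
        # Check bool before other types, because bool is a subclass of int in Python.
        return "true" if value else "false"
    return str(value)


def filter_and_convert(contents, discard):
    """Step 801: drop the identifiers named by the rule. Step 802: convert what remains."""
    kept = {name: value for name, value in contents.items() if name not in discard}
    return {name: to_text(value) for name, value in kept.items()}


extracted = {"age": 30, "subscribed": True, "ads": "Buy now", "news": "Headline"}
current_page_data = filter_and_convert(extracted, discard={"ads", "news"})
```

A rule that lists identifiers to retain instead of discard would only invert the membership test.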
Referring to Figure 9, which is a structural diagram of embodiment five of a template engine provided by the present application, the template engine comprises:
Identification record unit 901, configured to record, when the target web page is generated, the position identifier of each web page element in the page template of the target web page.
It should be noted that the page template in the template engine comprises at least one web page element. Each web page element usually controls the layout, font, color, background, and other effects of the target web page to be generated by means of Cascading Style Sheets (CSS), and the target web page is then output by running the template engine.
After the template engine generates the target web page, the position of each web page element in the template engine changes to a greater or lesser degree. The reason is that, once the target web page is generated, data has been placed into each web page element inside the page template, so each web page element is shifted by some offset value. The page template in the template engine therefore differs before and after the target web page is output, and what the identification record unit 901 records is the position identifier of each web page element in the page template of the target web page after the target web page is generated.
The identification record unit 901 is triggered to run before the embodiment of the present application obtains the current web page data of the target web page. That is, by the time the current web page data of the target web page needs to be obtained, the position identifier of each web page element in the page template of the target web page has already been recorded and can be used to obtain the current web page data.
Element positioning unit 902, configured to locate, in response to a data acquisition request for the target web page, each data element in the target web page according to the position identifier of each web page element.
The data acquisition request indicates that an operator, such as a user, needs to obtain the current web page data of the generated target web page. The request can be triggered by the user, which causes the element positioning unit 902 and the subsequent units to run.
It should be noted that, when the target web page is generated, data is placed into each web page element in the page template. After the target web page is generated, the data unit in each web page element in the page template is the data element corresponding to that web page element. Each data element contains its own data content.
Contents extracting unit 903, configured to extract the data content in each data element.
In each data element, the data content can be organized as rows containing columns or as columns containing rows. In the first case, each data element comprises at least one row element and each row element comprises at least one column element. A row element can be understood as a data iteration item, each column element stores part of the data content of the data element to which it belongs, and the data content in all column elements of all row elements forms the data content of the data element, as shown in Figure 2. For example, a data element may comprise multiple data groups, each made up of data items of the same type; the data groups then serve as row elements and the data items as column elements. In the second case, each data element comprises at least one column element and each column element comprises at least one row element. Each row element stores data content of the data element to which it belongs, and the data content in all row elements of all column elements forms the data content of the data element, as shown in Figure 3.
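The two layouts described above can be sketched as a small data model. This is a minimal illustration only: the class names `RowElement` and `DataElement`, the helper `transpose`, and the sample values are assumptions made for the example and do not appear in the application. The sketch also assumes every row has the same number of columns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RowElement:
    """One data iteration item; each column holds part of the content."""
    columns: List[str] = field(default_factory=list)


@dataclass
class DataElement:
    """A data element organized as rows that contain columns (Figure 2 style)."""
    rows: List[RowElement] = field(default_factory=list)

    def content(self) -> List[List[str]]:
        # The content of the data element is the content of all columns in all rows.
        return [list(row.columns) for row in self.rows]


def transpose(content: List[List[str]]) -> List[List[str]]:
    """Re-express row-major content as columns that contain rows (Figure 3 style).

    Assumes a rectangular layout: every row has the same number of columns.
    """
    return [list(column) for column in zip(*content)]


# Hypothetical sample: a news list where each row is one data group.
news = DataElement(rows=[
    RowElement(["Title A", "2013-07-01"]),
    RowElement(["Title B", "2013-07-02"]),
])
row_major = news.content()
column_major = transpose(row_major)
```

Both layouts carry the same data content; only the nesting order differs, which is why the extraction step has to know which layout a data element uses.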
It should be noted that the data content of each data element extracted by the contents extracting unit 903 can be data in JSON (JavaScript Object Notation) format.
Data capture unit 904, configured to obtain the current web page data of the target web page according to each data content.
The data capture unit 904 can directly combine the data contents to obtain the current web page data of the target web page. For example, the data contents are integrated according to the size of their storage space or the order of their corresponding position identifiers to obtain the current web page data of the target web page.
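As a rough sketch of combining data contents in the order of their position identifiers, the snippet below assumes that a position identifier is a `(start, end)` pair and that each data content is a list of strings. Both the representation and the function name `combine_by_position` are assumptions for illustration, not details given in the application.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # assumed form of a position identifier: (start, end)


def combine_by_position(contents: Dict[Position, List[str]]) -> List[str]:
    """Concatenate the data contents in ascending order of position identifier."""
    combined: List[str] = []
    for position in sorted(contents):
        combined.extend(contents[position])
    return combined


# Hypothetical extracted contents, deliberately listed out of page order.
extracted = {
    (17, 27): ["World news"],
    (4, 9): ["Hello"],
}
page_data = combine_by_position(extracted)
```

Ordering by position identifier reproduces the order in which the data appears in the target web page, regardless of the order in which the data elements were extracted.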
It should be noted that the process in which the template engine inserts the prepared raw page data into the page template and generates the target web page is called forward engineering. Forward engineering performs the task of inserting user input data (raw page data) in text or other formats into the page template and outputting the templated web page. Correspondingly, in the present application, the process in which the template engine obtains the current web page data from the target web page is called reverse engineering.
As can be seen from the above scheme, in embodiment five of the template engine provided by the present application, the position identifier of each web page element in the page template of the target web page is recorded in advance when the target web page is generated. After the target web page is generated, when the data of the target web page needs to be obtained, each data element in the target web page is first located according to the position identifier of each web page element, the data content in each data element is extracted, and the current web page data of the target web page is then obtained according to each data content. When obtaining the web page data of the target web page, the present application does not require developers to do DOM programming. After the template engine completes the task of outputting the target web page, the current web page data in the target web page can be extracted, in response to the user's need for the web page data, simply by referring to the pre-recorded position identifier of each web page element. No additional operating steps are needed, the cumbersome programming workflow is avoided, and the efficiency of obtaining web page data is improved.
In an actual implementation of the embodiment of the present application, when the identification record unit 901 locates each data element, the Cascading Style Sheets selector (CSS Selector) tool can be used to locate each data element in the target web page.
It should be noted that the CSS Selector tool is suitable for obtaining web page data in browsers that support CSS. When the browser hosting the target web page does not support CSS, for example with ExtJS, jQuery, or Dojo, the selector tool of the corresponding framework (ExtJS, jQuery, or Dojo) can be used to locate each data element in the target web page.
Referring to Figure 10, which is a structural diagram of the identification record unit 901 in embodiment six of a template engine provided by the present application, the identification record unit 901 comprises:
Offset value obtaining subunit 911, configured to obtain the offset value of each web page element in the page template when the target web page is generated.
As described above, after the target web page is generated, data has been placed into each web page element inside the page template, so each web page element is shifted by some offset value. The offset value obtaining subunit 911 therefore calculates the offset value of each web page element after data has been placed into it.
Identifier determining subunit 912, configured to determine the position identifier of each web page element when the target web page is generated, according to the offset value and the position identifier of each web page element in the page template before the target web page is generated.
The position identifier of each web page element in the page template before the target web page is generated can be obtained by using regular expressions, before the page template outputs its result, to calculate the starting point and ending point at which data is placed into each web page element; these points serve as the respective position identifiers. The identifier determining subunit 912 then calculates the position identifier of each web page element when the target web page is generated, based on the starting point and ending point at which data is placed into each web page element in the page template before the target web page is generated and on the offset value of each web page element in the page template after the target web page is generated.
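The calculation described above can be sketched as follows. The application does not specify a placeholder syntax or a representation for position identifiers, so this example assumes a hypothetical `{{name}}` placeholder and treats a position identifier as a `(start, end)` character range. The function name `render_and_record` is likewise an illustrative assumption.

```python
import re
from typing import Dict, Tuple

# Hypothetical placeholder syntax; the application does not define one.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_and_record(template: str, data: Dict[str, str]) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Fill the template and record where each piece of data ends up in the page."""
    output = []
    positions: Dict[str, Tuple[int, int]] = {}
    offset = 0  # cumulative shift caused by data already placed into the template
    cursor = 0
    for match in PLACEHOLDER.finditer(template):
        name = match.group(1)
        value = data[name]
        output.append(template[cursor:match.start()])
        # Position after generation = start point before generation + offset so far.
        start = match.start() + offset
        end = start + len(value)
        positions[name] = (start, end)
        output.append(value)
        # The inserted data is usually not the same length as the placeholder.
        offset += len(value) - (match.end() - match.start())
        cursor = match.end()
    output.append(template[cursor:])
    return "".join(output), positions


template = "<h1>{{title}}</h1><p>{{body}}</p>"
page, positions = render_and_record(template, {"title": "Hello", "body": "World news"})
# Reverse direction: slice the generated page at the recorded positions.
recovered = {name: page[start:end] for name, (start, end) in positions.items()}
```

Because the positions are recorded at generation time, the reverse step needs no parsing logic of its own; it only slices the generated page at the recorded ranges.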
Referring to Figure 11, which is a structural diagram of the element positioning unit 902 in embodiment seven of a template engine provided by the present application. In this embodiment, the data content in the data element is organized as rows containing columns, that is, the data element comprises at least one row element and each row element comprises at least one column element. Correspondingly, the element positioning unit 902 comprises:
Web page element locating subunit 921, configured to locate, in response to the data acquisition request for the target web page, each web page element in the target web page according to the position identifier of each web page element.
As described above, the browser of the target web page in the present application supports CSS. Therefore, when the element positioning unit 902 runs, the CSS Selector tool can be used to perform the task of locating each data element in the target web page.
The web page element locating subunit 921 shows that, when locating each data element in the target web page, each web page element in the page template of the target web page is located first.
Row element locating subunit 922, configured to locate the row elements in each web page element.
It should be noted that the data element in each web page element is organized as rows containing columns, that is, in the form shown in Figure 2, where the data element comprises at least one row element. After each web page element has been located, each row element in the web page element is located.
Column element locating subunit 923, configured to locate the column elements in each row element.
As described above, the data content in all column elements of all row elements forms the data content of the corresponding web page element. Therefore, after the row element locating subunit 922 has located the row elements in each web page element, the column element locating subunit 923 locates the column elements in each row element, which completes the location of each data element in the target web page.
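The three-stage location (web page element, then row elements, then column elements) can be sketched as below. The application performs this step with a CSS Selector tool in the browser; this sketch substitutes the limited XPath support of Python's `xml.etree.ElementTree` as a stand-in, which requires well-formed markup. The `id` and `class` values, the sample page, and the function name `locate` are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, well-formed sample page. Real HTML may need a dedicated HTML parser.
PAGE = """
<html><body>
  <div id="news">
    <ul>
      <li class="row"><span class="col">Title A</span><span class="col">2013-07-01</span></li>
      <li class="row"><span class="col">Title B</span><span class="col">2013-07-02</span></li>
    </ul>
  </div>
  <div id="ads"><p class="row"><span class="col">Buy now</span></p></div>
</body></html>
"""


def locate(page: str, element_id: str) -> List[List[ET.Element]]:
    """Locate a web page element, then its row elements, then their column elements."""
    root = ET.fromstring(page.strip())
    # Stage 1: locate the web page element (here identified by an assumed id attribute).
    web_element = root.find(f".//*[@id='{element_id}']")
    if web_element is None:
        return []
    located = []
    # Stage 2: locate each row element inside the web page element.
    for row in web_element.findall(".//*[@class='row']"):
        # Stage 3: locate each column element inside the row element.
        located.append(row.findall(".//*[@class='col']"))
    return located


located = locate(PAGE, "news")
shape = [len(columns) for columns in located]
```

Narrowing the search scope at each stage keeps rows and columns of one web page element from being confused with those of another, such as the `ads` block in the sample.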
Based on the above implementation of the element positioning unit 902, in embodiment seven of the present application, the contents extracting unit 903 can use the HTML DOM to extract the data content from the target web page in which the elements have been located. First, the HTML DOM is used to reference the located data element in the target web page. Next, each row element in the data element is referenced iteratively, and the data content in each column element of each row element is extracted to form the data content of the data element to which it belongs.
As described above, a row element in the data element can serve as a data group made up of multiple data items of the same form and can be understood as a data iteration item. Therefore, when the data element is organized as rows containing columns, each row element in the data element is referenced iteratively. The data content in a column element can exist in the form of a text node. Therefore, after the contents extracting unit 903 iteratively references a row element, it can directly extract the data content from the text node in each column element of that row element to form the data content of the data element to which it belongs. Figure 7 shows the logical relationship among data elements, row elements, and column elements in the embodiment of the present application.
It should be noted that the data content in each data element extracted by the contents extracting unit 903 can exist as data in JSON format.
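The extraction step can be sketched as iterating over the row elements of an already located data element, reading the text node of each column element, and serializing the result as JSON. The sketch assumes the rows are `tr` elements and the columns are `td` elements, and uses `xml.etree.ElementTree` in place of a browser's HTML DOM. The function name `extract_content` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def extract_content(data_element: ET.Element) -> str:
    """Iterate over row elements, read each column's text node, return JSON."""
    rows = []
    for row in data_element.iter("tr"):          # assumed row tag
        columns = [(cell.text or "").strip()     # text node of each column element
                   for cell in row.iter("td")]   # assumed column tag
        rows.append(columns)
    return json.dumps(rows, ensure_ascii=False)


# Hypothetical located data element.
table = ET.fromstring(
    "<table>"
    "<tr><td>Title A</td><td> 2013-07-01 </td></tr>"
    "<tr><td>Title B</td><td>2013-07-02</td></tr>"
    "</table>"
)
content_json = extract_content(table)
```

Keeping the nested list shape in the JSON output preserves the row and column structure, so a later step can still tell which values belonged to the same data group.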
Referring to Figure 12, which is a structural diagram of the data capture unit 904 in embodiment eight of a template engine provided by the present application, the data capture unit 904 can comprise:
Data filtering subunit 941, configured to perform data filtering on each data content according to the data filtering rule in the data acquisition request.
The data filtering rule can be set in advance by the user and submitted to the template engine together with the data acquisition request when the user triggers the request. The data filtering rule can indicate the data identifiers that the user needs to retain or the data identifiers that need to be filtered out. For example, the user needs advertisement data, news data, and the like to be discarded and only the user's input data to be retained.
Format conversion subunit 942, configured to perform data format conversion on each data content that has undergone data filtering, the resulting data content forming the current web page data of the target web page;
wherein the data format of the current web page data of the target web page is the same as the data format of the raw page data, and the raw page data is the initial data that is placed into the page template to generate the target web page.
It should be noted that the data content extracted by the contents extracting unit 903 usually exists in forms such as Boolean, integer, and pointer types. Before this data content is fed back to the user, the data content in these different formats needs to undergo format conversion to obtain data of the same type as the raw page data, such as text strings. The converted data contents are then combined to obtain the current web page data of the target web page.
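The filtering and format conversion steps can be sketched as follows. The sketch assumes each extracted data content carries a `kind` field that plays the role of the data identifier, that the filtering rule is a list of identifiers to discard, and that the raw page data was plain text. The field names, the rule format, and the function names are illustrative assumptions only.

```python
from typing import Any, Dict, Iterable, List


def filter_contents(contents: List[Dict[str, Any]], discard: Iterable[str]) -> List[Dict[str, Any]]:
    """Drop every data content whose identifier is listed in the filtering rule."""
    discard_set = set(discard)
    return [content for content in contents if content["kind"] not in discard_set]


def to_text(value: Any) -> str:
    """Convert a Boolean, integer, or other value to the text format of the raw data."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "true" if value else "false"
    return str(value)


def current_page_data(contents: List[Dict[str, Any]], discard: Iterable[str]) -> List[str]:
    """Filter first, then convert each remaining data content and combine the results."""
    return [to_text(content["value"]) for content in filter_contents(contents, discard)]


# Hypothetical extracted contents mixing user input with advertisement and news data.
extracted = [
    {"kind": "user_input", "value": "Shanghai"},
    {"kind": "ad", "value": "Buy now"},
    {"kind": "user_input", "value": 42},
    {"kind": "news", "value": "Headline"},
    {"kind": "user_input", "value": True},
]
result = current_page_data(extracted, discard=["ad", "news"])
```

Filtering before conversion avoids spending conversion work on data that will be discarded anyway, and the final list contains only text, matching the assumed format of the user's original input.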
The raw page data is the initial web page data prepared by the user before the target web page is generated. Therefore, after the template engine performs forward engineering according to the raw page data and then performs reverse engineering, the current web page data obtained is data in the same data format as the raw page data.
For example, suppose the template engine performs forward engineering on the data input by the user, advertisement data, news data, and the like, and outputs a certain web page. When the user needs to verify the data he or she input, the template engine must perform reverse engineering to obtain that data. The template engine first locates each data element in the target web page according to the recorded position identifier of each web page element, and then extracts the data content in each data element. After filtering out the advertisement data, news data, and the like and converting the data format of each data content, it obtains data content in the same data format as the original user input. This data content forms the web page data, that is, the current web page data of the target web page.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar, the embodiments can be referred to one another.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include", and any other variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device comprising that element.
The web page data obtaining method and the template engine provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementations and the scope of application may change according to the idea of the present invention. In summary, the contents of this description should not be construed as limiting the present application.

Claims (10)

1. A web page data obtaining method, characterized in that the position identifier of each web page element in the page template of a target web page is recorded when the target web page is generated, the method comprising:
in response to a data acquisition request for the target web page, locating each data element in the target web page according to the position identifier of each web page element;
extracting the data content in each data element;
obtaining the current web page data of the target web page according to each data content.
2. The method according to claim 1, characterized in that locating each data element in the target web page according to the position identifier of each web page element comprises:
according to the position identifier of each web page element, locating each data element in the target web page using the Cascading Style Sheets selector (CSS Selector) tool.
3. The method according to claim 1 or 2, characterized in that recording the position identifier of each web page element in the page template of the target web page when the target web page is generated comprises:
obtaining the offset value of each web page element in the page template when the target web page is generated;
determining the position identifier of each web page element when the target web page is generated, according to the offset value and the position identifier of each web page element in the page template before the target web page is generated.
4. The method according to claim 1 or 2, characterized in that the data element comprises at least one row element, and each row element comprises at least one column element;
correspondingly, locating each data element in the target web page according to the position identifier of each web page element comprises:
locating each web page element in the target web page according to the position identifier of each web page element;
locating the row elements in each web page element;
locating the column elements in each row element.
5. The method according to claim 1 or 2, characterized in that obtaining the current web page data of the target web page according to each data content comprises:
performing data filtering on each data content according to the data filtering rule in the data acquisition request;
performing data format conversion on each data content that has undergone data filtering, the resulting data content forming the current web page data of the target web page;
wherein the data format of the current web page data of the target web page is the same as the data format of the raw page data, and the raw page data is the initial data that is placed into the page template to generate the target web page.
6. A template engine, characterized by comprising:
an identification record unit, configured to record, when a target web page is generated, the position identifier of each web page element in the page template of the target web page;
an element positioning unit, configured to locate, in response to a data acquisition request for the target web page, each data element in the target web page according to the position identifier of each web page element;
a contents extracting unit, configured to extract the data content in each data element;
a data capture unit, configured to obtain the current web page data of the target web page according to each data content.
7. The template engine according to claim 6, characterized in that:
the element positioning unit is specifically configured to locate, according to the position identifier of each web page element, each data element in the target web page using the Cascading Style Sheets selector (CSS Selector) tool.
8. The template engine according to claim 6 or 7, characterized in that the identification record unit comprises:
an offset value obtaining subunit, configured to obtain the offset value of each web page element in the page template when the target web page is generated;
an identifier determining subunit, configured to determine the position identifier of each web page element when the target web page is generated, according to the offset value and the position identifier of each web page element in the page template before the target web page is generated.
9. The template engine according to claim 6 or 7, characterized in that the data element comprises at least one row element, and each row element comprises at least one column element;
correspondingly, the element positioning unit comprises:
a web page element locating subunit, configured to locate each web page element in the target web page according to the position identifier of each web page element;
a row element locating subunit, configured to locate the row elements in each web page element;
a column element locating subunit, configured to locate the column elements in each row element.
10. The template engine according to claim 6 or 7, characterized in that the data capture unit comprises:
a data filtering subunit, configured to perform data filtering on each data content according to the data filtering rule in the data acquisition request;
a format conversion subunit, configured to perform data format conversion on each data content that has undergone data filtering, the resulting data content forming the current web page data of the target web page;
wherein the data format of the current web page data of the target web page is the same as the data format of the raw page data, and the raw page data is the initial data that is placed into the page template to generate the target web page.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310273053.6A CN104281575A (en) 2013-07-01 2013-07-01 Webpage data obtaining method and template engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310273053.6A CN104281575A (en) 2013-07-01 2013-07-01 Webpage data obtaining method and template engine

Publications (1)

Publication Number Publication Date
CN104281575A true CN104281575A (en) 2015-01-14

Family

ID=52256459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310273053.6A Pending CN104281575A (en) 2013-07-01 2013-07-01 Webpage data obtaining method and template engine

Country Status (1)

Country Link
CN (1) CN104281575A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965783A (en) * 2015-06-16 2015-10-07 百度在线网络技术(北京)有限公司 Method and apparatus for monitoring web content presentation
CN106991131A (en) * 2017-03-08 2017-07-28 陕西识代运筹信息科技股份有限公司 A kind of data processing method and device
CN109522018A (en) * 2018-11-14 2019-03-26 腾讯科技(深圳)有限公司 Page processing method, device and storage medium
CN109522018B (en) * 2018-11-14 2021-06-18 腾讯科技(深圳)有限公司 Page processing method and device and storage medium
CN109684571A (en) * 2018-12-28 2019-04-26 咪咕文化科技有限公司 A kind of collecting method and device, storage medium
CN109684571B (en) * 2018-12-28 2021-02-05 咪咕文化科技有限公司 Data acquisition method and device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150114