CN101576885A - Technical scheme for extracting dynamic generation web page contents - Google Patents

Publication number
CN101576885A
Authority
CN
China
Prior art keywords
area
data
web page
positioning
extractive technique
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008100941885A
Other languages
Chinese (zh)
Other versions
CN101576885B (en)
Inventor
韩露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN200810094188.5A priority Critical patent/CN101576885B/en
Publication of CN101576885A publication Critical patent/CN101576885A/en
Application granted granted Critical
Publication of CN101576885B publication Critical patent/CN101576885B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a technical scheme relating to applications of computer software, hardware, and combinations of the two in the network field. It allows a client (such as a browser) or a network forwarding node (such as a proxy server) that requests web page data to extract specific content from dynamic web page data accurately and flexibly. For given web page data, a smaller subarea (hereinafter the result area) is located within a larger data lookup area (hereinafter the lookup area) according to a group of area positioning element information. The result area is determined by two positioning pointers into the lookup area: the two pointers mark two positions in the lookup area, and the area between those positions is the result area. The method can be applied repeatedly as required. Each subsequent lookup uses the previous result area as its lookup area and uses that round's own area positioning element information to determine the pointer positions, so that the lookup area and the result area shrink step by step until the result area is exactly the expected target data, at which point positioning and extraction are complete. Each positioning pointer is positioned by searching for a sign regular expression according to a specific rule.

Description

Technical scheme for extracting dynamically generated web page contents
Technical field
The present invention proposes a technical scheme relating to applications of computer software, hardware, and combinations of the two in the network field. It enables a client that requests web page data (such as a browser) or a network forwarding node (such as a proxy server) to extract specific content from dynamic web page data accurately and flexibly.
Background art
Web pages widely used on the Internet are data described in a text format and formed according to the specification of a particular computer language. Such languages are markup languages designed for creating web pages and for viewing information in a web browser. Among them, the HyperText Markup Language (hereinafter HTML) is the most widely used and supported. It is an international standard formulated by the World Wide Web Consortium (hereinafter W3C).
HTML describes every aspect of a web page as plain text data, including its textual content, page layout, presentation style, and references to other types of content (such as images, video, and sound). Following this description, the browser presents the page content in the user interface in the manner prescribed by the W3C standard. Because the HTML language is standardized, a page can be presented to the reader in the form that the page author designed in advance. HTML is supported and used by all mainstream browsers and is the foundation and core technology of web page browsing on the Internet.
On top of the standard HTML specification, individual browser developers and vendors have introduced additional markup information, such as the VBScript scripting language proposed by Microsoft and the JavaScript scripting language proposed by Netscape. When this additional markup is attached to HTML-based information, the corresponding browsers can produce display effects and additional functions that the HTML standard does not provide.
In addition, there are other, non-HTML text-based language specifications, usually derived from HTML, which share all of the characteristics of HTML described above.
Standard HTML data, HTML extended with non-standard content, and other non-HTML text-format web page description languages all fall within the scope of the present invention. In this description they are referred to collectively as web page languages and web page data.
The typical scenario in which web page data is used is as follows. Through the browser's interface, a page reader requests web page data stored at a specified network location by typing the location manually, selecting a saved bookmark, or clicking a link within a web page. The network location is described by a Uniform Resource Locator (hereinafter URL) and usually refers to a web page file maintained by a web server. This web page file may map to a file that actually exists on a storage medium, or it may be a virtual file. After the web page data is delivered from the source network location to the browser, the browser displays the information contained in the page to the reader, using the presentation and format described by the data, in the manner defined by the web page language.
Depending on the application scenario, the user-side software may be a browser, or it may be any other software or hardware that submits or forwards web page requests in URL form to a web server and obtains the corresponding web page data. The communication and interaction involved are carried out using the Hypertext Transfer Protocol (hereinafter HTTP).
On the server side, web page data may be produced in two ways:
1) Static web page data: all of the web page data (including page content and presentation information) is written in advance by web designers and stored as files on a storage medium that the web server can access directly. On receiving a client request for a page, the web server reads the corresponding file from the storage medium and sends its content directly to the client. The data the client obtains is identical to what the designers wrote in advance.
2) Dynamic web page data: part of the web page data (including page content and presentation information) is generated dynamically by the server. Typically, web developers use page template data to define the static part of the page as its framework, and then use server-side scripting or other programming techniques to attach code that generates dynamic content at the appropriate positions in the page template data. On receiving a client request for a page, the web server reads the static part of the web page data, executes the dynamic content generation code associated with the page to obtain the dynamic part, and combines the two in the predetermined manner to form the web page data finally sent to the client. Repeated visits by a client to the same page may yield different web page data because of differences in the parameters passed or in the access time.
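The difference between the two kinds of data can be illustrated with a minimal sketch. The template text, the `render_page` function, and its parameters are hypothetical and chosen only for illustration; a real server would use its own scripting technology.

```python
from datetime import datetime
from string import Template

# Hypothetical page template: the static framework shared by every generated page.
# Only $headline and $stamp are filled in per request.
PAGE_TEMPLATE = Template(
    "<html><body><h1>News</h1>"
    "<div id='main'>$headline</div>"
    "<p>Generated at $stamp</p></body></html>"
)

def render_page(headline: str, now: datetime) -> str:
    """Combine the static template with the dynamically generated parts."""
    return PAGE_TEMPLATE.substitute(headline=headline, stamp=now.isoformat())

# Two requests for the same page at different times yield different data,
# but both share the static framework of the template.
page_a = render_page("Rain expected", datetime(2008, 5, 6, 9, 0))
page_b = render_page("Sunny skies", datetime(2008, 5, 7, 9, 0))
```

The static fragments that surround `$headline` are the kind of stable text that a later extraction step can rely on as landmarks.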
It should be noted that, with caching, a web server can temporarily keep previously read static pages or previously generated dynamic pages in memory and, under certain conditions, send the cached data directly in response to later requests for the same URL. This does not change the way the page was originally generated or its classification. That is, if the web page data was originally generated dynamically, then even when caching lets some later requests obtain the data directly without generating it again, those requests are still regarded as reading dynamic web page data.
In some cases, a client or forwarding node needs to obtain all or part of the dynamic information in web page data and then make use of the extracted data. At present, this is mainly done in the following ways:
1) Regular expressions: a regular expression is a concise description of a character combination pattern. By using wildcards, a single regular expression string can express any string that fits the pattern. Here, the features of the target string are described by a regular expression, and matching strings are extracted from, or excluded from, the full text of the web page data. This approach suits data with a relatively fixed, distinctive, and simple format, such as URLs, specially formatted data, and data in tables. For text whose format is highly repetitive or whose structure is complex, however, its accuracy and applicability often fall short.
2) HTML tag removal: every HTML tag in the web page data, together with the format information attached to it, is filtered out and discarded, and only the text content is kept. This converts an HTML data file entirely into plain text and suits web page data that has a simple structure and little interfering information. For pages containing a large amount of interfering text, however, it can neither distinguish between different texts nor accurately extract the valid data.
3) Absolute positions: for formats with a fixed number of characters, the absolute positions of the start and end of the target data within the complete data are specified. This extracts data accurately from web page text whose format is very strict and fixed, but most web pages do not follow so strict a format.
4) Similarity-based exclusion: a group of sample pages is provided, and the parts of the target web page data that are identical or similar to the sample pages are excluded. This can obtain the dynamic part of the web page data fairly accurately, but it cannot distinguish well between different pieces of dynamic data and therefore cannot exclude interfering information within the dynamic data. It is also relatively inefficient and can incur considerable system overhead.
As the above analysis shows, each of these four extraction methods suits certain scenarios, but each has its own defects and limitations. In summary, none of the four provides a complete, flexible, and accurate means of extracting complex dynamic web page data: they either cannot adapt to different types of dynamic data or cannot effectively distinguish and exclude the interfering information in dynamic data.
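As a rough illustration of the first two approaches and their limits, the following sketch applies them to a small hypothetical page fragment. The markup and the patterns are assumptions made for this example only.

```python
import re

# Hypothetical page fragment: navigation, the wanted story, and an advertisement.
html = ("<div class='nav'>Home | Sports</div>"
        "<div class='story'>Flood warning issued.</div>"
        "<div class='ad'>Buy now!</div>")

# Approach 1: a regular expression works when the target has a simple, distinctive format.
story = re.findall(r"<div class='story'>(.*?)</div>", html)

# Approach 2: removing every tag keeps all text, including the interfering parts.
text_only = " ".join(re.sub(r"<[^>]+>", " ", html).split())
```

Here `story` holds only the wanted sentence, while `text_only` still mixes the navigation and advertisement text with it, which is the weakness noted for tag removal.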
Summary of the invention
The present invention proposes a method and process for locating and extracting content in dynamic web page data, which can extract designated subject content from predetermined websites simply, flexibly, and accurately.
To facilitate the description that follows, two concepts are introduced first:
1. Page template data:
In descriptions of dynamic web page data, a page template generally refers to the part of a web page that is fixed in advance and does not change, usually including the format information for displaying the page and the content shared with other similar pages. In dynamic web page development, the author typically separates the page template from the target page, designs it on its own, saves it as a series of static files (the template files), and then attaches the dynamic content generation code on the basis of those files. In some cases, however, the dynamic content generation code also produces content that never changes, or output that follows a very strict format (such as personal titles in a prescribed form, or dates and times). In the rest of this description, the static data produced in both of the above cases, together with such strict format definitions, are all referred to as page template data, or as the static part of a dynamic page (in the broad sense).
2. Dynamic web page data of the same template (homologous dynamic web pages):
The same page template data may be used by one or more dynamic web pages. When it is used exclusively by one dynamic web page, it may correspond to different URLs (because different parameters are passed to the page) or to the same URL (where visits at different times may produce different dynamic content). When it is shared by several dynamic web pages, it corresponds to several different URLs. In both cases, all dynamic web page data generated from the same page template data is regarded as an instance of that template and is referred to in the rest of this description as dynamic web page data of the same template (or homologous dynamic web pages for short).
The technical scheme is applied to predetermined homologous dynamic web pages. A predefined URL matching rule determines the source page template data to which a given page belongs, and thereby the extraction element information configured for it in advance. Using this extraction element information, together with the specific rules formulated for these homologous dynamic web pages, the data of the page instance is processed and the required content extracted according to the method and process defined by this scheme. Where the source of the web page data is fixed by agreement, the agreed extraction element information may be used directly, with no URL matching needed to select it.
The main principle of locating and extracting designated content is as follows: according to a group of area extraction element information, a smaller subarea (hereinafter the results area) is located within a larger data search area (hereinafter the seek area). The results area is determined by two positioning pointers into the seek area: the two pointers mark two positions inside the seek area, and the area between those positions is the results area. The method can be applied repeatedly as required. Each subsequent search uses the previous results area as its seek area and uses that round's own area positioning element information to determine the pointer positions. In this way the seek area and the results area are narrowed step by step until the results area is exactly the target data expected in advance, at which point the locating and extraction of the data are complete.
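The narrowing principle can be sketched in a few lines. This is a simplified illustration, not the patented procedure in full: it uses plain strings instead of regular expressions, always takes the first start marker and the last end marker, and excludes both markers from the result. The `narrow` function and the sample page are hypothetical.

```python
def narrow(area: str, start_marker: str, end_marker: str) -> str:
    """Return the text between the first start_marker and the last end_marker."""
    begin = area.index(start_marker) + len(start_marker)  # first positioning pointer
    end = area.rindex(end_marker)                         # second positioning pointer
    return area[begin:end]

page = ("<html><div id='main'><h2>Title</h2><p>Body text</p></div>"
        "<div id='foot'>x</div></html>")

# Round 1: the whole page is the seek area; the result becomes the next seek area.
round1 = narrow(page, "<div id='main'>", "<div id='foot'>")
# Round 2: search only inside the round-1 result.
round2 = narrow(round1, "<p>", "</p>")
```

After the second round the result is exactly the target text, so no further rounds are needed.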
A positioning pointer is positioned by matching a sign regular expression within the seek area. As a special case, if the sign regular expression contains no wildcard, it is a specific character string and the matching operation is an exact string search. The positioning element information of a positioning pointer comprises:
1) The sign regular expression
2) The search start position: the beginning or end of the data seek area, or a position identified by a byte count (for the second positioning pointer of a round, the search may also start from the position of the first positioning pointer or from a position relative to it)
3) The search direction: forward (toward the head of the area) or backward (toward the tail of the area)
4) The match count, that is, which occurrence of a sign regular expression match the pointer is positioned at
5) Whether the results area includes the string finally matched by the positioning pointer
The positioning element information of the two positioning pointers together forms the complete positioning element information of one round of searching. The sign regular expressions used for positioning are all formulated from the source page template data and may differ from one another; the specified match counts are positive integers and may also differ. There is one special case: a round may position only one pointer, in which case the results area is the range between that pointer and one boundary (head or tail) of the seek area. This special case can be regarded as setting the second positioning pointer directly on the specified boundary, and it still conforms to the principle and method of the invention. The complete positioning element information of every round, together with the order of the rounds and similar information, forms the extraction element information for the page template data and supports content extraction from the dynamic web page data of the corresponding template.
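One possible way to hold this information in a program is sketched below. The class and field names (`PointerSpec`, `RoundSpec`, `start`, `offset`, and so on) are hypothetical; the text specifies only which pieces of information exist, not how they are represented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PointerSpec:
    pattern: str                 # 1) sign regular expression
    start: str = "head"          # 2) "head", "tail", or "first_pointer"
    offset: int = 0              # 2) optional byte offset from that start position
    direction: str = "backward"  # 3) "backward" (toward the tail) or "forward" (toward the head)
    occurrence: int = 1          # 4) which match the pointer stops at
    include_match: bool = False  # 5) whether the matched string belongs to the results area

    def __post_init__(self) -> None:
        # The text requires the match count to be a positive integer.
        if self.occurrence < 1:
            raise ValueError("occurrence must be a positive integer")

@dataclass(frozen=True)
class RoundSpec:
    first: PointerSpec
    # None models the special case: the second pointer sits directly on a boundary.
    second: Optional[PointerSpec] = None
    boundary: str = "tail"       # boundary used when second is None

spec = PointerSpec(pattern=r"<td>", occurrence=3)
single_pointer_round = RoundSpec(first=spec)
```

An ordered list of `RoundSpec` objects would then play the role of the extraction element information for one page template.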
All of this configurable information is obtained by analyzing the source page template data; where the original page template data is not available, it can also be obtained by analyzing and generalizing from the homologous dynamic web pages themselves. Because homologous dynamic web pages share page template data, suitable sign regular expressions can generally be chosen from the static part of the data, which ensures that the technique is widely applicable and general. The area positioning element information depends heavily on the specific page template data, so each group of homologous dynamic web pages needs its own extraction element information. This combined information can be expressed in a specific format and stored in an external file or database, and associated with the corresponding homologous dynamic web pages by URL matching, which makes it flexible to adjust and easy to extend. It can also be written directly into program code and embodied in the program logic; where the external configuration does not supply all of the positioning elements, the program code carries out the extraction using default behavior.
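A minimal sketch of associating configuration with pages by URL matching follows. The URL patterns, the rule-set names, and the `select_rules` function are hypothetical; in practice the table could be loaded from an external file or database.

```python
import re

# Hypothetical table: each entry ties a URL pattern to a named rule set.
EXTRACTION_RULES = [
    {"url_pattern": r"^http://news\.example\.com/article\?id=\d+$", "rules": "news-article"},
    {"url_pattern": r"^http://shop\.example\.com/item/\d+$", "rules": "shop-item"},
]

# Fallback used when no external entry matches (default behavior in program code).
DEFAULT_RULES = "built-in-default"

def select_rules(url: str) -> str:
    """Return the rule set for the first URL pattern that matches, else the default."""
    for entry in EXTRACTION_RULES:
        if re.search(entry["url_pattern"], url):
            return entry["rules"]
    return DEFAULT_RULES
```

Keeping the table outside the extraction logic means a changed page template only requires editing a configuration entry.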
In most cases this method alone suffices to extract the main text content of a dynamic web page. By applying the method to the same page data with different combinations of extraction element information, several groups of subject content in the same page can also be extracted separately. For other types of target data (such as links, pictures, and video), after the area has been sufficiently narrowed by the above method, regular expression matching can be used to extract the information with the corresponding features.
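The last step can be sketched as follows, assuming the area has already been narrowed. The sample fragment and the two patterns are hypothetical and deliberately simple; they assume single-quoted attributes in a fixed order.

```python
import re

# Hypothetical results area left after narrowing.
result_area = ("<p>See <a href='http://example.com/a'>A</a> and "
               "<img src='http://example.com/pic.png'> plus "
               "<a href='http://example.com/b'>B</a></p>")

# Within the narrowed area, simple patterns are enough to pick out typed items.
links = re.findall(r"<a\s+href='([^']+)'", result_area)
images = re.findall(r"<img\s+src='([^']+)'", result_area)
```

Because the search runs only inside the narrowed area, links and images elsewhere on the page cannot leak into the result.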
The beneficial effect of the invention is that it provides a general method and process for extracting predetermined content from dynamic web pages. It is not restricted by the page's content type, display style, layout, or design language, and can cover essentially all forms in which dynamic web pages are organized. It is customizable and flexible with respect to the target content, is particularly suited to extracting specific topics and sections within a page, and excludes irrelevant information well. For dynamic web pages with complex structure and large data volume, it can still extract information accurately and efficiently. Each round of area positioning is relatively independent and does not alter the data in the target area, so the method is easy to combine with other extraction techniques.
Description of drawings
The features regarded as characteristic and inventive of the present invention are set forth in the appended claims. The invention itself and its manner of use, however, are best understood by reading the following detailed description of an illustrative embodiment with reference to the accompanying drawings. The invention is further described below with reference to the figures and the embodiment:
Fig. 1 schematically shows an exemplary embodiment of the invention (device part)
Fig. 2 schematically shows an exemplary embodiment of the invention (content extraction process)
In the figures:
1. Head position of the first-round seek area
2. Tail position of the first-round seek area
3. First positioning pointer of the first-round search, i.e. the head position of the first-round results area and of the second-round seek area
4. Second positioning pointer of the first-round search, i.e. the tail position of the first-round results area and of the second-round seek area
5. First positioning pointer of the second-round search, i.e. the start position of the second-round results area and of the extracted target data
6. Second positioning pointer of the second-round search, i.e. the end position of the second-round results area and of the extracted target data
10. Search direction of the first positioning pointer in the first-round search (backward in this example)
20. Search direction of the second positioning pointer in the first-round search (forward in this example)
30. Search direction of the first positioning pointer in the second-round search (backward in this example)
40. Search direction of the second positioning pointer in the second-round search (forward in this example)
101. First-round search: the first n1-1 matches of the sign regular expression encountered by the first positioning pointer (n1 is the predetermined match count for this positioning; n1=3 in this example)
102. First-round search: the n1-th match of the sign regular expression encountered by the first positioning pointer (n1 is the predetermined match count for this positioning; n1=3 in this example)
103. First-round search: the first m1-1 matches of the sign regular expression encountered by the second positioning pointer (m1 is the predetermined match count for this positioning; m1=2 in this example)
104. First-round search: the m1-th match of the sign regular expression encountered by the second positioning pointer (m1 is the predetermined match count for this positioning; m1=2 in this example)
105. Second-round search: the first n2-1 matches of the sign regular expression encountered by the first positioning pointer (n2 is the predetermined match count for this positioning; n2=4 in this example)
106. Second-round search: the n2-th match of the sign regular expression encountered by the first positioning pointer (n2 is the predetermined match count for this positioning; n2=4 in this example)
108. Second-round search: the m2-th match of the sign regular expression encountered by the second positioning pointer (m2 is the predetermined match count for this positioning; m2=1 in this example)
200. Seek area of the first-round search, i.e. the complete web page data
201. Results area of the first-round search, i.e. the seek area of the second-round search
202. Results area of the second-round search, i.e. the target area of the data to be extracted
500. Web page access extended information set
501. Web page data
502. Additional context information of the web page access (such as the URL)
600. Content extraction engine
601. Web page content extraction module
602. Extraction element information selection module
603. Extraction element information corresponding to the web page
604. Selection information for web page extraction elements
610. Functional module for other additional extraction methods
620. Extracted content result
Embodiment
Embodiments of the invention are described below with reference to the drawings. In the following description, many specific technical details are set forth to provide a fuller understanding of the invention. It will be clear to those skilled in the art, however, that the invention can be practiced without some of these specific details. The complete description of the invention is given in the Summary of the invention section above, and its scope is defined by the language of the appended claims.
Fig. 1 schematically shows the device part of an exemplary embodiment of the invention. Its core is the web page content extraction engine, or web page content extraction program module (600), which contains the submodules described below.
When content is extracted from given dynamic web page data, the input to the module shown is the web page access extended information set (500), which comprises the web page data (501) and the additional context information of the web page access (502). The web page data (501) is the original data content and the source data for content extraction. The additional context information of the web page access (502) is specific auxiliary information (such as the URL of the page); based on it, the web page content extraction engine (600) can select the content extraction elements and rules specific to the homologous dynamic web pages concerned for the subsequent extraction.
The additional context information of the web page access (502) is passed to the extraction element information selection module (602). Using the selection information for web page extraction elements (604), this module selects the extraction element information whose conditions match, namely the extraction element information corresponding to the web page (603). The selection information (604) is a set of correspondence rules between the additional context information of the web page access (502) and the extraction element information corresponding to the web page (603). The selection module (602) indicates the selected extraction element information (603) to the web page content extraction module (601), which then operates on the web page data (501) accordingly.
The web page content extraction module (601) selects and extracts content from the web page data (501) in the manner specified by the extraction element information corresponding to the web page (603), producing the extracted content result (620). Optionally, the web page content extraction engine (600) may also contain a functional module for other additional extraction methods (610), which can apply other content selection and extraction methods to the output of the web page content extraction module (601) to obtain the final extracted content result (620).
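The data flow between the modules can be sketched as plain function composition. The function `run_engine` and the toy stand-ins below are hypothetical; they only mirror the order in which the modules (602), (601), and the optional (610) act on the inputs (501) and (502).

```python
from typing import Callable, Optional

def run_engine(page_data: str,
               context_url: str,
               select_rules: Callable[[str], dict],
               extract: Callable[[str, dict], str],
               post_process: Optional[Callable[[str], str]] = None) -> str:
    rules = select_rules(context_url)    # selection module (602) yields (603)
    result = extract(page_data, rules)   # extraction module (601)
    if post_process is not None:         # optional additional module (610)
        result = post_process(result)
    return result                        # extracted content result (620)

# Toy stand-ins for the modules, for illustration only.
def demo_select(url: str) -> dict:
    return {"begin": "<b>", "end": "</b>"}

def demo_extract(data: str, rules: dict) -> str:
    i = data.index(rules["begin"]) + len(rules["begin"])
    j = data.index(rules["end"], i)
    return data[i:j]

raw = run_engine("<p><b> hello </b></p>", "http://example.com/x", demo_select, demo_extract)
out = run_engine("<p><b> hello </b></p>", "http://example.com/x", demo_select, demo_extract, str.strip)
```

Passing the modules in as callables reflects the point that the division is logical; an implementation is free to merge or split them differently.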
It should be pointed out that the above embodiment is only a preferred example of the device of the invention. In practice the invention may be implemented differently from this example without affecting its applicability. For example, where the extraction element information corresponding to the web page (603) can be obtained without external information, the additional context information of the web page access (502) is not needed; where a particular page source and its extraction element information are fixed by agreement, the extraction element information selection module (602) and the selection information for web page extraction elements (604) are not needed. As another example, when the extraction element information (603) for all pages is agreed in advance and implemented inside the program of the web page content extraction module (601), no externalized extraction element information (603) is needed. As a further example, the functional module for other additional extraction methods (610) is optional. When it is used, it may be placed at the end of the process as shown in this example; if its output is a single continuous area, or can be merged into one, it may instead be placed before the web page content extraction module (601); and when there are two or more web page content extraction modules (601), the functional module (610) may be placed between them in their serial operation. In addition, the modules in this figure are divided by logical function for ease of description; actual software or devices applying the invention need not follow the same division in code encapsulation, compilation and linking, or physical construction.
Fig. 2 schematically shows contents extraction flow process part in the exemplary embodiments of the present invention, and this flow process is the core operation of web page contents extraction module (601).Web page contents extraction module (601) can carry out content choice and extraction with the latter for the initial operation data according to the former specified mode behind the extraction element information (603) and web data (501) that obtain the webpage correspondence.
In this diagram, the extraction element information (603) corresponding to the web page specifies, by way of example, the following content extraction rules:
A) The lookup and extraction of web page content is carried out in two rounds.
B) In the first round, the first positioning pointer searches from the start of the lookup area toward its end.
C) The marker regular expression used by the first positioning pointer in the first round.
D) In the first round, the first positioning pointer is placed at the 3rd match.
E) In the first round, the second positioning pointer searches from the end of the lookup area toward its start.
F) The marker regular expression used by the second positioning pointer in the first round.
G) In the first round, the second positioning pointer is placed at the 2nd match.
H) The first-round result area excludes the strings finally matched by the first and second positioning pointers.
I) In the second round, the first positioning pointer searches from the start of the lookup area toward its end.
J) The marker regular expression used by the first positioning pointer in the second round.
K) In the second round, the first positioning pointer is placed at the 4th match.
L) In the second round, the second positioning pointer searches from the end of the lookup area toward its start.
M) The marker regular expression used by the second positioning pointer in the second round.
N) In the second round, the second positioning pointer is placed at the 1st match.
O) The second-round result area excludes the string finally matched by the first positioning pointer but includes the string finally matched by the second.
Note: as a special case, if a marker regular expression contains no wildcard characters, it is a literal string, and the regular expression match is then equivalent to an exact string search.
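One way to picture rules A) to O) is as a small configuration record per positioning pointer. The sketch below is only an illustration: the class names `PointerRule` and `RoundRule` and their field names are assumptions, and the patterns are placeholders, because the patent does not give concrete marker regular expressions for rules C, F, J and M.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PointerRule:
    pattern: str         # marker regular expression (rules C, F, J, M)
    toward_end: bool     # True: scan from the area start toward its end
    match_count: int     # the pointer stops at this match (rules D, G, K, N)
    include_match: bool  # keep the matched string in the result area?


@dataclass(frozen=True)
class RoundRule:
    first: PointerRule
    second: PointerRule


# Placeholder patterns; only the directions, counts and inclusion flags
# are taken from the rule list above.
EXAMPLE_RULES: Tuple[RoundRule, ...] = (
    RoundRule(PointerRule("MARK1", True, 3, False),
              PointerRule("MARK2", False, 2, False)),
    RoundRule(PointerRule("MARK3", True, 4, False),
              PointerRule("MARK4", False, 1, True)),
)

print(len(EXAMPLE_RULES))
```

Keeping the rules as data rather than code matches the option of supplying extraction element information as external, configurable parameters.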
As shown in the figure, the first round uses the complete web page data (200) as the lookup area.
The first positioning pointer starts at the head position (1) of the first-round lookup area and searches toward the end (10). The first 2 matches of the marker regular expression are found at the positions marked (101), and the 3rd match at the position marked (102); the first positioning pointer (3) of this round is therefore placed at the position marked (102).
The second positioning pointer starts at the tail position (2) of the first-round lookup area and searches toward the start (20). The 1st match of the marker regular expression is found at the position marked (103), and the 2nd match at the position marked (104); the second positioning pointer (4) of this round is therefore placed at the position marked (104).
The data segment between the two positioning pointers of the first round (the result area) is taken as the lookup area (201) of the second round. The strings finally matched by the first and second positioning pointers are not included in this round's result area (that is, they are excluded at the edges of the area).
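The placement of one positioning pointer in this round can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `locate` is hypothetical, and a search from the end of the area is approximated by collecting the non-overlapping left-to-right matches and counting them from the last one, which is an assumption about how matches are enumerated.

```python
import re
from typing import Optional


def locate(area: str, pattern: str, count: int,
           toward_end: bool) -> Optional[re.Match]:
    """Return the count-th match of pattern, counted from the start of
    the area (toward_end=True) or from its end (toward_end=False)."""
    matches = list(re.finditer(pattern, area))
    if len(matches) < count:
        return None  # too few matches: the pointer cannot be placed
    return matches[count - 1] if toward_end else matches[-count]


area = "A.A.A[payload]B.B"
first = locate(area, "A", 3, toward_end=True)    # 3rd "A" from the start
second = locate(area, "B", 2, toward_end=False)  # 2nd "B" from the end
# Both matched strings are excluded, as in the first round above.
print(area[first.end():second.start()])
```

Excluding a matched string means slicing from the end of the first match to the start of the second; including it would use `first.start()` or `second.end()` instead.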
The first positioning pointer starts at the head position (3) of the second-round lookup area and searches toward the end (30). The first 3 matches of the marker regular expression are found at the positions marked (105), and the 4th match at the position marked (106); the first positioning pointer (5) of this round is therefore placed at the position marked (106).
The second positioning pointer starts at the tail position (4) of the second-round lookup area and searches toward the start (40). The 1st match of the marker regular expression is found at the position marked (108); the second positioning pointer (6) of this round is therefore placed at the position marked (108).
The data segment between the two positioning pointers of the second round (the result area) is taken as the target area (202) of the data extraction. The string finally matched by the first positioning pointer is not included in the target area (it is excluded at the edge of the area), while the string finally matched by the second positioning pointer is included. Data lookup and extraction are then complete.
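The complete two-round flow can be sketched end to end as below. The page content, the marker expressions and the names `nth` and `extract` are invented for illustration; only the match counts (3rd and 2nd, then 4th and 1st) and the inclusion choices follow the example above. As an assumption, the first pointer of each round always scans from the start and the second from the end.

```python
import re


def nth(matches, count, from_end):
    """Pick the count-th match, counted from the start or from the end."""
    if len(matches) < count:
        raise ValueError("marker matched fewer times than required")
    return matches[-count] if from_end else matches[count - 1]


def extract(data, rounds):
    """Narrow the lookup area round by round; each round is a pair of
    (pattern, match_count, include_match) tuples, one per pointer."""
    area = data
    for (p1, n1, inc1), (p2, n2, inc2) in rounds:
        m1 = nth(list(re.finditer(p1, area)), n1, from_end=False)
        m2 = nth(list(re.finditer(p2, area)), n2, from_end=True)
        lo = m1.start() if inc1 else m1.end()
        hi = m2.end() if inc2 else m2.start()
        if lo > hi:
            raise ValueError("positioning pointers crossed")
        area = area[lo:hi]  # this result area is the next lookup area
    return area


# Invented page data standing in for the web page data (200).
PAGE = ("<div>a</div><div>b</div><div>"
        "<td>1</td><td>2</td><td>3</td><td>42 USD</td>"
        "</div><p>x</p><p>y</p>")

ROUNDS = [
    (("<div>", 3, False), ("<p>", 2, False)),      # round 1
    (("<td>", 4, False), (r"[A-Z]{3}", 1, True)),  # round 2
]

print(extract(PAGE, ROUNDS))  # prints "42 USD"
```

Because the markers come from the page template rather than from the changing content, the same rules keep working when the extracted value itself changes.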
It should be noted that the above embodiment is only a special case of the device of the present invention; many different combinations of extraction element information are possible, and all fall within the scope of the invention. For example, the number of rounds of content lookup and extraction can be any positive integer. As another example, in each round either positioning pointer may search from the start of the lookup area toward its end, or from the end toward its start; the two search directions may be the same or different; and the second positioning pointer may also search in either direction starting from the position of the first positioning pointer (or from a position relative to it). After positioning in a round, the first pointer need not precede the second; the result area is simply the region between the two. The result area may include or exclude the last matched strings. As a further example, the marker regular expressions used by the positioning pointers may be the same or different, and the match counts are positive integers that may be the same or different. In addition, other range-narrowing methods may be inserted between two rounds of lookup, so that the lookup range of subsequent rounds is delimited anew.
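One of these variations, in which the second positioning pointer starts its search at the position of the first, can be sketched as follows. The function name `between` and the sample string are hypothetical; both pointers use a match count of 1 and both matched strings are excluded.

```python
import re
from typing import Optional


def between(area: str, first_pat: str, second_pat: str) -> Optional[str]:
    """Place the first pointer at the first match of first_pat, then search
    for second_pat toward the end, starting where the first match ended."""
    m1 = re.search(first_pat, area)
    if m1 is None:
        return None
    # Pattern.search accepts a start position as its second argument.
    m2 = re.compile(second_pat).search(area, m1.end())
    if m2 is None:
        return None
    return area[m1.end():m2.start()]


print(between("name: Bob; price: 42; stock: 7;", r"price:\s*", r";"))
```

Anchoring the second search to the first pointer avoids counting delimiters across the whole area, which helps when the number of earlier fields varies between pages.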

Claims (8)

1. A technique for accurately extracting specific dynamic content from a dynamically generated web page, characterized in that: within a given lookup area of web page data, two positioning pointers are placed in the area by search operations, and the sub-area between the two positioning pointers is taken as the result area. The result area may serve as the target data of content lookup and extraction. Alternatively, the process may be applied repeatedly, with the result area obtained in one round used as the lookup area of the next round, until the designated round is reached, whose result area is the target data of content lookup and extraction.
2. The dynamic web page content extraction technique according to claim 1, characterized in that: a positioning pointer is placed in the lookup area by matching a marker regular expression sequentially, in a specified direction, from a specified position. The starting position may be the start or the end of the lookup area; for the second positioning pointer of a round, it may also be the position of the first positioning pointer or a position relative to it. The search direction may be toward the start or toward the end. Each positioning pointer has a predetermined match count. Until that count is reached, the search continues in the same direction; the position of the match that reaches the predetermined count is the position of the pointer. As a special case, if a marker regular expression contains no wildcard characters, it is a literal string, and the regular expression match is then equivalent to an exact string search.
3. The dynamic web page content extraction technique according to claim 1, further comprising: when positioning pointers are used to form a result area, it must be specified or agreed whether the string matched by the marker regular expression at each pointer's final position is included in the result area.
4. The dynamic web page content extraction technique according to claim 1, further comprising: as a special case, a round of lookup may use only one positioning pointer, in which case the start or the end of the lookup area may, by agreement, be treated as the position of the second positioning pointer.
5. The dynamic web page content extraction technique according to claim 1, further comprising: before this extraction technique is applied, other extraction techniques may be used to pre-extract from the original web page data, with their result serving as the input of this technique; after this technique has finished extracting from the web page data, other extraction techniques may be applied to its result for further extraction; when multiple rounds of lookup are used, other extraction techniques may also be applied between rounds to further narrow the result area of the previous round before it serves as the lookup area of the next round, provided that the result of such a technique is a single area or can be merged into a single area.
6. The dynamic web page content extraction technique according to claim 1, characterized in that: the round sequence information and the positioning pointer element information used in each round (collectively, the content extraction element information) may come wholly or partly from configurable external parameters, or may be wholly or partly agreed in advance and implemented as program code in software or hardware.
7. The dynamic web page content extraction technique according to claim 1, characterized in that: the content extraction element information is obtained by analyzing the template of the target dynamic web page in advance; when the template data is unavailable, the template can be inferred approximately by observing the dynamic web page data it generates, and the content extraction element information obtained by analysis.
8. The dynamic web page content extraction technique according to claim 1, further comprising: when content is extracted from source web page data, certain characteristics of the additional information about the web page access, or of the web page data itself, may be used to find the corresponding content extraction element information in a pre-agreed manner, and the web page data is extracted in the manner that information specifies. Different content extraction element information may be found and used for web page data from different sources.
CN200810094188.5A 2008-05-08 2008-05-08 Technical scheme for extracting dynamic generation web page contents Expired - Fee Related CN101576885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810094188.5A CN101576885B (en) 2008-05-08 2008-05-08 Technical scheme for extracting dynamic generation web page contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810094188.5A CN101576885B (en) 2008-05-08 2008-05-08 Technical scheme for extracting dynamic generation web page contents

Publications (2)

Publication Number Publication Date
CN101576885A true CN101576885A (en) 2009-11-11
CN101576885B CN101576885B (en) 2012-02-22

Family

ID=41271819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810094188.5A Expired - Fee Related CN101576885B (en) 2008-05-08 2008-05-08 Technical scheme for extracting dynamic generation web page contents

Country Status (1)

Country Link
CN (1) CN101576885B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186640A (en) * 2011-12-31 2013-07-03 百度在线网络技术(北京)有限公司 AC algorithm based regular matching flow filtering method and device
CN104486154A (en) * 2014-12-12 2015-04-01 北京国双科技有限公司 Data lead-in method and device
CN106649392A (en) * 2015-11-03 2017-05-10 任子行网络技术股份有限公司 Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology
CN107870951A (en) * 2016-09-28 2018-04-03 珠海金山办公软件有限公司 The jump method and device of a kind of document file page
CN114676330A (en) * 2022-03-30 2022-06-28 南京厚建软件有限责任公司 Method for uniformly recovering interactive data of Internet platform

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101094194B (en) * 2006-06-19 2010-06-23 腾讯科技(深圳)有限公司 Method for picking up web information needed by user in web page
CN100461183C (en) * 2007-07-10 2009-02-11 北京大学 Metadata automatic extraction method based on multiple rule in network search
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186640A (en) * 2011-12-31 2013-07-03 百度在线网络技术(北京)有限公司 AC algorithm based regular matching flow filtering method and device
CN104486154A (en) * 2014-12-12 2015-04-01 北京国双科技有限公司 Data lead-in method and device
CN104486154B (en) * 2014-12-12 2017-12-19 北京国双科技有限公司 The introduction method and device of data
CN106649392A (en) * 2015-11-03 2017-05-10 任子行网络技术股份有限公司 Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology
CN107870951A (en) * 2016-09-28 2018-04-03 珠海金山办公软件有限公司 The jump method and device of a kind of document file page
CN114676330A (en) * 2022-03-30 2022-06-28 南京厚建软件有限责任公司 Method for uniformly recovering interactive data of Internet platform
CN114676330B (en) * 2022-03-30 2023-12-08 南京厚建软件有限责任公司 Method for uniformly recovering interactive data of Internet platform

Also Published As

Publication number Publication date
CN101576885B (en) 2012-02-22


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120222

Termination date: 20150508

EXPY Termination of patent right or utility model