CN103714116A - Webpage information extracting method and webpage information extracting equipment - Google Patents

Publication number
CN103714116A
Authority
CN
China
Prior art keywords
information
page
decimation rule
webpage
info web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310529500.XA
Other languages
Chinese (zh)
Inventor
徐锐波
付赟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201310529500.XA
Publication of CN103714116A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web

Abstract

The invention provides a webpage information extraction method and webpage information extraction equipment. The method comprises the following steps: obtaining an extraction rule that is automatically generated according to webpage content; and extracting webpage information using the extraction rule. According to an embodiment of the invention, the error rate caused by manual extraction of webpage information in the prior art is addressed and the cost of extracting webpage information is reduced. Moreover, the prior-art problem that the extraction rule on which extraction relies cannot be updated in real time is solved, and the accuracy of webpage information extraction is improved.

Description

Webpage information extraction method and equipment
Technical field
The present invention relates to the field of Internet applications, and in particular to a webpage information extraction method and equipment.
Background
Webpage information extraction is a technology for extracting target information from webpages, that is, for extracting valuable information from the natural-language text and structured data of a webpage.
In the prior art, webpage information is extracted manually: programmers inspect a webpage and its source code, identify certain patterns, and then write code based on those patterns to extract the valuable information. To simplify the extraction process, programmers have built several pattern-specification languages and user interfaces for them.
However, this manual approach has at least two deficiencies. First, rules must be written by hand for every website; when a large number of websites need to be crawled, manually deriving and coding extraction rules is error-prone and too costly. Second, when a website's page structure changes, the original rules stop working and must be manually re-derived and re-coded. Because changes in page structure are not discovered promptly by hand, the extraction rules on which webpage information extraction relies cannot be updated in real time, which reduces the accuracy of extraction.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a webpage information extraction method, and corresponding webpage information extraction equipment, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a webpage information extraction method is provided, comprising: obtaining an extraction rule automatically generated according to webpage content; and extracting webpage information using the extraction rule. The extraction rule is generated as follows: the webpage content is automatically analyzed to find valuable information, where the valuable information includes changeable information within the webpage framework; the valuable information is identified, and the corresponding extraction rule is automatically learned and generated.
Optionally, extracting webpage information using the extraction rule comprises: determining the position of the extractable webpage information using position identification information in the extraction rule, where the extraction rule includes the position identification information; and extracting the webpage information item by item according to the determined position.
Optionally, the position identification information identifies the start position and the end position of the extractable webpage information.
Optionally, the method further comprises: when the webpage framework changes, automatically analyzing the new webpage framework and updating the extraction rule.
Optionally, the webpage comprises a list page and/or a detail page.
Optionally, the valuable information in a list page comprises: information that differs between different list pages; or information that differs between different entries of the same list.
Optionally, automatically analyzing the webpage content to find valuable information comprises: searching for difference regions between different list pages, where a difference region contains the information that differs between those list pages; and taking the longest difference region as the list area, the information recorded in the list area being valuable information.
Optionally, automatically analyzing the webpage content to find valuable information comprises: comparing the multiple entries in the list area; and recording the entries that differ and taking them as valuable information.
Optionally, the valuable information in a detail page comprises: information whose value remains unchanged within a specified duration, where such information at least includes information that carries a certain amount of information content and information through which other links can be accessed; or information that differs between different detail pages.
Optionally, before the webpage content is automatically analyzed, the method further comprises: performing in-page denoising on the webpage to be parsed.
According to another aspect of the present invention, webpage information extraction equipment is provided, comprising: a rule generation module, configured to automatically analyze webpage content and find valuable information, where the valuable information includes changeable information within the webpage framework, and to identify the valuable information, learn from it, and generate the corresponding extraction rule; an acquisition module, configured to obtain the extraction rule automatically generated according to webpage content; and an extraction module, configured to extract webpage information using the extraction rule.
Optionally, the rule generation module is further configured to automatically analyze the new webpage framework and update the extraction rule when the webpage framework changes.
Optionally, the rule generation module is further configured to perform in-page denoising on the webpage to be parsed before automatically analyzing the webpage content.
According to the embodiments of the present invention, an extraction rule automatically generated according to webpage content can be obtained and used to extract webpage information. This addresses the error rate that arises in the prior art when extraction rules are derived and webpage information is extracted manually, and reduces the cost of extracting webpage information. In addition, because the embodiments generate the extraction rule automatically, they solve the prior-art problem that, when the page structure changes, the change is not discovered promptly by hand and the extraction rule cannot be updated in real time. The accuracy of webpage information extraction is thereby improved.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments of the invention, taken together with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a processing flowchart of a webpage information extraction method according to an embodiment of the invention;
Fig. 2 shows a processing flowchart of a webpage information extraction method according to a preferred embodiment of the invention;
Fig. 3 shows a processing flowchart of a method for obtaining difference regions according to a preferred embodiment of the invention;
Fig. 4 shows a processing flowchart of a method for dividing entries according to a preferred embodiment of the invention;
Fig. 5 shows a processing flowchart of a webpage information extraction method according to another preferred embodiment of the invention; and
Fig. 6 shows a structural diagram of webpage information extraction equipment according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope fully conveyed to those skilled in the art.
As noted for the related art, manual extraction of webpage information has a certain error rate. Moreover, when the page structure changes, the change is not discovered promptly by hand, so the extraction rule on which extraction relies cannot be updated in real time, which in turn reduces the accuracy of webpage information extraction.
To solve the above technical problems, an embodiment of the invention provides a webpage information extraction method. Fig. 1 shows a processing flowchart of the method according to an embodiment of the invention. As shown in Fig. 1, the flow starts at step S102: obtaining an extraction rule automatically generated according to webpage content. In the embodiments of the invention, the extraction rule is generated as follows: the webpage content is automatically analyzed to find valuable information, where the valuable information includes changeable information within the webpage framework. Once found, the valuable information is identified, and the corresponding extraction rule is generated by learning from it. After the extraction rule is obtained, step S104 of Fig. 1 is executed: webpage information is extracted using the extraction rule.
According to the embodiments of the present invention, an extraction rule automatically generated according to webpage content can be obtained and used to extract webpage information. This addresses the error rate that arises in the prior art when extraction rules are derived and webpage information is extracted manually, and reduces the cost of extracting webpage information. In addition, because the embodiments generate the extraction rule automatically, they solve the prior-art problem that, when the page structure changes, the change is not discovered promptly by hand and the extraction rule cannot be updated in real time. The accuracy of webpage information extraction is thereby improved.
Specifically, Fig. 2 shows a processing flowchart of a webpage information extraction method according to a preferred embodiment of the invention. In the embodiments of the invention, webpages can be divided by their content into list pages and detail pages: a list page presents multiple entries, and a detail page presents specific information. The extraction flows for list pages and detail pages are described in turn. The webpages involved in the flow of Fig. 2 are list pages. As shown in step S202 of Fig. 2, the flow starts by obtaining the uniform resource locators (Uniform Resource Locator, hereinafter url) of multiple list pages.
After step S202, as shown in Fig. 2, step S204 is executed: the list pages are downloaded according to their urls and then denoised.
Because web data is semi-structured, dispersed, and heterogeneous, it is usually not managed in a unified way, and its layout style and content change quickly. In the embodiments of the invention, a webpage is therefore denoised before it is automatically analyzed for valuable information. Denoising removes worthless information (for example, web advertisements) from the webpage content so that webpage information can be extracted more efficiently in subsequent operations.
Preferably, in the embodiments of the invention, a list page is denoised as follows: the url of the same list page is downloaded multiple times, and the data of the downloaded pages are compared to obtain a comparison result. Based on the comparison result, the parts of the list page that differ at the same level are deleted, which removes worthless information such as web advertisements and yields the denoised list page.
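The denoising step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two downloads are well-formed markup parsed with Python's xml.etree.ElementTree (real-world HTML would need a tolerant parser), nodes are matched by position, and the function name denoise is hypothetical rather than taken from the patent.

```python
import xml.etree.ElementTree as ET

def denoise(first: ET.Element, second: ET.Element) -> ET.Element:
    """Copy `first`, keeping only children that match `second` at the same position.

    Children whose tag or own text differs between two downloads of the same
    url are treated as noise (for example, rotating advertisements) and dropped.
    """
    kept = ET.Element(first.tag, dict(first.attrib))
    kept.text = first.text
    for a, b in zip(list(first), list(second)):
        if a.tag != b.tag or (a.text or "").strip() != (b.text or "").strip():
            continue  # differs between downloads: assumed to be noise
        kept.append(denoise(a, b))
    return kept

# Two downloads of the same (hypothetical) list page; only the ad changes.
d1 = ET.fromstring("<body><div>News</div><div>Ad 17</div></body>")
d2 = ET.fromstring("<body><div>News</div><div>Ad 42</div></body>")
clean = denoise(d1, d2)
print([child.text for child in clean])
```

Comparing more than two downloads would follow the same pattern, keeping a node only if it matches in every copy.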
After the list pages are denoised, the valuable information in them is searched for, where the valuable information includes changeable information within the webpage framework. In the embodiments of the invention, a list page is composed of multiple entries that share similar structural features. The valuable information in a list page includes information that differs between different list pages, and also information that differs between different entries of the same list page. To find it, step S206 of Fig. 2 is executed to obtain the difference regions between list pages. These difference regions contain the differing information present in the different list pages, that is, the valuable information in the list pages. After the difference regions are obtained, step S208 is executed: the longest difference region is taken as the list area, and the information recorded in the list area is valuable information.
Because the valuable information in a list page also includes information that differs between entries of the same list, the flow continues with step S210 of Fig. 2, which divides the list area into multiple entries. Step S212 then compares the entries in the list area and records the entries that differ, taking them as valuable information.
In summary, the embodiments of the invention find valuable information by obtaining the difference regions between list pages and dividing the list area into multiple entries, and then learn from the valuable information to generate the corresponding extraction rule. Preferably, the valuable information is passed to an automatic learning program, which learns from it and generates the corresponding extraction rule. The automatic learning program can run on any machine, such as a computer, a server, or a mobile terminal; the embodiments of the invention do not limit this.
As described above, once the valuable information is obtained, the automatic learning program can learn from it and generate the corresponding extraction rule. In the embodiments of the invention, a center machine stores the extraction rule and the corresponding extraction task, where the extraction task includes a description of the webpage information to be extracted. After storing them, the center machine sends the extraction rule and the corresponding extraction task to work machines according to a preset scheduling policy.
In the embodiments of the invention, the number of work machines is determined by the number of webpages from which information needs to be extracted. When that number is large, multiple work machines can receive the extraction rules and corresponding extraction tasks sent by the center machine. This ensures that information can be extracted promptly even from a large number of webpages, and reduces the error rate of extraction.
In addition, when the webpage framework changes, the embodiments of the invention can automatically analyze the new webpage framework, update the extraction rule, and update the rule stored on the center machine in real time. The center machine can then send the updated rule to the work machines in real time, and the work machines extract webpage information according to the updated rule. After a work machine obtains an extraction result, it sends the result to a next machine, which performs operations configured in advance by the center machine, such as deduplication, classification, or removal of redundancy, so that the webpage information extracted using the extraction rule is more concise and accurate.
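As a rough illustration of the machine roles described above, the sketch below distributes extraction tasks over work machines and deduplicates the collected results. The patent does not specify the scheduling policy or the deduplication method; round-robin assignment and order-preserving exact-match deduplication are assumptions made here, and the function names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterable, List

def dispatch(tasks: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Assign tasks to work machines in round-robin order (assumed policy)."""
    plan: Dict[str, List[str]] = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        plan[worker].append(task)
    return plan

def deduplicate(results: Iterable[str]) -> List[str]:
    """Drop repeated extraction results while keeping first-seen order."""
    return list(dict.fromkeys(results))

# Hypothetical task urls and worker names.
plan = dispatch(["url-1", "url-2", "url-3"], ["worker-a", "worker-b"])
print(plan)
print(deduplicate(["www.so.com", "search", "www.so.com"]))
```

With more webpages, more entries are simply added to the workers list, which mirrors the statement that the number of work machines follows the number of webpages.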
When extracting webpage information with an extraction rule, the embodiments of the invention use the position identification information in the rule to extract the webpage information item by item, where the position identification information identifies the start position and the end position of the webpage information. For example, suppose the HyperText Markup Language (hereinafter html) code of a webpage is: <html><body><a href='www.so.com'>search</a></body></html>. When the user needs to extract both the url address behind the word "search" and the value of the "search" field itself, the extraction rules generated by automatic learning can be: rule 1, with position identification information 1 "href='" and position identification information 2 "'"; and rule 2, with position identification information 1 ">" and position identification information 2 "<".
After these rules are generated, rule 1 extracts "www.so.com" between href=' and ', which is the url address behind the word "search"; rule 2 extracts "search" between > and <, which is the value of the "search" field itself.
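The two rules in this example can be applied with plain string search. The sketch below is a minimal illustration, not the patent's implementation: the Rule type and extract function are hypothetical names, and skipping empty matches is an added assumption, needed because rule 2 as literally stated also matches the empty gaps between adjacent tags.

```python
from typing import List, NamedTuple

class Rule(NamedTuple):
    start: str  # position identification information 1 (start marker)
    end: str    # position identification information 2 (end marker)

def extract(html: str, rule: Rule) -> List[str]:
    """Return every non-empty substring found between the start and end markers."""
    results: List[str] = []
    pos = 0
    while True:
        begin = html.find(rule.start, pos)
        if begin == -1:
            break
        begin += len(rule.start)
        stop = html.find(rule.end, begin)
        if stop == -1:
            break
        if stop > begin:  # assumption: empty matches carry no information
            results.append(html[begin:stop])
        pos = stop + len(rule.end)
    return results

page = "<html><body><a href='www.so.com'>search</a></body></html>"
print(extract(page, Rule("href='", "'")))  # rule 1
print(extract(page, Rule(">", "<")))       # rule 2
```

In practice a learned rule would likely use longer, more specific markers than ">" and "<" so that unrelated text between tags is not picked up.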
As mentioned above, before webpage information is extracted with an extraction rule, the valuable information must be found and used to generate the rule. When the webpage is a list page, the valuable information includes information that differs between different list pages, or information that differs between different entries of the same list page. The method of finding valuable information in list pages is now described in detail.
Fig. 3 shows a processing flowchart of a method for obtaining the difference regions between list pages according to a preferred embodiment of the invention. As shown in Fig. 3, the algorithm provided in this preferred embodiment obtains the difference regions between list pages: its input is multiple original list pages, and its output is the difference regions of the list pages. The input and output given here are only an example and do not limit the algorithm or the information extraction flow itself.
As shown in Fig. 3, the flow starts at step S302, which loads the list pages. After multiple list pages are loaded, step S304 obtains the Document Object Model (hereinafter dom) tree structure of every page. The tree structure is composed of multiple nodes (a single node is referred to as node, and a set of nodes as nodes).
At step S306 of Fig. 3, after nodes is obtained, it is judged whether the nodes are completely equal. If so, the flow returns to step S304 and continues obtaining the tree structure; if not, step S308 judges whether the nodes have the same number of children. If the numbers of children are not equal, step S312 inserts all the nodes into the difference region; if they are equal, step S310 judges whether the children of the nodes have the same tags and text.
As shown in Fig. 3, if the result of step S310 is no, the children of the nodes differ, and step S312 inserts the children of all the nodes into the difference region. If the result is yes, step S314 takes the children at the same position in each of the nodes as the new nodes and repeats the operation recursively, until all difference regions of the list pages are obtained.
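One possible reading of the Fig. 3 flow, together with the choice of the longest difference region from step S208, is sketched below. Several details are assumptions not fixed by the text: pages are well-formed markup parsed with xml.etree.ElementTree, children are compared position by position, identical subtrees are simply skipped, and "longest" is measured by serialized length. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

Region = List[ET.Element]  # nodes at the same position, one per page

def serialize(node: ET.Element) -> str:
    return ET.tostring(node, encoding="unicode")

def own_text(node: ET.Element) -> str:
    return (node.text or "").strip()

def diff_regions(nodes: Region, out: List[Region]) -> None:
    # S306: subtrees identical on every page contain no differing information.
    if len({serialize(n) for n in nodes}) == 1:
        return
    # S308/S312: unequal child counts (or differing leaves) form a region.
    counts = {len(n) for n in nodes}
    if len(counts) != 1 or counts == {0}:
        out.append(nodes)
        return
    for i in range(len(nodes[0])):
        children = [n[i] for n in nodes]
        if len({(c.tag, own_text(c)) for c in children}) == 1:
            diff_regions(children, out)  # S310 yes, S314: recurse
        else:
            out.append(children)         # S310 no, S312: record region

def list_area(pages: List[str]) -> Region:
    """S208: take the longest difference region (by serialized length)."""
    out: List[Region] = []
    diff_regions([ET.fromstring(p) for p in pages], out)
    if not out:
        return []
    return max(out, key=lambda r: sum(len(serialize(n)) for n in r))

p1 = "<html><body><div>nav</div><ul><li>a</li><li>b</li></ul></body></html>"
p2 = "<html><body><div>nav</div><ul><li>c</li><li>d</li><li>e</li></ul></body></html>"
print([node.tag for node in list_area([p1, p2])])
```

Here the shared navigation block is skipped because it is identical on both pages, while the two lists, which have different numbers of children, are reported as the difference region.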
Through the flow of Fig. 3, the embodiments of the invention obtain the difference regions between list pages and take the longest difference region as the list area, thereby obtaining the information that differs between different list pages. In addition, the list area is divided into multiple entries, and the information that differs between entries is also valuable information. Fig. 4 shows a processing flowchart of a method for dividing the list area into multiple entries according to a preferred embodiment of the invention. As shown in Fig. 4, the algorithm provided in this preferred embodiment divides the list area into entries: its input is the list area, and its output is multiple entries. The input and output given here are only an example and do not limit the algorithm or the information extraction flow itself.
As shown in Fig. 4, step S402 first obtains the list area described above. Step S404 then obtains the dom tree structure of the list area and the nodes that form it. In step S406, each node of the dom tree structure is added to the array p_nodes. Step S408 then judges whether the p_nodes array is empty; the array is meaningful only when it is not empty. When p_nodes is not empty, step S410 takes one element from the end of the p_nodes array and denotes it p_node, and step S412 judges whether p_node has children.
At step S412 of Fig. 4, if the result is no, the flow returns to step S408; if yes, step S414 judges whether p_node has more than one child. If p_node has no more than one child, step S416 adds the children of p_node to p_nodes and the flow returns to step S408. If p_node has more than one child, step S418 divides the children of p_node into groups, each consisting of N adjacent children, with every child belonging to exactly one group. After step S418, as shown in Fig. 4, step S420 calculates the similarity of any two adjacent groups and judges whether it meets the threshold.
In step S420, if the similarity does not meet the threshold, step S422 sets N to N minus 1 and the flow returns to step S418; if it does, step S424 is executed. Step S424 further subdivides the two groups that meet the similarity threshold to find the smallest segment that still meets it, that is, an entry, where the number of elements in an entry is n. After step S424, step S426 uses n as the basis to expand outward in both directions from the edges of the similar groups, finds all entries, and ends the operation. Following the flowchart of Fig. 4, the embodiments of the invention can divide the list area into multiple entries, find the valuable information of the list page, and extract webpage information.
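The grouping idea in steps S418 to S426 can be illustrated with a simplified variant. Instead of starting from a group size N, decrementing it, and then subdividing, the sketch below searches directly for the smallest group size whose adjacent groups are all sufficiently similar, which aims at the same entry size n for regularly repeating lists. The similarity measure (difflib.SequenceMatcher over child tag names), the threshold of 0.8, and the function names are assumptions; the patent does not specify them.

```python
from difflib import SequenceMatcher
from typing import List, Sequence

def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity of two groups of child tags, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, list(a), list(b)).ratio()

def entry_size(children: Sequence[str], threshold: float = 0.8) -> int:
    """Smallest group size n whose adjacent groups all meet the threshold."""
    for n in range(1, len(children) // 2 + 1):
        groups = [children[i:i + n] for i in range(0, len(children) - n + 1, n)]
        if all(similarity(g, h) >= threshold for g, h in zip(groups, groups[1:])):
            return n
    return len(children)  # no repetition found: treat everything as one entry

def split_entries(children: Sequence[str], threshold: float = 0.8) -> List[List[str]]:
    """Divide the children of a list area into entries of the detected size."""
    if not children:
        return []
    n = entry_size(children, threshold)
    return [list(children[i:i + n]) for i in range(0, len(children), n)]

# Hypothetical list area whose entries are a heading followed by a paragraph.
tags = ["h3", "p", "h3", "p", "h3", "p"]
print(split_entries(tags))
```

A fuller version would compare whole subtrees rather than tag names, and would handle leading or trailing children that do not belong to any entry.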
In the embodiments of the invention, webpages include list pages and/or detail pages. Several preferred embodiments for extracting webpage information from list pages have been described above; the method for extracting webpage information from detail pages is described below.
Fig. 5 shows a processing flowchart of a webpage information extraction method according to another embodiment of the invention. In this preferred embodiment the webpage is a detail page, that is, a page whose valuable information does not change within a certain period of time. As shown in Fig. 5, the flow starts at step S502, which obtains the urls of multiple detail pages. Step S504 then performs in-page denoising on each detail page. In the embodiments of the invention, in-page denoising of a detail page works as follows: the url of the same detail page is downloaded multiple times, the root of a dom tree structure is generated for each downloaded page, and all descendant nodes under the roots are compared. If nodes at the same level differ, the node is deleted from the dom tree. The dom tree obtained in this way is the dom tree of the denoised detail page.
After step S504, as shown in Fig. 5, step S506 obtains the difference information between detail pages; this difference information is the valuable information in the detail pages.
As described above, the embodiments of the invention can obtain the valuable information in detail pages. They then learn from this valuable information and generate the corresponding extraction rule. In the embodiments of the invention, a center machine stores the extraction rule and the corresponding extraction task, where the extraction task includes a description of the webpage information to be extracted. After storing them, the center machine sends the extraction rule and the corresponding extraction task to work machines according to a preset scheduling policy.
In the embodiments of the invention, the number of work machines is determined by the number of webpages from which information needs to be extracted. When that number is large, multiple work machines receive the extraction rules and corresponding extraction tasks sent by the center machine. This ensures that information can be extracted promptly even from a large number of webpages, and reduces the error rate of extraction.
In addition, when the webpage framework changes, the embodiments of the invention can automatically analyze the new webpage framework, update the extraction rule, and update the rule stored on the center machine in real time. The center machine can then send the updated rule to the work machines in real time, and the work machines extract webpage information according to the updated rule. After a work machine obtains an extraction result, it sends the result to a next machine, which performs operations configured in advance by the center machine, such as deduplication, classification, or removal of redundancy, so that the webpage information extracted using the extraction rule is more concise and accurate.
In the embodiments of the invention, the valuable information in a detail page can include information whose value remains unchanged within a specified duration, and can also include information that differs between different detail pages. Information whose value remains unchanged within the specified duration includes information that carries a certain amount of information content, and information through which other links can be accessed.
For example, detail page one contains information that leads to political news when clicked, and detail page two contains information that leads to entertainment news when clicked. Both pieces of information carry a certain amount of information content, and clicking either one leads to the political news or entertainment news website, so both are valuable information in their detail pages. A detail page may also contain embedded advertisements. Each time a user visits the page, the advertisement information is different, so this kind of information is not considered valuable.
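The example above can be sketched by treating each download of a detail page as a flat list of text fragments. This simplification, the function names, and the decision to require both criteria at once (stable across repeated downloads of the same url, and different from the other detail page) are assumptions made for illustration; the text presents the two criteria as alternatives.

```python
from typing import List

def stable_texts(downloads: List[List[str]]) -> List[str]:
    """Texts that appear in every download of the same detail-page url."""
    common = set(downloads[0]).intersection(*downloads[1:])
    return [text for text in downloads[0] if text in common]

def valuable_texts(page: List[List[str]], other: List[List[str]]) -> List[str]:
    """Stable texts of `page` that do not also appear as stable texts of `other`."""
    other_stable = set(stable_texts(other))
    return [text for text in stable_texts(page) if text not in other_stable]

# Hypothetical downloads: each detail page is fetched twice, and the ad rotates.
page_one = [["Site name", "Political news", "Ad 1"],
            ["Site name", "Political news", "Ad 2"]]
page_two = [["Site name", "Entertainment news", "Ad 3"],
            ["Site name", "Entertainment news", "Ad 4"]]
print(valuable_texts(page_one, page_two))
print(valuable_texts(page_two, page_one))
```

The rotating advertisements drop out because they are not stable, and the shared site name drops out because it does not differ between the two detail pages.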
In addition, when webpage framework changes, the embodiment of the present invention can the new webpage framework of automatic analysis, upgrade decimation rule, guarantee in the process of utilizing decimation rule to extract info web, decimation rule can upgrade in time according to webpage framework, further guarantees accuracy and high efficiency that info web extracts.
Based on the webpage information extraction method provided by the preferred embodiments above, and on the same inventive concept, the embodiment of the present invention provides webpage information extraction equipment for implementing the above method. Fig. 6 shows a schematic structural diagram of the webpage information extraction equipment according to an embodiment of the invention. Referring to Fig. 6, the webpage information extraction equipment includes at least: a rule generation module 610, an acquisition module 620 and an extraction module 630.
The functions of the components of the webpage information extraction equipment, and the connections between them, are now described:
The rule generation module 610 is configured to automatically analyze webpage content and find valuable information, where the valuable information includes the information in the webpage framework that can change; and to identify the valuable information, learn from it, and generate the corresponding extraction rule.
The acquisition module 620 is coupled to the rule generation module 610 and is configured to obtain the extraction rule that the rule generation module 610 generates automatically from the webpage content.
The extraction module 630 is coupled to the acquisition module 620 and is configured to extract webpage information using the extraction rule obtained by the acquisition module 620.
According to the embodiment of the present invention, an extraction rule automatically generated from webpage content can be obtained and used to extract webpage information. This solves the prior-art problem that obtaining extraction rules manually leads to a certain error rate in the extracted webpage information, and it reduces the cost of extraction. In addition, because the extraction rule is generated automatically, the embodiment solves the prior-art problem that, when the page structure changes and the change is not noticed manually in time, the extraction rule cannot be updated in real time. This improves the accuracy of webpage information extraction.
Specifically, the rule generation module 610 shown in Fig. 6 obtains the URLs of multiple webpages, downloads the webpages from those URLs, and denoises them. Webpage data is semi-structured, scattered and heterogeneous, so it is usually not managed in a unified way, and its layout and content change quickly. For this reason, the embodiment of the present invention denoises a webpage first, before automatically analyzing it for valuable information. Denoising removes worthless information from the webpage content (for example, web advertisements), so that webpage information can be extracted more efficiently in subsequent operations.
Preferably, in the embodiment of the present invention, a webpage is denoised as follows: the URL of the same webpage is downloaded multiple times, and the data of the downloaded pages is compared to obtain a comparison result. According to the comparison result, the parts that differ within content at the same level of the webpage are deleted. This removes worthless information such as web advertisements and yields the denoised webpage.
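The denoising step can be illustrated with a short sketch. It is a simplification under stated assumptions: the text compares "content at the same level" of the page, which suggests a structural comparison, whereas this sketch compares plain text lines. The function name `denoise` and the line-based comparison are illustrative choices, not part of the source.

```python
def denoise(downloads):
    """Keep only the lines of the first download that occur in every copy.

    `downloads` is a list of HTML strings obtained by fetching the same URL
    several times. Lines that vary between copies (for example rotating
    advertisements) are treated as noise and dropped.
    Assumption: line-level comparison stands in for the structural,
    same-level comparison described in the text.
    """
    if not downloads:
        raise ValueError("at least one download is required")
    first, *rest = [page.splitlines() for page in downloads]
    stable = set(first)
    for lines in rest:
        stable &= set(lines)
    return "\n".join(line for line in first if line in stable)
```

For two copies of a page that differ only in one advertisement line, the result keeps the shared heading and body lines and drops the advertisement.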
After the webpage is denoised, the rule generation module 610 searches it for valuable information, where the valuable information includes the information in the webpage framework that can change. In the embodiment of the present invention, webpages include detail pages and list pages. The valuable information of a webpage includes information that differs between different list pages, information that differs between different entries of the same list page, and the valuable information in detail pages.
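As a rough illustration of finding the information that differs between list pages, the sketch below follows the idea stated in the claims of taking the longest difference region between two list pages as the list region. The text does not say how difference regions are computed; using Python's standard `difflib.SequenceMatcher` on text lines is an assumption made here, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def longest_difference_region(page_a, page_b):
    """Return the longest run of lines in page_a that differs from page_b.

    Both arguments are (already denoised) HTML strings of two list pages.
    Assumption: a line-based diff approximates the difference regions
    described in the text.
    """
    a, b = page_a.splitlines(), page_b.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    best = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical lines belong to the fixed page framework
        region = a[i1:i2]
        if len(region) > len(best):
            best = region
    return best
```

Shorter difference regions, such as a visit counter that changes between pages, are ignored in favor of the longest one, which is assumed to hold the list entries.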
After finding the valuable information in the webpage, the rule generation module 610 identifies it, learns from it, and finally generates the corresponding extraction rule. After generating the extraction rule, the rule generation module 610 triggers the acquisition module 620, which obtains the automatically generated rule. Then, as shown in Fig. 6, the extraction module 630 uses the extraction rule obtained by the acquisition module 620 to extract webpage information. In addition, when the webpage framework changes, the embodiment of the present invention automatically analyzes the new webpage framework and updates the extraction rule, so that the rule follows the webpage framework while it is in use. This further ensures the accuracy and efficiency of webpage information extraction.
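The extraction step performed by the extraction module can be sketched as below. The claims state that an extraction rule carries location identification information marking the start position and end position of the extractable information, but they do not define its format. Representing it as a pair of literal marker strings, and the names `ExtractionRule` and `extract`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    start: str  # text immediately before an extractable item (assumed format)
    end: str    # text immediately after an extractable item (assumed format)


def extract(html, rule):
    """Extract every item located between the rule's start and end markers."""
    if not rule.start or not rule.end:
        raise ValueError("both markers must be non-empty")
    items = []
    pos = 0
    while True:
        begin = html.find(rule.start, pos)
        if begin == -1:
            break
        begin += len(rule.start)
        finish = html.find(rule.end, begin)
        if finish == -1:
            break
        items.append(html[begin:finish])
        pos = finish + len(rule.end)  # continue after this item
    return items
```

The loop walks the page from left to right and collects items one by one, which mirrors the item-by-item extraction at the determined positions.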
According to any one of the above preferred embodiments, or a combination of several of them, the embodiment of the present invention can achieve the following beneficial effects:
According to the embodiment of the present invention, an extraction rule automatically generated from webpage content can be obtained and used to extract webpage information. This solves the prior-art problem that obtaining extraction rules manually leads to a certain error rate in the extracted webpage information, and it reduces the cost of extraction. Because the extraction rule is generated automatically, the embodiment also solves the prior-art problem that, when the page structure changes and the change is not noticed manually in time, the extraction rule cannot be updated in real time. This improves the accuracy of webpage information extraction.
In addition, the embodiment of the present invention can greatly reduce the work of manually inspecting webpage source code and writing extraction rules, and it avoids the errors that occur in manually written rules. When the webpage structure changes, the embodiment can extract valuable information automatically, without human involvement, which further reduces labor cost and the losses caused by extracting wrong information.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the equipment of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the webpage information extraction equipment according to the embodiment of the present invention. The invention can also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order. These words can be interpreted as names.
Thus far, those skilled in the art will recognize that, although multiple exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from the disclosure without departing from its spirit and scope. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.

Claims (13)

1. A webpage information extraction method, comprising:
obtaining an extraction rule automatically generated from webpage content;
extracting webpage information using the extraction rule;
wherein the extraction rule is generated as follows:
automatically analyzing the webpage content and finding valuable information, wherein the valuable information includes the information in the webpage framework that can change;
identifying the valuable information, and automatically learning and generating the corresponding extraction rule.
2. The method according to claim 1, wherein extracting webpage information using the extraction rule comprises:
determining the position of the extractable webpage information using location identification information in the extraction rule, wherein the extraction rule includes the location identification information;
extracting the webpage information item by item according to the determined position.
3. The method according to claim 2, wherein the location identification information identifies the start position and the end position of the extractable webpage information.
4. The method according to any one of claims 1 to 3, further comprising: when the webpage framework changes, automatically analyzing the new webpage framework and updating the extraction rule.
5. The method according to any one of claims 1 to 4, wherein the webpage comprises a list page and/or a detail page.
6. The method according to claim 5, wherein the valuable information in the list page comprises:
information that differs between different list pages; or
information that differs between different entries of the same list.
7. The method according to claim 6, wherein automatically analyzing the webpage content and finding valuable information comprises:
searching for difference regions between different list pages, a difference region comprising the information that differs between the different list pages;
taking the longest difference region as the list region, the information recorded in the list region being valuable information.
8. The method according to claim 7, wherein automatically analyzing the webpage content and finding valuable information comprises:
comparing multiple entries in the list region;
recording the entries that differ from one another as valuable information.
9. The method according to any one of claims 5 to 8, wherein the valuable information in the detail page comprises:
information whose value remains constant within a specified duration, wherein this information at least includes information that carries a certain amount of information content and through which other links can be accessed; or
information that differs between different detail pages.
10. The method according to any one of claims 1 to 9, wherein, before automatically analyzing the webpage content, the method further comprises: performing in-page denoising on the webpage to be parsed.
11. Webpage information extraction equipment, comprising:
a rule generation module, configured to automatically analyze webpage content and find valuable information, wherein the valuable information includes the information in the webpage framework that can change; and to identify the valuable information, learn from it, and generate the corresponding extraction rule;
an acquisition module, configured to obtain the extraction rule automatically generated from the webpage content;
an extraction module, configured to extract webpage information using the extraction rule.
12. The equipment according to claim 11, wherein the rule generation module is further configured to automatically analyze the new webpage framework and update the extraction rule when the webpage framework changes.
13. The equipment according to claim 11 or 12, wherein the rule generation module is further configured to perform in-page denoising on the webpage to be parsed before automatically analyzing the webpage content.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310529500.XA CN103714116A (en) 2013-10-31 2013-10-31 Webpage information extracting method and webpage information extracting equipment

Publications (1)

Publication Number Publication Date
CN103714116A true CN103714116A (en) 2014-04-09

Family

ID=50407091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310529500.XA Pending CN103714116A (en) 2013-10-31 2013-10-31 Webpage information extracting method and webpage information extracting equipment

Country Status (1)

Country Link
CN (1) CN103714116A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN106570133A (en) * 2016-10-27 2017-04-19 任子行网络技术股份有限公司 Method and device for constructing visual webpage information extracting rule
CN106649392A (en) * 2015-11-03 2017-05-10 任子行网络技术股份有限公司 Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology
CN106776693A (en) * 2016-11-10 2017-05-31 福建中金在线信息科技有限公司 A kind of website data acquisition method and device
CN109727050A (en) * 2017-10-31 2019-05-07 北京国双科技有限公司 A kind of method and system obtaining monitoring of the advertisement analysis data
CN109783728A (en) * 2018-12-29 2019-05-21 安徽听见科技有限公司 Page crawler rule update method and system
CN110083754A (en) * 2019-04-23 2019-08-02 重庆紫光华山智安科技有限公司 The self-adapting data abstracting method of structure change webpage
CN111966881A (en) * 2020-10-14 2020-11-20 成都数联铭品科技有限公司 Webpage information extraction method and system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure
CN101968819A (en) * 2010-11-05 2011-02-09 中国传媒大学 Audio/video intelligent catalog information acquisition method facing to wide area network
CN102446225A (en) * 2012-01-11 2012-05-09 深圳市爱咕科技有限公司 Real-time search method, device and system
CN102567530A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages
CN102779170A (en) * 2012-06-25 2012-11-14 北京奇虎科技有限公司 System and method for identifying text floor of webpage

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN104462532B (en) * 2014-12-23 2017-07-07 北京奇虎科技有限公司 The method and apparatus that Web page text is extracted
CN106649392A (en) * 2015-11-03 2017-05-10 任子行网络技术股份有限公司 Method and apparatus for obtaining information based on what-you-see-is-what-you-get technology
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN106570133A (en) * 2016-10-27 2017-04-19 任子行网络技术股份有限公司 Method and device for constructing visual webpage information extracting rule
CN106570133B (en) * 2016-10-27 2019-07-23 任子行网络技术股份有限公司 A kind of construction method and device of visual webpage information extracting rule
CN106776693A (en) * 2016-11-10 2017-05-31 福建中金在线信息科技有限公司 A kind of website data acquisition method and device
CN109727050A (en) * 2017-10-31 2019-05-07 北京国双科技有限公司 A kind of method and system obtaining monitoring of the advertisement analysis data
CN109783728A (en) * 2018-12-29 2019-05-21 安徽听见科技有限公司 Page crawler rule update method and system
CN110083754A (en) * 2019-04-23 2019-08-02 重庆紫光华山智安科技有限公司 The self-adapting data abstracting method of structure change webpage
CN111966881A (en) * 2020-10-14 2020-11-20 成都数联铭品科技有限公司 Webpage information extraction method and system and electronic equipment

Similar Documents

Publication Publication Date Title
CN103714116A (en) Webpage information extracting method and webpage information extracting equipment
CN107423391B (en) Information extraction method of webpage structured data
CN103092817A (en) Data collection method and data collection device based on script engine
CN104077388A (en) Summary information extraction method and device based on search engine and search engine
CN102722563A (en) Method and device for displaying page
CN103631875A (en) Method for carrying out network search on browser side and browser
CN108334508B (en) Webpage information extraction method and device
CN105095067A (en) User interface element object identification and automatic test method and apparatus
CN103577552A (en) Webpage picture processing method and device
CN103761079A (en) Method and device for automatically graying page
US11263062B2 (en) API mashup exploration and recommendation
CN102902784B (en) Web page classification storage system and method
CN104598536B (en) A kind of distributed network information structuring processing method
US11403078B2 (en) Interface layout interference detection
CN104331438A (en) Method and device for selectively extracting content of novel webpage
CN104317931A (en) Webpage title determining method and device
CN102902794B (en) Web page classification system and method
CN102902792B (en) list page identification system and method
CN103853770A (en) Method and system for abstracting information of posts from forum website
CN104166545A (en) Webpage resource sniffing method and device
CN104361007A (en) Browser and processing method for browser favorites
CN105468776A (en) Method, device and system for operating database
CN102902791B (en) Web page classification storage system and method
CN104021193A (en) Search switching method and search switching device
CN105426500A (en) Extraction method and device of link dynamically generated by webpage scripts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140409