CN110020312A

CN110020312A - Method and apparatus for extracting web page text

Info

Publication number: CN110020312A
Application number: CN201711306108.3A
Authority: CN
Inventors: 贾宝玉; 李�杰; 周旭
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2019-07-16
Anticipated expiration: 2037-12-11
Also published as: CN110020312B

Abstract

The invention discloses a method and apparatus for extracting web page text, and relates to the field of computer technology. One specific embodiment of the method includes: constructing an Access Model from the web page to be extracted; calculating the similarity value between each unit region of the main part and the characteristic part; screening a unit text region from the Access Model according to the similarity values and the first index value of each unit region; and determining the beginning and end of the text of the web page to be extracted according to the unit text region, so as to obtain the complete text of the web page to be extracted. The embodiment can extract web page text accurately and completely, reduce labor cost, and improve the efficiency of extracting web page text.

Description

Method and apparatus for extracting web page text

Technical field

The present invention relates to the field of computer technology, and in particular to a method and apparatus for extracting web page text.

Background art

With the rapid development of society, the Internet has become the main platform for publishing and obtaining information, and the data on it keeps growing geometrically. Internet data covers every field of the real world, such as economy, politics, and culture, and is an important information source for many applications. However, besides the text that readers need, a web page also contains content unrelated to the text, such as copyright information, advertisements, navigation bars, and decorative elements, collectively referred to as noise information. How to filter out noise information and extract the text of a web page has become a hot research topic.

Current methods for extracting web page text fall into three categories: (1) template-based extraction; (2) extraction based on block text density; and (3) extraction based on visual web page segmentation. In template-based extraction, template information must be maintained manually, and the text content is then extracted according to that template information. In extraction based on block text density, a line-block distribution function is first obtained from the text ratio of each line, and the line blocks whose text ratio exceeds a threshold are then identified to determine the text content. In extraction based on visual web page segmentation, the web page is first divided into multiple page blocks according to visual information, and the page blocks are then merged using the separators in the HTML tags to obtain the web page text.

In the course of making the present invention, the inventors found at least the following problems in the prior art: (1) template-based extraction requires manual work, involves a heavy workload, and requires the template to be reconfigured whenever the structure of the web page changes; (2) extraction based on block text density has difficulty determining the beginning and end of the text, so the extracted text is often incomplete; (3) extraction based on visual web page segmentation requires engines such as JavaScript, is highly complex, and is very time-consuming; (4) none of the prior-art methods is applicable to extracting text from all types of web pages.

Summary of the invention

In view of this, embodiments of the present invention provide a method and apparatus for extracting web page text that can extract web page text accurately and completely, reduce labor cost, and improve the efficiency of extracting web page text.

To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting web page text is provided.

A method for extracting web page text according to an embodiment of the present invention includes: constructing an Access Model from the web page to be extracted, the Access Model including a characteristic part and a main part; calculating the similarity value between each unit region of the main part and the characteristic part; screening a unit text region from the Access Model according to the similarity values and the first index value of each unit region; and determining the beginning and end of the text of the web page to be extracted according to the unit text region, so as to obtain the complete text of the web page to be extracted.

Optionally, before the Access Model is constructed from the web page to be extracted, the method further includes: standardizing the source code of the web page to be extracted.

Optionally, calculating the similarity value between each unit region of the main part and the characteristic part includes: calculating the second index value of the characteristic part and the second index value of each unit region of the main part; and calculating the similarity value between the characteristic part and each unit region from these second index values.

Optionally, screening the unit text region from the Access Model according to the similarity values and the first index value of each unit region includes: selecting a suspected text region from the Access Model according to the first index value; and screening the unit text region from the suspected text region using the similarity values.

Optionally, screening the unit text region from the suspected text region using the similarity values includes: comparing the similarity values of the unit regions in the suspected text region, and choosing the unit region with the largest similarity value as the unit text region.

Optionally, determining the beginning and end of the text of the web page to be extracted according to the unit text region includes: iterating over the unit regions upward and downward, starting from the unit text region, and judging whether each unit region meets a preset text condition; when a unit region does not meet the preset text condition, the iteration stops, thereby determining the beginning and end of the text of the web page to be extracted.

Optionally, judging whether each unit region meets the preset text condition includes: judging whether the similarity value of the unit region is greater than a preset similarity threshold and, if so, determining that the unit region meets the preset text condition; and/or judging whether the link ratio of the unit region is less than a preset link ratio threshold and, if so, determining that the unit region meets the preset text condition; and/or judging whether the symbol ratio of the unit region is greater than a preset symbol ratio threshold and, if so, determining that the unit region meets the preset text condition.

Optionally, after the beginning and end of the text of the web page to be extracted are determined according to the unit text region, the method further includes: obtaining text additional information of the web page to be extracted, where the text additional information includes at least one of the following: text title, author, date, and source.

Optionally, the Access Model is a Document Object Model.

Optionally, each unit region is one line.

Optionally, the first index value indicates attribute information of each unit region and includes: the unit density of each unit region.

Optionally, the second index value indicates attribute information of a region in the web page and includes: a feature vector value.

To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for extracting web page text is provided.

An apparatus for extracting web page text according to an embodiment of the present invention includes: a construction module, configured to construct an Access Model from the web page to be extracted, the Access Model including a characteristic part and a main part; a calculation module, configured to calculate the similarity value between each unit region of the main part and the characteristic part; a screening module, configured to screen a unit text region from the Access Model according to the similarity values and the first index value of each unit region; and a determination module, configured to determine the beginning and end of the text of the web page to be extracted according to the unit text region, so as to obtain the complete text of the web page to be extracted.

Optionally, the construction module is further configured to: standardize the source code of the web page to be extracted before the Access Model is constructed from the web page to be extracted.

Optionally, the calculation module is further configured to: calculate the second index value of the characteristic part and the second index value of each unit region of the main part; and calculate the similarity value between the characteristic part and each unit region from these second index values.

Optionally, the screening module is further configured to: select a suspected text region from the Access Model according to the first index value; and screen the unit text region from the suspected text region using the similarity values.

Optionally, the screening module is further configured to: compare the similarity values of the unit regions in the suspected text region, and choose the unit region with the largest similarity value as the unit text region.

Optionally, the determination module is further configured to: iterate over the unit regions upward and downward, starting from the unit text region, and judge whether each unit region meets a preset text condition; when a unit region does not meet the preset text condition, stop the iteration, thereby determining the beginning and end of the text of the web page to be extracted.

Optionally, the determination module is further configured to: judge whether the similarity value of each unit region is greater than a preset similarity threshold and, if so, determine that the unit region meets the preset text condition; and/or judge whether the link ratio of each unit region is less than a preset link ratio threshold and, if so, determine that the unit region meets the preset text condition; and/or judge whether the symbol ratio of each unit region is greater than a preset symbol ratio threshold and, if so, determine that the unit region meets the preset text condition.

Optionally, the determination module is further configured to: obtain text additional information of the web page to be extracted, where the text additional information includes at least one of the following: text title, author, date, and source.

Optionally, the Access Model is a Document Object Model.

Optionally, each unit region is one line.

To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.

An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for extracting web page text of the embodiments of the present invention.

To achieve the above object, according to a further aspect of the embodiments of the present invention, a computer-readable medium is provided.

A computer-readable medium according to an embodiment of the present invention has a computer program stored thereon which, when executed by a processor, implements the method for extracting web page text of the embodiments of the present invention.

An embodiment of the above invention has the following advantages or beneficial effects: the beginning and end of the web page text can be determined, so the complete text of a web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting web page text; the source code of the web page to be extracted is standardized, which facilitates constructing the Access Model from the standardized source code and reduces the time needed to extract the text, and the method is applicable to extracting text from various types of web pages; the second index value of the characteristic part and the second index value of each unit region of the main part are calculated, so the similarity value between the characteristic part and each unit region can be calculated conveniently; a suspected text region is selected using the first index value of each unit region, which narrows the search range for the text and improves extraction efficiency; the similarity values of the unit regions in the suspected text region are compared and the unit region with the largest similarity value is taken as the unit text region, which improves extraction accuracy; the unit regions are iterated over upward and downward starting from the unit text region, so the beginning and end of the text can be determined and the complete text of the web page is extracted; whether each unit region meets the preset text condition is judged from multiple angles, such as the similarity value, the link ratio, and/or the symbol ratio, which further improves extraction accuracy; the text additional information of the web page to be extracted is obtained, which improves the completeness of the text; the first index value may include the unit density of each unit region, so the suspected text region can be selected using this attribute; and the second index value may include a feature vector value, so the similarity value can be calculated from feature vector values.

Further effects of the above non-conventional optional implementations are described below in connection with specific embodiments.

Brief description of the drawings

The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the main steps of the method for extracting web page text according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the main flow of the method for extracting web page text according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of standardized source code and its corresponding DOM tree;

Fig. 4 is a schematic diagram of the main steps of calculating the similarity value between each line of text and the characteristic information in the method for extracting web page text according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the main steps of screening out the text line in the method for extracting web page text according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the line density function obtained by the method for extracting web page text according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the main steps of determining the beginning and end of the text in the method for extracting web page text according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of the main modules of the apparatus for extracting web page text according to an embodiment of the present invention;

Fig. 9 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;

Fig. 10 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

Current methods for extracting web page text do not reach the desired level. Starting from the typical characteristics of web page text and taking into account the strengths and weaknesses of the prior art, the present invention provides a method for intelligently extracting web page text that can extract the text accurately and completely, reduce labor cost, and improve extraction efficiency. The characteristics of web page text may include: its sentences are long; it contains many sentences; it is correlated with the title; it is located in the middle of the web page; it contains a low proportion of hyperlinks; and it contains more punctuation marks than other modules.

Fig. 1 is a schematic diagram of the main steps of the method for extracting web page text according to an embodiment of the present invention. As shown in Fig. 1, the method mainly includes the following steps:

Step S101: construct an Access Model from the web page to be extracted. In the present invention, the Access Model may include a characteristic part and a main part. The characteristic part may store the characteristic information of the web page, for example the title, keywords, and abstract. The main part may store the text information of the web page.

Step S102: calculate the similarity value between each unit region of the main part and the characteristic part. In the embodiment of the present invention, the main part of the Access Model may be divided into multiple unit regions, and the similarity value between each unit region and the characteristic part is then calculated.

Step S103: screen a unit text region from the Access Model according to the similarity values and the first index value of each unit region. In the embodiment of the present invention, the similarity value between each unit region and the characteristic part is obtained in step S102 and is combined with the first index value of each unit region to judge whether that unit region is a unit text region.

Step S104: determine the beginning and end of the text of the web page to be extracted according to the unit text region, so as to obtain the complete text of the web page to be extracted.

In the embodiment of the present invention, before the Access Model is constructed from the web page to be extracted, the method for extracting web page text may further include: standardizing the source code of the web page to be extracted. Standardization includes removing scripting languages and converting special characters. To improve the visual experience of users, web page source code embeds a large amount of JS (JavaScript, a scripting language for the web that adds dynamic functions to web pages and gives users a smoother, more attractive browsing experience) and CSS (Cascading Style Sheets, a computer language for describing document style that can modify a web page statically and can also work with scripting languages to format page elements dynamically). These languages only decorate the page, are unrelated to the content of the web page text, and interfere heavily with text extraction, so content of this kind that is unrelated to the text can be removed. In addition, to simplify subsequent processing, special characters in the source code can be converted into their conventional form, for example converting &lt; to < and &gt; to >.

In the embodiment of the present invention, calculating the similarity value between each unit region of the main part and the characteristic part may include: calculating the second index value of the characteristic part and the second index value of each unit region of the main part; and calculating the similarity value between the characteristic part and each unit region from these second index values. The characteristic part may store the characteristic information of the web page, for example the title, keywords, and abstract, so the second index value of the characteristic part can be generated from this characteristic information and used as the second index model value. The second index model value and the second index value of each unit region are then used to calculate the similarity value between the characteristic part and each unit region. In the embodiment of the present invention, the cosine of the angle between the second index model value and the second index value of each unit region can be calculated with the cosine formula and used as the similarity value between the characteristic part and that unit region; the closer the cosine is to 1, the higher the similarity. Of course, the similarity value between the characteristic part and each unit region can also be obtained by other algorithms in the embodiments of the present invention, which is not limited here.
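The cosine-based similarity value can be illustrated with a short sketch. The function name cosine_similarity and the choice to return 0.0 for an all-zero vector are assumptions made for illustration; the source only states that the cosine of the two index values is used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a unit region with no feature words is treated
        # as completely dissimilar instead of raising an error.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction give a value close to 1, and vectors with no feature words in common give 0.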

In the embodiment of the present invention, screening the unit text region from the Access Model according to the similarity values and the first index value of each unit region may include: selecting a suspected text region from the Access Model according to the first index value; and screening the unit text region from the suspected text region using the similarity values. There may be one or more suspected text regions, and the unit text region may consist of one or more unit regions.

In the embodiment of the present invention, screening the unit text region from the suspected text region using the similarity values may include: comparing the similarity values of the unit regions in the suspected text region, and choosing the unit region with the largest similarity value as the unit text region. If several unit regions in the suspected text region share the largest similarity value, all of them may be taken as unit text regions, or any one of them may be chosen as the unit text region; the choice may also be made by other methods, which the present invention does not limit.
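A minimal sketch of this selection step, assuming the suspected text region is given as a list of unit-region indices and the similarity values as a parallel list. The function name is hypothetical, and the sketch keeps every tied maximum, which is one of the options the description allows.

```python
from typing import List, Sequence


def pick_unit_text_regions(similarities: Sequence[float],
                           suspected: Sequence[int]) -> List[int]:
    """Return the suspected unit regions that share the largest similarity."""
    if not suspected:
        return []
    best = max(similarities[i] for i in suspected)
    # Keep all ties; a caller may equally pick any single one of them.
    return [i for i in suspected if similarities[i] == best]
```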

In the embodiment of the present invention, determining the beginning and end of the text of the web page to be extracted according to the unit text region may include: iterating over the unit regions upward and downward, starting from the unit text region, and judging whether each unit region meets a preset text condition; when a unit region does not meet the preset text condition, the iteration stops, thereby determining the beginning and end of the text of the web page to be extracted. After the unit text region has been screened out in step S103, the unit regions above it are traversed first. The unit region immediately above the unit text region is checked against the preset condition: if it meets the condition, it belongs to the web page text and the iteration continues upward; if it does not, it does not belong to the web page text, and the beginning of the web page text is thereby determined. The same method is then applied downward from the unit text region to determine the end of the web page text.
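The upward and downward traversal can be sketched as follows. The names expand_from_seed and is_body are hypothetical; is_body stands for whatever preset text condition is in use, and unit regions are assumed to be indexed from 0 in page order.

```python
from typing import Callable, Tuple


def expand_from_seed(n_units: int, seed: int,
                     is_body: Callable[[int], bool]) -> Tuple[int, int]:
    """Grow a span around the seed until the condition fails on each side."""
    start = seed
    # Walk upward; stop at the first unit region that fails the condition.
    while start - 1 >= 0 and is_body(start - 1):
        start -= 1
    end = seed
    # Walk downward in the same way to find the end of the text.
    while end + 1 < n_units and is_body(end + 1):
        end += 1
    return start, end
```

The returned pair gives the indices of the first and last unit regions of the text, both inclusive.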

In the embodiment of the present invention, judging whether each unit region meets the preset text condition may include: judging whether the similarity value of the unit region is greater than a preset similarity threshold and, if so, determining that the unit region meets the preset text condition; and/or judging whether the link ratio of the unit region is less than a preset link ratio threshold and, if so, determining that the unit region meets the preset text condition; and/or judging whether the symbol ratio of the unit region is greater than a preset symbol ratio threshold and, if so, determining that the unit region meets the preset text condition. The preset similarity threshold may be obtained as the arithmetic mean of the similarity values of the unit regions, or calculated by other methods. The link ratio may be the ratio of the number of links to the number of characters, and the symbol ratio may be the ratio of the number of symbols to the number of characters.
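One possible reading of the preset text condition is sketched below. The Unit class, the punctuation set, and the decision to require all three checks together are assumptions; the description permits any "and/or" combination and does not give threshold values.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumption: a small set of Western and Chinese punctuation marks.
PUNCTUATION = set(",.;:!?，。；：！？、")


@dataclass
class Unit:
    text: str          # text of the unit region (one line)
    link_count: int    # number of links in the unit region
    similarity: float  # similarity value to the characteristic part


def link_ratio(unit: Unit) -> float:
    """Number of links divided by number of characters."""
    return unit.link_count / len(unit.text) if unit.text else 0.0


def symbol_ratio(unit: Unit) -> float:
    """Number of punctuation marks divided by number of characters."""
    if not unit.text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in unit.text) / len(unit.text)


def mean_similarity(units: Sequence[Unit]) -> float:
    """Arithmetic mean of the similarity values, usable as the threshold."""
    return sum(u.similarity for u in units) / len(units) if units else 0.0


def meets_text_condition(unit: Unit, sim_threshold: float,
                         max_link_ratio: float,
                         min_symbol_ratio: float) -> bool:
    # Assumption: all three checks must pass; "or" is equally permitted.
    return (unit.similarity > sim_threshold
            and link_ratio(unit) < max_link_ratio
            and symbol_ratio(unit) > min_symbol_ratio)
```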

In the embodiment of the present invention, after the beginning and end of the text of the web page to be extracted are determined according to the unit text region, the method for extracting web page text may further include: obtaining text additional information of the web page to be extracted. The text additional information may include at least one of the following: text title, author, date, and source. In the embodiment of the present invention, the text title can be located in the main part using the characteristic information of the characteristic part. Once the positions of the text title and the text have been determined, information such as author, date, and source can be extracted with regular expressions (a computer science concept commonly used to search for and replace text that matches a pattern). The date is generally located between the title and the text content and follows a regular pattern, so it can be extracted with a regular expression. Information such as source and author is usually located between the title and the text content or below the text, and also follows a regular pattern, so it can likewise be extracted with regular expressions.
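A sketch of regular-expression extraction for the additional information. The patterns below are hypothetical examples (a year-month-day date and "Author:" / "Source:" labels in English or Chinese); the source does not give its actual expressions, and real pages would need site-appropriate patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, chosen only for illustration.
DATE_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")
AUTHOR_RE = re.compile(r"(?:Author|作者)\s*[:：]\s*(\S+)")
SOURCE_RE = re.compile(r"(?:Source|来源)\s*[:：]\s*(\S+)")


def extract_additional_info(snippet: str) -> Dict[str, Optional[str]]:
    """Search the text between title and body (or below the body)."""
    date = DATE_RE.search(snippet)
    author = AUTHOR_RE.search(snippet)
    source = SOURCE_RE.search(snippet)
    return {
        "date": date.group(0) if date else None,
        "author": author.group(1) if author else None,
        "source": source.group(1) if source else None,
    }
```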

In the embodiment of the present invention, the Access Model may be a Document Object Model, for example a DOM tree (the Document Object Model, DOM for short, is the standard programming interface recommended by the World Wide Web Consortium for processing extensible markup languages).

In the embodiment of the present invention, each unit region may be one line. Of course, other units may also be chosen in the embodiments of the present invention.

In the embodiment of the present invention, the first index value indicates attribute information of each unit region and may include the unit density of each unit region. For ease of understanding, the unit density is described in detail below with the unit region taken to be a line, so that "unit density" becomes "line density". The "line" is not intended to limit the protection scope of the technical solution of the present invention, and "line density" can be adapted to the specific business scenario. In the embodiment of the present invention, the unit density can be obtained by the following calculation. First, obtain the line block of each line. Taking line 1 as an example, take k lines downward, where k is set according to the specific situation; when k is 3, the line block of line 1 is the text of lines 1 to 4. Then, calculate the line block length of each line. Taking the line block of line 1 as an example, remove the whitespace characters from the line block, count the total number of characters in it, and add (number of punctuation marks in line 1 × k). Because the text of a web page contains punctuation marks while other parts generally do not, the term (number of punctuation marks × k) acts as a weight. Finally, the line density of each line is: line block length / (k + 1). Other methods may also be chosen to calculate the unit density in the present invention, which is not limited here.
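The line density calculation can be sketched as follows. The punctuation set and the handling of the last k lines (where fewer than k following lines exist but the divisor stays k + 1) are assumptions; the source does not say how the end of the page is treated.

```python
from typing import List

# Assumption: a small set of Western and Chinese punctuation marks.
PUNCTUATION = set(",.;:!?，。；：！？、")


def line_densities(lines: List[str], k: int = 3) -> List[float]:
    """Line density = (block characters + punctuation * k) / (k + 1)."""
    densities = []
    for i, line in enumerate(lines):
        # Line block: line i plus the next k lines (fewer near the end).
        block = lines[i:i + k + 1]
        # Count characters after removing all whitespace from the block.
        chars = sum(len("".join(row.split())) for row in block)
        # Punctuation marks of line i itself, weighted by k.
        punct = sum(ch in PUNCTUATION for ch in line)
        densities.append((chars + punct * k) / (k + 1))
    return densities
```

Lines inside the text tend to score high because their neighbors are long and punctuated, while isolated navigation or footer lines score low.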

In the embodiment of the present invention, the second index value indicates attribute information of a region in the web page and may include a feature vector value. In the present invention, the feature vector value of the characteristic part and the feature vector value of each unit region can be used to calculate the similarity value between the characteristic part and each unit region.
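As a rough illustration of a feature vector value, the sketch below counts how often each feature word occurs in a token list and optionally scales each count by a per-word weight. The function name, the whitespace tokenization used in the example, and the weights parameter are assumptions; Chinese text would need a real word segmenter.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def feature_vector(tokens: Sequence[str], vocabulary: Sequence[str],
                   weights: Optional[Dict[str, float]] = None) -> List[float]:
    """One component per feature word: count in tokens times its weight."""
    counts = Counter(tokens)
    weights = weights or {}
    # Words without an explicit weight default to 1.0 (plain counts).
    return [counts[word] * weights.get(word, 1.0) for word in vocabulary]
```

For example, feature_vector("extract web text from web page".split(), ["web", "text", "extract"]) counts "web" twice and the other two feature words once each.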

For ease of understanding, Fig. 2 to Fig. 7 describe the embodiment of the present invention in detail with the line as the unit, the "Access Model" taken to be a "DOM tree", the "first index value" taken to be the "line density", and the "second index value" taken to be the "feature vector value". Of course, taking the line as the unit is not intended to limit the protection scope of the technical solution of the present invention, and the "DOM tree", "line density", and "feature vector value" can be adapted to the specific business scenario.

Fig. 2 is a schematic diagram of the main flow of the method for extracting web page text according to an embodiment of the present invention. As shown in Fig. 2, the method mainly includes the following flow: step S201, load the source code of the web page to be extracted and standardize it; step S202, construct a text DOM tree from the standardized source code; step S203, extract the characteristic information of the web page from the DOM tree and determine the title information of the web page text; step S204, calculate the similarity value between each line of text and the characteristic information, and the line density of each line of text; step S205, select a suspected text block according to the similarity values and line densities, then screen out the text line from the suspected text block; step S206, iterate over the lines upward and downward from the text line to determine the beginning and end of the text; step S207, determine the additional information of the text.

In step S201, the source code of the web page to be extracted is loaded and standardized. The detailed process may include: loading the source code of the web page to be extracted with Jsoup (a software package for parsing web page content); analyzing the source code and converting the format of the loaded source code; removing scripting languages such as JS and CSS; and processing special characters.

In step S202, a text DOM tree is constructed from the standardized source code. Fig. 3 is a schematic diagram of standardized source code and its corresponding DOM tree. In the present invention, the DOM tree can be constructed with Jsoup and then stored in the form of text information paired with its corresponding node tag group, forming a text list. Each line is handled as one object: one line is one piece of text and corresponds to one tag group, and the text list also records the order of the line in the page and the number of links, punctuation marks, and characters in the line. For the DOM tree of Fig. 3, the "text information with corresponding node tag group" form may be as follows:

"HTML Tree": html → head → title → text;

"hello!": html → body → table → tr → td → text;

"this is a HTML tree.": html → body → table → tr → td → text.

Here the text information "hello!" and "this is a HTML tree." correspond to the same node tag group.

In step S203, the feature information of the web page is extracted from the DOM tree, and the title information of the body text is determined. The DOM tree exposes both the feature information and the main-body information of the web page: the head tag of the DOM tree holds the feature information, for example the title content, keywords and abstract, while the body tag holds the text content of the page. Based on the DOM tree, text information such as the title content, keywords and abstract is extracted through tag paths such as html → head → title → text. With the extracted title content, the position of the title within the body tag can then be found.
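The text list of step S202 and the title lookup of step S203 can be sketched as follows. This is a minimal Python illustration based on the standard `html.parser` module, not the Jsoup-based implementation of the embodiment; the class `TextListBuilder`, the helpers `parse_page` and `find_title`, and the row fields are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TextListBuilder(HTMLParser):
    """Collects text rows with their tag paths, plus keyword/description meta tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.rows = []    # text list, one dict per row in page order
        self.meta = {}    # e.g. {"keywords": "...", "description": "..."}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attr.get("content") or ""
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed inner tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.rows.append({
                "index": len(self.rows),
                "text": text,
                "path": " → ".join(self.stack + ["text"]),
                "chars": len(text),
                "in_link": "a" in self.stack,
            })

def parse_page(source):
    builder = TextListBuilder()
    builder.feed(source)
    builder.close()
    return builder.rows, builder.meta

def find_title(rows):
    """Return (title text, index of the first body row repeating it, or None)."""
    title = next((r["text"] for r in rows
                  if r["path"] == "html → head → title → text"), "")
    for row in rows:
        if title and row["path"].startswith("html → body") and row["text"] == title:
            return title, row["index"]
    return title, None
```

Applied to markup like that of Fig. 3, the two table cells receive the same tag path, which matches the observation that their node tag groups are identical.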

In step S204, the similarity value between each text line and the feature information, and the line density of each text line, are calculated. Step S203 yields the feature information of the web page, namely the title content, keywords, abstract and similar information; the body text of a web page is correlated to some degree with this information.

Fig. 4 is a schematic diagram of the main steps of calculating the similarity value between each text line and the feature information in the method for extracting web page body text according to an embodiment of the present invention. As shown in Fig. 4, these steps may include: step S401, perform stop-word removal and word segmentation on the feature information to obtain n feature words, and count the word frequency of each feature word, where stop words are characters or words that are automatically filtered out before or after processing natural-language data in information retrieval in order to save storage space and improve search efficiency; step S402, calculate the TF-IDF value of each feature word according to the TF-IDF algorithm, where TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining, and the more important a word is to an article, the larger its TF-IDF value; step S403, obtain a group of feature values as the model feature vector of the web page to be extracted, D = D(W_1, W_2, …, W_n), where W_1 is the word frequency of the 1st feature word multiplied by the TF-IDF value of the 1st feature word; step S404, traverse each text line, segment it into words, and calculate the feature vector of each line; step S405, calculate the cosine of the angle between the feature vector of each line and the model feature vector as the similarity value between that text line and the feature information, the cosine formula being expressed as:

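$$\mathrm{Sim}(D, D_i) = \frac{\sum_{k=1}^{n} W_k W_{ik}}{\sqrt{\sum_{k=1}^{n} W_k^2}\,\sqrt{\sum_{k=1}^{n} W_{ik}^2}}$$

where Sim(D, D_i) denotes the similarity value between the i-th text line and the model feature vector, and D_i = D(W_i1, W_i2, …, W_in) denotes the feature vector of the i-th line.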
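Steps S403 to S405 can be illustrated with a short Python sketch. It assumes that the text has already been segmented into tokens and that the TF-IDF value of each feature word is supplied from elsewhere (computing it requires a document corpus, which this section does not describe); the function names and the sample values are hypothetical.

```python
import math
from collections import Counter

def feature_vector(tokens, feature_words, tfidf):
    """W_k = (frequency of feature word k in tokens) * (its TF-IDF value)."""
    counts = Counter(tokens)
    return [counts[word] * tfidf[word] for word in feature_words]

def cosine_similarity(d, d_i):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(d, d_i))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in d_i))
    return dot / norm if norm else 0.0

# Hypothetical feature words and TF-IDF values, for illustration only.
feature_words = ["web", "text", "extract"]
tfidf = {"web": 0.5, "text": 0.3, "extract": 0.8}

model = feature_vector(["web", "text", "extract", "web"], feature_words, tfidf)
body_row = feature_vector(["extract", "web", "text"], feature_words, tfidf)
noise_row = feature_vector(["login", "register"], feature_words, tfidf)
```

A row that shares no feature word with the title, keywords or abstract gets a zero vector and therefore a similarity of 0, which is why navigation and copyright lines tend to score low.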

In step S205, suspected body-text blocks are selected according to the similarity values and line densities, and the body-text line is then screened out from the suspected body-text blocks. Fig. 5 is a schematic diagram of the main steps of screening out the body-text line in the method for extracting web page body text according to an embodiment of the present invention. As shown in Fig. 5, these steps may include: step S501, obtain a line density function from the line density of each text line; step S502, obtain the suspected body-text blocks from the regions where the line density function rises and falls sharply; step S503, traverse the suspected body-text blocks and take the text line with the largest similarity value as the body-text line.
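Step S503 amounts to an arg-max over the rows of the suspected blocks. A minimal Python sketch follows, assuming the blocks are given as half-open `(start, end)` row index pairs and the similarity values as a list indexed by row; both representations are assumptions of the example.

```python
def pick_body_row(blocks, similarities):
    """Return the row index with the largest similarity inside the blocks, or None."""
    candidates = [i for start, end in blocks for i in range(start, end)]
    if not candidates:
        return None
    return max(candidates, key=lambda i: similarities[i])
```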

Fig. 6 is a schematic diagram of the line density function obtained in the method for extracting web page body text according to an embodiment of the present invention. In Fig. 6, the horizontal axis is the line number of each line and the vertical axis is the line density of each line. The position of each suspected body-text block is obtained from the sharp rises and sharp falls of this line density function. For example, let the points on the horizontal axis be X1, …, Xn and the corresponding points on the vertical axis be Y(X1), …, Y(Xn); what needs to be determined is the start position Xstart and the end position Xend of the body text. Specifically, the algorithm for determining a suspected body-text block may be as follows:

(1) determine the sharp-rise point Xstart, which satisfies Y(Xstart) − Y(X(start−1)) > Y(Xt) × 30%, where Y(Xt) is the maximum value of the line density;

(2) to avoid noise, require Y(X(start+1)) ≠ 0;

(3) Y(Xend) = 0, i.e. the line density at the sharp-fall point is 0, indicating the end;

(4) ensure that between Xstart and Xend the line density reaches 80 percent of the maximum line density, i.e. Y(Xt) × 80%.

With the above algorithm, lines 49 to 73 and lines 91 to 97 in Fig. 6 are obtained as suspected body-text blocks. Of course, in the embodiments of the present invention, other methods may also be chosen to obtain the suspected body-text blocks, and the present invention is not limited in this respect.
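Conditions (1) to (4) above can be sketched in Python as follows. The line density is taken as a plain list of numbers indexed by row; how the density of a row is computed is not restated in this passage, so the sample values are invented, and the function name and the reading of condition (4) as "some row in the block reaches 80% of the maximum" are assumptions.

```python
def find_suspected_blocks(density, rise=0.3, peak=0.8):
    """Return (start, end) pairs; end is the first zero-density row after start."""
    if not density:
        return []
    y_max = max(density)
    n = len(density)
    blocks = []
    i = 1
    while i < n - 1:
        surge = density[i] - density[i - 1] > rise * y_max       # condition (1)
        if surge and density[i + 1] != 0:                         # condition (2)
            end = i + 1
            while end < n and density[end] != 0:                  # condition (3)
                end += 1
            if end < n and max(density[i:end]) >= peak * y_max:   # condition (4)
                blocks.append((i, end))
            i = end
        else:
            i += 1
    return blocks
```

Condition (2) discards isolated one-row spikes, and condition (4) discards runs that rise sharply but never approach the densest part of the page.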

In step S206, the lines above and below the body-text line are iteratively traversed to determine the beginning and end of the body text. Fig. 7 is a schematic diagram of the main steps of determining the beginning and end of the body text in the method for extracting web page body text according to an embodiment of the present invention. As shown in Fig. 7, these steps may include: step S701, determine the node tag group corresponding to the body-text line; step S702, determine the position of the node tag group in the DOM tree, and extract the text of that node tag group; step S703, iteratively traverse upward and downward, line by line, with the node tag group as the center; step S704, judge whether the similarity value of each line is greater than a preset similarity threshold; step S705, if it is greater, extract the text of the node tag group corresponding to that text line and continue the iteration; step S706, if it is not greater, stop the iteration and determine the beginning and end of the body text of the web page to be extracted.

In the embodiment of the present invention, whether a line meets the preset body-text condition is determined by comparing the similarity value of the line with the similarity threshold. Of course, in the present invention the link ratio or the symbol ratio of each line may also be used to determine whether the line meets the preset body-text condition.
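The traversal of steps S703 to S706 can be sketched as below. For brevity the sketch works directly on row indices and similarity values instead of node tag groups in the DOM tree, which is a simplification; the function name and the sample threshold are hypothetical.

```python
def expand_body(similarities, seed, threshold):
    """Return the inclusive (first, last) row indices of the body text."""
    first = last = seed
    # Walk upward while the row above still exceeds the threshold.
    while first - 1 >= 0 and similarities[first - 1] > threshold:
        first -= 1
    # Walk downward while the row below still exceeds the threshold.
    while last + 1 < len(similarities) and similarities[last + 1] > threshold:
        last += 1
    return first, last
```

The iteration stops in each direction at the first row that fails the condition, so a later row with a high similarity value is not pulled into the body text.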

In step S207, the additional information of the body text is determined. The additional information may include the author, date and source. Since the positions of the title content and the body text in the DOM tree have been found in the preceding steps, information such as the author, date and source can be extracted by regular expressions.
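A regular-expression extraction of this kind might look like the following Python sketch. The patterns, the field labels they look for ("作者"/"Author", "来源"/"Source") and the date formats are assumptions for the example; the embodiment does not give its actual expressions.

```python
import re

DATE_RE = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})")
AUTHOR_RE = re.compile(r"(?:作者|Author)\s*[:：]\s*(\S+)")
SOURCE_RE = re.compile(r"(?:来源|Source)\s*[:：]\s*(\S+)")

def extract_additional_info(lines):
    """Scan the lines near the title and body text; keep the first match per field."""
    info = {"author": None, "date": None, "source": None}
    for line in lines:
        date = DATE_RE.search(line)
        if date and info["date"] is None:
            # Normalize to YYYY-MM-DD with zero-padded month and day.
            info["date"] = "-".join(part.zfill(2) for part in date.groups())
        author = AUTHOR_RE.search(line)
        if author and info["author"] is None:
            info["author"] = author.group(1)
        source = SOURCE_RE.search(line)
        if source and info["source"] is None:
            info["source"] = source.group(1)
    return info
```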

As can be seen from the technical solution for extracting web page body text according to the embodiment of the present invention, the beginning and end of the body text of a web page can be determined, so that the complete body text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting web page body text. In the embodiments of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the body text, and makes the method applicable to body-text extraction for various types of web pages. The second index value of the feature part and the second index value of each unit region of the main part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. Suspected body-text regions are selected by the first index value of each unit region, which narrows the selection range of the body text and improves extraction efficiency. The similarity values of the unit regions within a suspected body-text region are compared, and the unit region with the largest similarity value is taken as the unit body-text region, which improves extraction accuracy. Unit regions are iteratively traversed upward and downward with the unit body-text region as the center, so that the beginning and end of the body text can be determined and the complete body text is extracted. Whether each unit region meets the preset body-text condition is judged from multiple angles, such as the similarity value, the link ratio and/or the symbol ratio, which further improves extraction accuracy. The body-text additional information of the web page to be extracted is obtained, which improves the completeness of the body text. The first index value may include the unit density of each unit region, so that suspected body-text regions can be selected using the unit density as attribute information. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.

Fig. 8 is a schematic diagram of the main modules of the device for extracting web page body text according to an embodiment of the present invention. As shown in Fig. 8, the device 800 for extracting web page body text of the present invention mainly includes the following modules: a construction module 801, a calculation module 802, a screening module 803 and a determination module 804.

The construction module 801 may be used to construct an access model from the web page to be extracted; the access model may include a feature part and a main part. The calculation module 802 may be used to calculate the similarity value between each unit region of the main part and the feature part. The screening module 803 may be used to screen a unit body-text region out of the access model according to the similarity values and the first index value of each unit region. The determination module 804 may be used to determine the beginning and end of the body text of the web page to be extracted according to the unit body-text region, so as to obtain the complete body text of the web page to be extracted.

In the embodiment of the present invention, the construction module 801 may further be used to standardize the source code of the web page to be extracted before the access model is constructed from the web page to be extracted.

In the embodiment of the present invention, the calculation module 802 may further be used to: calculate the second index value of the feature part and the second index value of each unit region of the main part; and calculate the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region.

In the embodiment of the present invention, the screening module 803 may further be used to: select suspected body-text regions from the access model according to the first index value; and screen the unit body-text region out of the suspected body-text regions using the similarity values.

In the embodiment of the present invention, the screening module 803 may further be used to compare the similarity values of the unit regions within the suspected body-text regions and choose the unit region with the largest similarity value as the unit body-text region.

In the embodiment of the present invention, the determination module 804 may further be used to iteratively traverse unit regions upward and downward with the unit body-text region as the center, judge whether each unit region meets the preset body-text condition, and stop the iteration if the preset body-text condition is not met, thereby determining the beginning and end of the body text of the web page to be extracted.

In the embodiment of the present invention, the determination module 804 may further be used to: judge whether the similarity value of each unit region is greater than a preset similarity threshold, and if so, determine that the unit region meets the preset body-text condition; and/or judge whether the link ratio of each unit region is less than a preset link ratio threshold, and if so, determine that the unit region meets the preset body-text condition; and/or judge whether the symbol ratio of each unit region is greater than a preset symbol ratio threshold, and if so, determine that the unit region meets the preset body-text condition.
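One possible form of the preset body-text condition is sketched below in Python. The definitions used for the link ratio (share of characters inside hyperlinks) and the symbol ratio (share of punctuation characters), the threshold values, and the choice to require all three checks together are assumptions of the example; the embodiment allows the checks to be used individually or in combination.

```python
from dataclasses import dataclass

PUNCTUATION = set(",.!?;:，。！？；：、")

@dataclass
class UnitRegion:
    text: str          # full text of the unit region (one line)
    link_text: str     # the part of the text that sits inside hyperlinks
    similarity: float  # similarity value to the feature part

def link_ratio(region):
    """Assumed definition: characters inside links divided by all characters."""
    return len(region.link_text) / len(region.text) if region.text else 1.0

def symbol_ratio(region):
    """Assumed definition: punctuation characters divided by all characters."""
    if not region.text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in region.text) / len(region.text)

def meets_body_condition(region, sim_min=0.3, link_max=0.5, symbol_min=0.02):
    """Combine the three checks with a logical AND (one of several options)."""
    return (region.similarity > sim_min
            and link_ratio(region) < link_max
            and symbol_ratio(region) > symbol_min)
```

The intuition is that navigation bars consist almost entirely of link text and contain little punctuation, whereas body-text lines show the opposite pattern.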

In the embodiment of the present invention, the determination module 804 may further be used to obtain the body-text additional information of the web page to be extracted. The body-text additional information may include at least one of the following: body-text title, author, date and source.

In the embodiment of the present invention, the access model may be a Document Object Model (DOM).

In the embodiment of the present invention, each unit region may take a line as its unit.

In the embodiment of the present invention, the first index value may be used to indicate attribute information of each unit region, including the unit density of each unit region.

In the embodiment of the present invention, the second index value may be used to indicate attribute information of a certain region of the web page, including a feature vector value.

As can be seen from the above description, the beginning and end of the body text of a web page can be determined, so that the complete body text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting web page body text. In the embodiments of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the body text, and makes the method applicable to body-text extraction for various types of web pages. The second index value of the feature part and the second index value of each unit region of the main part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. Suspected body-text regions are selected by the first index value of each unit region, which narrows the selection range of the body text and improves extraction efficiency. The similarity values of the unit regions within a suspected body-text region are compared, and the unit region with the largest similarity value is taken as the unit body-text region, which improves extraction accuracy. Unit regions are iteratively traversed upward and downward with the unit body-text region as the center, so that the beginning and end of the body text can be determined and the complete body text is extracted. Whether each unit region meets the preset body-text condition is judged from multiple angles, such as the similarity value, the link ratio and/or the symbol ratio, which further improves extraction accuracy. The body-text additional information of the web page to be extracted is obtained, which improves the completeness of the body text. The first index value may include the unit density of each unit region, so that suspected body-text regions can be selected using the unit density as attribute information. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.

Fig. 9 shows an exemplary system architecture 900 to which the method for extracting web page body text or the device for extracting web page body text of the embodiments of the present invention can be applied.

As shown in Fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904 and a server 905. The network 904 is the medium used to provide communication links between the terminal devices 901, 902, 903 and the server 905. The network 904 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 901, 902, 903 to interact with the server 905 through the network 904 in order to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 901, 902, 903, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).

The terminal devices 901, 902, 903 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 905 may be a server providing various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 901, 902, 903. The back-end management server may analyze and otherwise process received data such as information query requests, and feed the processing results (for example target push information or product information, examples only) back to the terminal devices.

It should be noted that the method for extracting web page body text provided by the embodiments of the present invention is generally executed by the server 905; correspondingly, the device for extracting web page body text is generally arranged in the server 905.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 9 are merely illustrative. There may be any number of terminal devices, networks and servers, as required by the implementation.

Referring now to Fig. 10, a schematic structural diagram is shown of a computer system 1000 suitable for implementing a terminal device of the embodiments of the present invention. The terminal device shown in Fig. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in Fig. 10, the computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for the operation of the system 1000. The CPU 1001, the ROM 1002 and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a loudspeaker and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.

In particular, according to the embodiments disclosed in the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment disclosed in the present invention includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, the above functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by, or in connection with, an instruction execution system, apparatus or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and such a medium can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. The program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF and the like, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions and operations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and any combination of boxes in a block diagram or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be arranged in a processor; for example, it may be described as: a processor includes a construction module, a calculation module, a screening module and a determination module. The names of these modules do not, in certain cases, constitute a limitation on the modules themselves; for example, the construction module may also be described as "a module that constructs an access model from the web page to be extracted".

In another aspect, the present invention also provides a computer-readable medium, which may be included in the equipment described in the above embodiments, or may exist separately without being assembled into that equipment. The computer-readable medium carries one or more programs which, when executed by the equipment, cause the equipment to: construct an access model from the web page to be extracted; calculate the similarity value between each unit region of the main part and the feature part; screen a unit body-text region out of the access model according to the similarity values and the first index value of each unit region; and determine the beginning and end of the body text of the web page to be extracted according to the unit body-text region, so as to obtain the complete body text of the web page to be extracted.

According to the technical solution of the embodiments of the present invention, the beginning and end of the body text of a web page can be determined, so that the complete body text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting web page body text. In the embodiments of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the body text, and makes the method applicable to body-text extraction for various types of web pages. The second index value of the feature part and the second index value of each unit region of the main part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. Suspected body-text regions are selected by the first index value of each unit region, which narrows the selection range of the body text and improves extraction efficiency. The similarity values of the unit regions within a suspected body-text region are compared, and the unit region with the largest similarity value is taken as the unit body-text region, which improves extraction accuracy. Unit regions are iteratively traversed upward and downward with the unit body-text region as the center, so that the beginning and end of the body text can be determined and the complete body text is extracted. Whether each unit region meets the preset body-text condition is judged from multiple angles, such as the similarity value, the link ratio and/or the symbol ratio, which further improves extraction accuracy. The body-text additional information of the web page to be extracted is obtained, which improves the completeness of the body text. The first index value may include the unit density of each unit region, so that suspected body-text regions can be selected using the unit density as attribute information. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.

The above specific embodiments do not constitute a limitation on the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions may occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A method for extracting web page body text, characterized by comprising:

constructing an access model from a web page to be extracted, the access model comprising a feature part and a main part;

calculating a similarity value between each unit region of the main part and the feature part;

screening a unit body-text region out of the access model according to the similarity value and a first index value of each unit region; and

determining the beginning and end of the body text of the web page to be extracted according to the unit body-text region, so as to obtain the complete body text of the web page to be extracted.

2. The method according to claim 1, characterized in that, before constructing the access model from the web page to be extracted, the method further comprises: standardizing the source code of the web page to be extracted.

3. The method according to claim 1, characterized in that calculating the similarity value between each unit region of the main part and the feature part comprises:

calculating a second index value of the feature part and a second index value of each unit region of the main part; and

calculating the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region.

4. The method according to claim 1, characterized in that screening the unit body-text region out of the access model according to the similarity value and the first index value of each unit region comprises:

selecting suspected body-text regions from the access model according to the first index value; and

screening the unit body-text region out of the suspected body-text regions using the similarity value.

5. The method according to claim 4, characterized in that screening the unit body-text region out of the suspected body-text regions using the similarity value comprises:

comparing the similarity values of the unit regions within the suspected body-text regions, and choosing the unit region with the largest similarity value as the unit body-text region.

6. The method according to claim 1, characterized in that determining the beginning and end of the body text of the web page to be extracted according to the unit body-text region comprises:

iteratively traversing unit regions upward and downward with the unit body-text region as the center, judging whether each unit region meets a preset body-text condition, and stopping the iteration if the preset body-text condition is not met, thereby determining the beginning and end of the body text of the web page to be extracted.

7. The method according to claim 6, characterized in that judging whether each unit region meets the preset body-text condition comprises:

judging whether the similarity value of each unit region is greater than a preset similarity threshold, and if so, determining that the unit region meets the preset body-text condition; and/or

judging whether the link ratio of each unit region is less than a preset link ratio threshold, and if so, determining that the unit region meets the preset body-text condition; and/or

judging whether the symbol ratio of each unit region is greater than a preset symbol ratio threshold, and if so, determining that the unit region meets the preset body-text condition.

8. The method according to claim 1, characterized in that, after determining the beginning and end of the body text of the web page to be extracted according to the unit body-text region, the method further comprises: obtaining body-text additional information of the web page to be extracted, wherein the body-text additional information comprises at least one of the following: body-text title, author, date and source.

9. the method according to claim 1, wherein the Access Model is document object model.

10. The method according to claim 1, characterized in that each unit region takes a line as its unit.

11. The method according to claim 1, characterized in that the first index value is used to indicate attribute information of each unit region, and comprises: the unit intensity of each unit region.

12. The method according to claim 3, characterized in that the second index value is used to indicate attribute information of a region in the web page, and comprises: a feature vector value.

13. An apparatus for extracting web page text, characterized by comprising:

A construction module, configured to construct an access model according to a web page to be extracted, the access model comprising: a feature part and a main part;

A computing module, configured to calculate the similarity value between each unit region of the main part and the feature part;

A screening module, configured to screen a unit text region from the access model according to the similarity value and a first index value of each unit region;

A determining module, configured to determine the start and end of the text of the web page to be extracted according to the unit text region, so as to obtain the complete text of the web page to be extracted.

14. The apparatus according to claim 13, characterized in that the construction module is further configured to: standardize the source code of the web page to be extracted before the access model is constructed according to the web page to be extracted.

15. The apparatus according to claim 13, characterized in that the computing module is further configured to:

16. The apparatus according to claim 13, characterized in that the screening module is further configured to:

17. The apparatus according to claim 16, characterized in that the screening module is further configured to:

18. The apparatus according to claim 13, characterized in that the determining module is further configured to:

19. The apparatus according to claim 18, characterized in that the determining module is further configured to:

20. The apparatus according to claim 13, characterized in that the determining module is further configured to: obtain additional text information of the web page to be extracted, wherein the additional text information comprises at least one of the following: text title, author, date, and source.

21. The apparatus according to claim 13, characterized in that the access model is a document object model.

22. The apparatus according to claim 13, characterized in that each unit region takes a line as its unit.

23. The apparatus according to claim 13, characterized in that the first index value is used to indicate attribute information of each unit region, and comprises: the unit intensity of each unit region.

24. The apparatus according to claim 15, characterized in that the second index value is used to indicate attribute information of a region in the web page, and comprises: a feature vector value.

25. An electronic device, characterized by comprising:

One or more processors;

A storage device, configured to store one or more programs,

Wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-12.

26. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-12.