CN109271598A - Method, apparatus and storage medium for extracting news web page content

Info

Publication number: CN109271598A
Application number: CN201810863031.8A
Authority: CN
Inventors: 陈贺
Original assignee: Data Horizon (Guangzhou) Technology Co Ltd
Current assignee: Data Horizon (Guangzhou) Technology Co Ltd
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2019-01-25
Anticipated expiration: 2038-08-01
Also published as: CN109271598B

Abstract

The present invention discloses a method, apparatus and storage medium for extracting news web page content, relating to the technical field of news web page content extraction. The method comprises: obtaining the web page HTML code, linearly reconstructing the web page HTML, removing HTML noise tags, filtering and partitioning the data set, absorbing pseudo-noise paragraphs, and generating the body paragraphs. Linear reconstruction flattens the mutually nested, tree-structured div tags into a linear sequence; working on a linear structure makes it easy to locate an individual div tag and removes the effect of nested tags on subsequent steps. Removing HTML noise tags reduces the influence of noise text on paragraph clustering. Filtering and partitioning the data set further reduces the influence of noise on the body paragraphs. Absorbing pseudo-noise paragraphs improves the recall of body paragraphs. The method overcomes the drawback of site-specific crawling and makes the extraction of news web page content more general; compared with the prior art, it extracts news content accurately and efficiently.

Description

Method, apparatus and storage medium for extracting news web page content

Technical field

The present invention relates to the technical field of news web page content extraction, and in particular to a method, apparatus and storage medium for extracting news web page content.

Background art

In the news field, extracting news web page content is a core step; the accuracy with which the body text, publication time and title are extracted directly affects the quality of news search and the user experience. In the financial field, accurate extraction of news web pages is likewise key to quantitative trading: news content is analyzed with natural language processing techniques, and the results are used for analyzing economic behavior. How to extract news web page content is therefore the key problem addressed by the present invention.

At present there are many methods for extracting news web page content. They fall into two broad categories: extraction based on template rules and non-template extraction.

In template-rule-based extraction, the position (page layout) of the HTML tags containing the body text differs from one major news website to another, and even within one website the position of the body text can differ between URLs. Different websites therefore require different templates, and the workload of building templates is enormous.

Non-template news web page extraction includes methods based on blocks, on tag windows, and on logical lines and maximum acceptance distance. However, the rules of these algorithms are complex and their performance is low, so they are unsuitable for extracting news pages from a large number of websites.

Therefore, a method for extracting news web pages that is general, efficient and accurate is needed.

Summary of the invention

The present invention provides a method, apparatus and storage medium for extracting news web page content, solving the problem that different websites require different rule templates to extract news content.

To achieve the above object, the present invention proposes a method for extracting news web page content, comprising the following steps:

performing linear reconstruction on the HTML source code of the target news web page;

extracting text fragments from the HTML source code, and filtering and partitioning them into the raw data sets;

clustering the body paragraphs;

absorbing pseudo-noise paragraphs;

generating the body-text thread paragraphs.

Preferably, before the step of performing linear reconstruction on the HTML source code of the target news web page, the method further comprises:

obtaining the web page HTML source code.

Preferably, between the step of performing linear reconstruction on the HTML source code of the target news web page and the step of extracting text fragments from the HTML source code and filtering and partitioning them into the raw data sets, the method further comprises:

denoising the HTML source code after linear reconstruction.

Preferably, performing linear reconstruction on the HTML source code of the target news web page specifically comprises: removing the nesting of the <body> and <div> tags in the source code of the target news web page by linear reconstruction, to obtain the linearly reconstructed web page HTML source code.

Preferably, extracting text fragments from the HTML source code and filtering and partitioning them into the raw data sets specifically comprises:

extracting the text fragments in paragraph order;

determining the set to which each extracted text fragment belongs according to the number of punctuation marks it contains.

Preferably, determining the set to which each extracted text fragment belongs according to the number of punctuation marks it contains specifically comprises:

if the number of punctuation marks in a text fragment is greater than or equal to a threshold, assigning the fragment to the cluster paragraph set;

if the number of punctuation marks in a text fragment is less than the threshold, assigning the fragment to the absorption paragraph set.

Preferably, clustering the body paragraphs specifically comprises:

treating each paragraph in the cluster paragraph set as an independent unit and clustering according to web page tags.

Preferably, absorbing pseudo-noise paragraphs specifically comprises:

according to a set threshold on the number of punctuation marks and a set threshold on the distance from the first paragraph of the absorbed paragraphs, absorbing the noise paragraphs located before, within and after the body text, respectively.

The present invention also proposes an apparatus for extracting news web page content, comprising:

a processor;

a memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content.

The present invention also proposes a computer-readable storage medium. The computer-readable storage medium stores an application program of the method for extracting news web page content, and the application program implements the steps of the method for extracting news web page content described above.

The present invention proposes a method, apparatus and storage medium for extracting news web page content, in which the content is extracted by clustering on web page punctuation marks and web page tags. The method comprises: obtaining the web page HTML code, linearly reconstructing the web page HTML, removing HTML noise tags, filtering and partitioning the data set, absorbing pseudo-noise paragraphs, and generating the body paragraphs. Linear reconstruction flattens the mutually nested, tree-structured div tags into a linear sequence; working on a linear structure makes it easy to locate an individual div tag and removes the effect of nested tags on subsequent steps. Removing HTML noise tags reduces the influence of noise text on paragraph clustering. Filtering and partitioning the data set further reduces the influence of noise on the body paragraphs. Absorbing pseudo-noise paragraphs improves the recall of body paragraphs. The method overcomes the drawback of site-specific crawling and makes the extraction of news web page content more general; compared with the prior art, it extracts news content accurately and efficiently.

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the structures shown in these drawings without creative effort.

Fig. 1 is a flowchart of the method for extracting news web page content in an embodiment of the present invention;

Fig. 2 is a flowchart of linear reconstruction in an embodiment of the present invention;

Fig. 3 is a schematic flowchart of filtering and partitioning the raw data set in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the apparatus for extracting news web page content in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the computer-readable storage medium in an embodiment of the present invention.

The realization of the object, the functional features and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be understood that where directional indications (such as up, down, left, right, front, rear, and so on) appear in the embodiments of the present invention, they are used only to explain the relative positions, movements and the like of the components in a particular posture (as shown in the drawings); if that posture changes, the directional indication changes accordingly.

In addition, where descriptions such as "first" and "second" appear in the embodiments of the present invention, they are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. Furthermore, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, that combination should be regarded as not existing and as falling outside the scope of protection claimed by the present invention.

The present invention proposes a method for extracting news web page content.

In a preferred embodiment of the present invention, as shown in Fig. 1, the method comprises the following steps:

S00: obtaining the web page HTML source code;

In this embodiment, the web page source code is written in the HTML markup language, and the body text is itself composed of HTML markup. When a user browses a news page, the browser sends a request to the web server and receives the server's response. When pages are collected automatically by a program, what is collected is the web page HTML code. How to collect web page HTML code automatically is well known to those skilled in the art and is therefore not described in detail here.

S10: performing linear reconstruction on the HTML source code of the target news web page;

In this embodiment, to extract the web page body text more effectively, the nesting of the <body> and <div> tags in the web page source code is removed by linear reconstruction, yielding the linearly reconstructed web page HTML source code.

The specific approach is as follows:

As shown in Fig. 2, p1, p2 and p3 are web page contents; the initial state of the web page HTML source code is shown in the first block of Fig. 2;

S101: whenever a start tag <div> is encountered, an end tag </div> is added before it; whenever an end tag </div> is encountered, a start tag <div> is added after it, as shown in the second block of Fig. 2. This can be implemented with tools such as HTMLParser or regular expressions.

S102: the first end tag and the last start tag are deleted, as shown in the third block of Fig. 2;
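Steps S101 and S102 can be sketched with regular expressions, one of the tools the description mentions. This is a minimal illustration rather than the patented implementation; the function name `linearize_divs` and the choice to match tags case-insensitively and with attributes are assumptions.

```python
import re

# Matches either a <div ...> start tag or a </div> end tag.
DIV_TAG = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)
OPEN_DIV = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
CLOSE_DIV = re.compile(r"</div\s*>", re.IGNORECASE)


def linearize_divs(html):
    """Flatten nested <div> elements into a flat sequence (S101, S102)."""

    def pad(match):
        tag = match.group(0)
        if tag.startswith("</"):
            return tag + "<div>"   # S101: add a start tag after each end tag
        return "</div>" + tag      # S101: add an end tag before each start tag

    # One pass over the original tags, so inserted tags are not re-processed.
    padded = DIV_TAG.sub(pad, html)

    # S102: delete the first end tag ...
    first_close = CLOSE_DIV.search(padded)
    if first_close is None:
        return html                # no <div> tags at all
    padded = padded[:first_close.start()] + padded[first_close.end():]

    # ... and the last start tag.
    opens = list(OPEN_DIV.finditer(padded))
    if opens:
        last_open = opens[-1]
        padded = padded[:last_open.start()] + padded[last_open.end():]
    return padded


print(linearize_divs("<div>p1<div>p2</div>p3</div>"))
```

For the nested input shown, p1, p2 and p3 each end up in their own top-level `<div>`, which mirrors the three blocks of Fig. 2. Directly nested tags with no text between them produce empty `<div></div>` pairs, which a later filtering step would discard.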

S20: denoising the HTML source code after linear reconstruction;

In this embodiment, noise is removed from the linear HTML source code obtained above to produce the denoised HTML source code; this can be implemented with tools such as HTMLParser or regular expressions. This step reduces the influence of noise text on the body text. The HTML tags deleted are mainly <script>, <style>, <iframe>, <aside>, <nav> and <footer>. According to the HTML specification, the <script> tag defines a client-side script; the <style> tag defines style information for an HTML document; the <iframe> tag creates an inline frame containing another document; the <aside> tag defines content aside from the content in which it is placed; the <nav> tag defines a section of navigation links; and the <footer> tag defines the footer of a document or section.
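A regular-expression version of this denoising step might look like the sketch below. The tag list is taken from the description; the function name `strip_noise_tags` is hypothetical, and the non-greedy pattern is a simplification that does not handle a noise tag nested inside another tag of the same name.

```python
import re

# Noise tags listed in the description.
NOISE_TAGS = ("script", "style", "iframe", "aside", "nav", "footer")

# One pattern per tag: the start tag, everything up to the nearest
# matching end tag, and the end tag itself.
NOISE_PATTERNS = [
    re.compile(r"<{0}\b[^>]*>.*?</{0}\s*>".format(tag), re.IGNORECASE | re.DOTALL)
    for tag in NOISE_TAGS
]


def strip_noise_tags(html):
    """Remove each noise element together with its content."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    return html


sample = '<div>a<script>var x = 1;</script>b</div><nav><a href="/">home</a></nav>'
print(strip_noise_tags(sample))
```

Removing the whole element, not just the tags, matters here: the text inside `<script>` or `<nav>` would otherwise survive as noise text in later steps.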

S30: extracting text fragments from the HTML source code, and filtering and partitioning them into the raw data sets;

In this embodiment, to filter and partition the raw data set, the text fragments in the page are extracted and stored in units of <div> and <table>. Each text fragment obtained is then given a simple filter: according to the number of punctuation characters from the Chinese punctuation character set that it contains, it is assigned to one of two paragraph sets, the cluster paragraph set or the absorption paragraph set.

As shown in Fig. 3, the specific approach is as follows:

S301: p1, p2 and p3 are extracted in paragraph order using the regular expression (<div>.*?</div>);

S302: if the number of punctuation marks in a text fragment is greater than or equal to the threshold, the fragment is assigned to the cluster paragraph set; the threshold used here is 6.

S303: if the number of punctuation marks in a text fragment is less than the threshold, the fragment is assigned to the absorption paragraph set.
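Steps S301 to S303 can be sketched as follows. The threshold of 6 comes from the description; the contents of `PUNCTUATION` are an assumption, since the description refers to a Chinese punctuation character set without listing it, and the function names are hypothetical. The pattern also accepts attributes on `<div>`, a small extension of the expression given in S301, and `<table>` units are left out for brevity.

```python
import re

# Extension of the regular expression in S301 that tolerates attributes.
DIV_BLOCK = re.compile(r"<div\b[^>]*>(.*?)</div\s*>", re.IGNORECASE | re.DOTALL)

# Assumed Chinese punctuation set; the description does not enumerate it.
PUNCTUATION = set("，。、；：？！“”‘’（）《》")

THRESHOLD = 6  # value given in S302


def count_punctuation(text):
    """Count the characters of text that belong to the punctuation set."""
    return sum(1 for ch in text if ch in PUNCTUATION)


def partition_paragraphs(linear_html, threshold=THRESHOLD):
    """Split fragments into (cluster_set, absorb_set), keeping paragraph order.

    Each entry is (index, fragment), where index is the position of the
    fragment in the linearized page.
    """
    cluster_set, absorb_set = [], []
    for index, match in enumerate(DIV_BLOCK.finditer(linear_html)):
        fragment = match.group(1)
        if count_punctuation(fragment) >= threshold:
            cluster_set.append((index, fragment))   # S302
        else:
            absorb_set.append((index, fragment))    # S303
    return cluster_set, absorb_set


page = "<div>首页</div><div>今天，市场上涨。专家说：“很好”，但是，风险仍在。</div>"
cluster_set, absorb_set = partition_paragraphs(page)
print([i for i, _ in cluster_set], [i for i, _ in absorb_set])
```

Keeping the paragraph index alongside each fragment is a design choice made for this sketch: the later absorption step compares positions, so the original order has to survive the split.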

S40: clustering the body paragraphs;

In this embodiment, the body paragraphs are clustered. Advertising information, comments from internet users, website statements and anything else that is not part of the web page body text are all defined as noise. To remove noise paragraphs, the web page body text is generated using a clustering technique.

First, some common paragraph-separating tags in HTML, such as the <form> tag, clearly separate the web page body text from advertising information; these tags are used to divide the thread paragraph set into smaller paragraph sets.

Second, bottom-up clustering is performed, treating each paragraph in the body paragraph set as an independent unit. The paragraph with the most punctuation marks is taken as the cluster center, and its tag and tag attributes are learned without supervision;

For example, if the first tag in the center paragraph is <p> and, with its attribute, reads <p style=text-indent:2em>, the paragraph yields the feature vector (<p>, <p style=text-indent:2em>). Based on this feature, paragraphs containing the feature are clustered before and after the center paragraph, respectively.
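The clustering around the center paragraph can be illustrated with the sketch below. It makes several assumptions not fixed by the description: paragraphs arrive as (index, fragment) pairs, the punctuation set is a guess at the Chinese punctuation character set, and "containing this feature" is read as "has the same first tag, with the same attributes, as the center paragraph". The function names are hypothetical.

```python
import re

# Assumed Chinese punctuation set; the description does not enumerate it.
PUNCTUATION = set("，。、；：？！“”‘’（）《》")

# First start tag in a fragment: group 1 is the tag name, group 0 the full tag.
FIRST_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def count_punctuation(text):
    return sum(1 for ch in text if ch in PUNCTUATION)


def first_tag_feature(fragment):
    """Return (tag name, full tag with attributes), or None if there is no tag."""
    match = FIRST_TAG.search(fragment)
    if match is None:
        return None
    return (match.group(1).lower(), match.group(0))


def cluster_body(cluster_set):
    """Keep the paragraphs that share the center paragraph's tag feature."""
    if not cluster_set:
        return []
    # The center is the paragraph with the most punctuation marks.
    center = max(cluster_set, key=lambda item: count_punctuation(item[1]))
    feature = first_tag_feature(center[1])
    return [item for item in cluster_set if first_tag_feature(item[1]) == feature]


paragraphs = [
    (0, '<p class="a">甲，乙，丙，丁，戊，己。</p>'),
    (1, '<span>广告：一，二，三，四，五！</span>'),
    (2, '<p class="a">子，丑，寅，卯，辰，巳，午。</p>'),
]
print([index for index, _ in cluster_body(paragraphs)])
```

In the sample, paragraph 2 has the most punctuation and becomes the center, so the `<span>` advertisement is dropped even though it passes the punctuation threshold on its own.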

S50: absorbing pseudo-noise paragraphs;

In this embodiment, according to set thresholds (two parameters: the number of punctuation marks and the distance from the first paragraph of the absorbed paragraphs), noise paragraphs located before, within and after the body text are absorbed respectively, as follows.

First, the start index A of the first paragraph and the end index B of the last paragraph of the clustered body paragraphs are obtained;

Second, the noise paragraphs whose index is less than the start index A are pushed onto a stack, and the paragraph at the top of the stack is taken in turn; if its number of punctuation marks is greater than or equal to 3 and its distance from index A is less than 5, it is absorbed into the clustered body-text thread and the start index A is updated.

Then, the noise paragraphs whose index is greater than the start index A and less than the end index B are placed in a queue and taken in turn; if the number of punctuation marks of a paragraph is greater than or equal to 3, it is absorbed into the clustered body-text thread.

Finally, the noise paragraphs whose index is greater than the end index B are placed in a queue and taken in turn; if the number of punctuation marks of a paragraph is greater than or equal to 3 and its distance from index B is less than 5, it is absorbed into the clustered body-text thread and the end index B is updated.
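The three absorption passes can be sketched as below. The thresholds 3 and 5 come from the description. The rest is assumed for illustration: paragraphs are (index, text) pairs, `candidates` holds every paragraph outside the clustered body, the punctuation set is a guess, and a paragraph that fails the test is simply skipped instead of stopping the pass.

```python
# Assumed Chinese punctuation set; the description does not enumerate it.
PUNCTUATION = set("，。、；：？！“”‘’（）《》")

MIN_PUNCT = 3      # punctuation threshold given in the description
MAX_DISTANCE = 5   # distance threshold given in the description


def count_punctuation(text):
    return sum(1 for ch in text if ch in PUNCTUATION)


def absorb_pseudo_noise(body, candidates):
    """Absorb qualifying noise paragraphs before, within and after the body."""
    if not body:
        return []
    result = dict(body)
    start, end = min(result), max(result)   # start index A, end index B

    # Before the body: a stack, so the paragraph nearest to A is taken first.
    stack = sorted((c for c in candidates if c[0] < start), key=lambda c: c[0])
    while stack:
        index, text = stack.pop()
        if count_punctuation(text) >= MIN_PUNCT and start - index < MAX_DISTANCE:
            result[index] = text
            start = index                    # update A

    # Within the body: only the punctuation threshold applies.
    for index, text in sorted(candidates, key=lambda c: c[0]):
        if start < index < end and count_punctuation(text) >= MIN_PUNCT:
            result[index] = text

    # After the body: a queue, processed in page order.
    queue = sorted((c for c in candidates if c[0] > end), key=lambda c: c[0])
    for index, text in queue:
        if count_punctuation(text) >= MIN_PUNCT and index - end < MAX_DISTANCE:
            result[index] = text
            end = index                      # update B

    return sorted(result.items())


body = [(5, "正文一。"), (6, "正文二。")]
candidates = [
    (0, "导航"),
    (3, "记者，张三，北京报道。"),
    (7, "图片说明"),
    (8, "来源：新华社，编辑：李四。"),
    (20, "版权，所有，保留。"),
]
print([index for index, _ in absorb_pseudo_noise(body, candidates)])
```

Because A and B move as paragraphs are absorbed, a chain of short but punctuated paragraphs next to the body (a byline, a source line) can be recovered one after another, while an isolated paragraph far from the body, such as the copyright line at index 20, stays out.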

S60: generating the body-text thread paragraphs, which completes the extraction of the body text.

The present invention also proposes an apparatus for extracting news web page content.

In a preferred embodiment of the present invention, as shown in Fig. 4,

the apparatus comprises:

a processor;

a memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content, for example:

S00: obtaining the web page HTML source code;

S40: clustering the body paragraphs;

S50: absorbing pseudo-noise paragraphs;

The details of these steps have been set out above and are not repeated here.

In this embodiment, the processor inside the apparatus for extracting news web page content may consist of integrated circuits, for example a single packaged integrated circuit or several packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor connects to the various components through various interfaces and lines, runs or executes the programs or units stored in the memory, and calls the data stored in the memory, so as to perform the various functions of extracting news web page content and to process data.

The memory is installed in the apparatus for extracting news web page content and is used to store program code and various data, providing high-speed, automatic access to programs and data during operation. The memory includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.

The present invention also proposes a computer-readable storage medium.

In a preferred embodiment of the present invention, as shown in Fig. 5,

the computer-readable storage medium stores an application program of the method for extracting news web page content, and the application program implements the steps of the method for extracting news web page content described above, for example:

S00: obtaining the web page HTML source code;

S40: clustering the body paragraphs;

S50: absorbing pseudo-noise paragraphs;

The details of these steps have been set out above and are not repeated here.

In the description of the embodiments of the present invention, it should be noted that any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process, and that the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processing module, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise suitably processing it if necessary, and then stored in a computer memory.

The above is only a preferred embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural transformation made using the content of the description and drawings of the present invention under its inventive concept, or any direct or indirect application in other related technical fields, is included within the scope of patent protection of the present invention.

Claims

1. A method for extracting news web page content, characterized by comprising the following steps:

clustering the body paragraphs;

absorbing pseudo-noise paragraphs;

generating the body-text thread paragraphs.

2. The method for extracting news web page content according to claim 1, characterized in that, before the step of performing linear reconstruction on the HTML source code of the target news web page, the method further comprises:

obtaining the web page HTML source code.

3. The method for extracting news web page content according to claim 1, characterized in that, between the step of performing linear reconstruction on the HTML source code of the target news web page and the step of extracting text fragments from the HTML source code and filtering and partitioning them into the raw data sets, the method further comprises:

denoising the HTML source code after linear reconstruction.

4. The method for extracting news web page content according to claim 1, characterized in that performing linear reconstruction on the HTML source code of the target news web page specifically comprises: removing the nesting of the <body> and <div> tags in the source code of the target news web page by linear reconstruction, to obtain the linearly reconstructed web page HTML source code.

5. The method for extracting news web page content according to claim 1, characterized in that extracting text fragments from the HTML source code and filtering and partitioning them into the raw data sets specifically comprises:

extracting the text fragments in paragraph order;

6. The method for extracting news web page content according to claim 5, characterized in that determining the set to which each extracted text fragment belongs according to the number of punctuation marks it contains specifically comprises:

7. The method for extracting news web page content according to claim 1, characterized in that clustering the body paragraphs specifically comprises:

8. The method for extracting news web page content according to claim 1, characterized in that absorbing pseudo-noise paragraphs specifically comprises:

according to a set threshold on the number of punctuation marks and a set threshold on the distance from the first paragraph of the absorbed paragraphs, absorbing the noise paragraphs located before, within and after the body text, respectively.

9. An apparatus for extracting news web page content, characterized by comprising:

a processor;

a memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content according to any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an application program of the method for extracting news web page content, and the application program implements the steps of the method for extracting news web page content according to any one of claims 1 to 8.