CN109271598A - Method, apparatus and storage medium for extracting news web page content
- Publication number
- CN109271598A (application CN201810863031.8A)
- Authority
- CN
- China
- Prior art keywords
- web page
- paragraph
- page content
- news web
- source code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present invention discloses a method, apparatus and storage medium for extracting news web page content, and relates to the technical field of news web page content extraction. The method comprises: obtaining the web page HTML code, linearly reconstructing the web page HTML, removing HTML noise tags, filtering and partitioning the data set, absorbing pseudo-noise paragraphs, and generating body-text paragraphs. Linear reconstruction flattens the mutually nested, tree-structured div tags into a linear sequence; in a linear structure each div tag is easy to locate, and nested tags no longer affect subsequent steps. Removing HTML noise tags reduces the influence of noise text on paragraph clustering. Filtering and partitioning the data set further reduces the influence of noise on body-text paragraphs. Absorbing pseudo-noise paragraphs improves the recall of body-text paragraphs. The method overcomes the drawback of writing site-specific crawlers for specific websites and makes news web page content extraction more general; compared with the prior art, it extracts news content accurately and efficiently.
Description
Technical field
The present invention relates to the technical field of news web page content extraction, and in particular to a method, apparatus and storage medium for extracting news web page content.
Background
In the news domain, extracting news web page content is a core step: the accuracy with which the news body text, publication time and title are extracted directly affects the quality of news search and the user experience. In finance, accurate extraction of news web pages is also key to quantitative trading: news content is analyzed with natural language processing techniques, and the results are used for analyzing economic behavior. How to extract news web page content is therefore the key problem addressed by the present invention.
Current methods for extracting news web page content are varied and fall into two broad classes: extraction based on template rules and extraction not based on templates.
In template-rule-based extraction, the position of the HTML tags containing the body text (the page layout) differs across the major news websites, and sometimes differs even between URLs of the same website. Different websites therefore require different templates, and building the templates is a huge amount of work.
Non-template extraction includes methods based on blocks, on tag windows, and on logical lines and maximum acceptance distance. However, these algorithms have complex rules and low performance, and are unsuitable for extracting news web pages from a large number of websites.
A general, efficient and accurate method for extracting news web pages is therefore needed.
Summary of the invention
The present invention provides a method, apparatus and storage medium for extracting news web page content, solving the problem that different websites require different rule templates to extract news content.
To achieve the above object, the present invention proposes a method for extracting news web page content, comprising the following steps:
Linearly reconstructing the HTML source code of the target news web page;
Extracting text fragments from the HTML source code and filtering and partitioning the raw data set;
Clustering body-text paragraphs;
Absorbing pseudo-noise paragraphs;
Generating body-text thread paragraphs.
Preferably, before the step of linearly reconstructing the HTML source code of the target news web page, the method further comprises:
Obtaining the web page HTML source code.
Preferably, between the step of linearly reconstructing the HTML source code of the target news web page and the step of extracting text fragments from the HTML source code and filtering and partitioning the raw data set, the method further comprises:
Denoising the linearly reconstructed HTML source code.
Preferably, linearly reconstructing the HTML source code of the target news web page specifically comprises: removing the nesting formed by the <body> and <div> tags in the source code of the target news web page and performing linear reconstruction, to obtain linearly reconstructed web page HTML source code.
Preferably, extracting text fragments from the HTML source code and filtering and partitioning the raw data set specifically comprises:
Extracting text fragments in paragraph order;
Determining the set to which each extracted text fragment belongs according to the number of punctuation marks it contains.
Preferably, determining the set according to the number of punctuation marks in the extracted text fragment specifically comprises:
If the number of punctuation marks in the text fragment is greater than or equal to a threshold, assigning it to the clustering paragraph set;
If the number of punctuation marks in the text fragment is less than the threshold, assigning it to the absorption paragraph set.
Preferably, clustering body-text paragraphs specifically comprises:
Treating each paragraph in the clustering paragraph set as an independent unit and clustering according to web page tags.
Preferably, absorbing pseudo-noise paragraphs specifically comprises:
According to the set threshold on the number of punctuation marks and the set threshold on the distance to the first paragraph of the absorbing body text, absorbing the noise paragraphs located before, within and after the body text respectively.
The present invention also proposes an apparatus for extracting news web page content, comprising:
A processor;
A memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content.
The present invention also proposes a computer-readable storage medium storing an application program of the method for extracting news web page content, the application program implementing the steps of the method for extracting news web page content as described above.
The present invention proposes a method, apparatus and storage medium for extracting news web page content. News web page content is extracted by means of web page punctuation marks and clustering on web page tags, comprising: obtaining the web page HTML code, linearly reconstructing the web page HTML, removing HTML noise tags, filtering and partitioning the data set, absorbing pseudo-noise paragraphs, and generating body-text paragraphs. Linear reconstruction flattens the mutually nested, tree-structured div tags into a linear sequence; in a linear structure each div tag is easy to locate, and nested tags no longer affect subsequent steps. Removing HTML noise tags reduces the influence of noise text on paragraph clustering. Filtering and partitioning the data set further reduces the influence of noise on body-text paragraphs. Absorbing pseudo-noise paragraphs improves the recall of body-text paragraphs. The method overcomes the drawback of writing site-specific crawlers for specific websites and makes news web page content extraction more general; compared with the prior art, it extracts news content accurately and efficiently.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from the structures shown in these drawings without creative effort.
Fig. 1 is a flowchart of the method for extracting news web page content in an embodiment of the present invention;
Fig. 2 is a flowchart of linear reconstruction in an embodiment of the present invention;
Fig. 3 is a flow diagram of filtering and partitioning the raw data set in an embodiment of the present invention;
Fig. 4 is a structural diagram of the apparatus for extracting news web page content in an embodiment of the present invention;
Fig. 5 is a structural diagram of the computer-readable storage medium in an embodiment of the present invention;
The realization, functional features and advantages of the object of the present invention are further described in connection with the embodiments and with reference to the drawings.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that any directional indications in the embodiments of the present invention (such as up, down, left, right, front, back, ...) serve only to explain the relative positions, motion and the like of the components in a particular pose (as shown in the drawings); if that pose changes, the directional indication changes accordingly.
In addition, descriptions involving "first", "second" and the like in the embodiments of the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. Furthermore, the technical solutions of the embodiments may be combined with one another, but only where those of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as nonexistent and outside the scope of protection claimed by the present invention.
The present invention proposes a method for extracting news web page content;
In a preferred embodiment of the present invention, as shown in Fig. 1, the method comprises the following steps:
S00, obtain the web page HTML source code;
In this embodiment, the web page source code is written in the HTML markup language, and the news body text is likewise made up of HTML markup. When a news page is viewed, the browser usually sends a request to a Web server and receives the server's response. When pages are collected automatically by a program, what is collected is the web page HTML code. How to collect web page HTML code automatically is well known to those skilled in the art and is not described in detail here.
S10, linearly reconstruct the HTML source code of the target news web page;
In this embodiment, to better extract the web page body text, the nesting formed by the <body> and <div> tags in the web page source code is removed and the page is linearly reconstructed, yielding linearly reconstructed web page HTML source code.
The approach is as follows:
As shown in Fig. 2, p1, p2 and p3 are web page contents; the initial state of the web page HTML source code is the first block in Fig. 2;
S101, whenever a start tag <div> is encountered, add an end tag </div> before it; whenever an end tag </div> is encountered, add a start tag <div> after it, as in the second block of Fig. 2. This can be implemented with tools such as HTMLParser or regular expressions.
S102, delete the first end tag and the last start tag, as in the third block of Fig. 2;
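Steps S101 and S102 can be illustrated with a short Python sketch. It is only one possible reading of the two steps, not the patented implementation: the function name `linearize_divs` is hypothetical, the input is assumed to be well-formed HTML whose last div tag is an end tag, and only <div> tags are handled.

```python
import re

# Matches any opening or closing <div> tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def linearize_divs(html: str) -> str:
    """Flatten nested <div> blocks into a flat run of sibling blocks."""

    def expand(match):
        if match.group(1):
            # S101: an end tag gets a new start tag added after it.
            return "</div><div>"
        # S101: a start tag gets an end tag added before it.
        return "</div>" + match.group(0)

    # A single pass, so tags inserted by S101 are not expanded again.
    expanded, count = DIV_TAG.subn(expand, html)
    if count == 0:
        return html

    # S102: delete the first end tag and the last start tag, i.e. the two
    # unbalanced tags that S101 introduced at the edges.
    first_end = expanded.find("</div>")
    expanded = expanded[:first_end] + expanded[first_end + len("</div>"):]
    last_start = expanded.rfind("<div>")
    if last_start != -1:
        expanded = expanded[:last_start] + expanded[last_start + len("<div>"):]
    return expanded


print(linearize_divs("<div>p1<div>p2</div>p3</div>"))
```

For the nested input shown, the three contents p1, p2 and p3 each end up in their own sibling <div> block, which matches the progression of the three blocks described for Fig. 2.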
S20, denoise the linearly reconstructed HTML source code;
In this embodiment, noise is removed from the linearly reconstructed HTML source code to obtain denoised HTML source code; this can be implemented with tools such as HTMLParser or regular expressions. This step reduces the influence of noise text on the news body text. The HTML tags deleted are mainly <script>, <style>, <iframe>, <aside>, <nav> and <footer>. According to HTML authoring conventions, the <script> tag defines client-side scripts; the <style> tag defines style information for the HTML document; the <iframe> tag creates an inline frame containing another document; the <aside> tag defines content aside from the content in which it is placed; the <nav> tag defines a section of navigation links; and the <footer> tag defines the footer of a document or section.
S30, extract text fragments from the HTML source code and filter and partition the raw data set;
In this embodiment, to filter and partition the raw data set, the text fragments in the page are extracted and stored in units of <div> and <table>. Each extracted text fragment is then simply filtered: according to the number of punctuation characters from the Chinese punctuation character set that it contains, it is assigned to one of two paragraph sets, the clustering paragraph set or the absorption paragraph set.
As shown in Fig. 3, the approach is as follows:
S301, extract p1, p2 and p3 in paragraph order using the regular expression (<div>.*?</div>);
S302, if the number of punctuation marks in a text fragment is greater than or equal to the threshold, assign it to the clustering paragraph set; the threshold used here is 6.
S303, if the number of punctuation marks in a text fragment is less than the threshold, assign it to the absorption paragraph set.
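Steps S301 to S303 can be sketched as follows. The threshold of 6 comes from the text; the contents of `CHINESE_PUNCTUATION` and the names `count_punctuation` and `partition_paragraphs` are assumptions made for illustration, since the patent does not list the punctuation character set.

```python
import re

# S301: one paragraph per <div> block of the linearized HTML.
PARAGRAPH_RE = re.compile(r"<div\b[^>]*>(.*?)</div>", re.IGNORECASE | re.DOTALL)

# Assumed subset of the Chinese punctuation character set.
CHINESE_PUNCTUATION = set("，。、；：？！“”‘’（）《》")

THRESHOLD = 6  # value given in step S302


def count_punctuation(text: str) -> int:
    """Count the characters of text that are Chinese punctuation marks."""
    return sum(ch in CHINESE_PUNCTUATION for ch in text)


def partition_paragraphs(linear_html: str, threshold: int = THRESHOLD):
    """Return (cluster_set, absorb_set) as lists of (index, text) pairs."""
    cluster_set, absorb_set = [], []
    for index, match in enumerate(PARAGRAPH_RE.finditer(linear_html)):
        text = match.group(1)
        if count_punctuation(text) >= threshold:
            cluster_set.append((index, text))  # S302
        else:
            absorb_set.append((index, text))   # S303
    return cluster_set, absorb_set


html = (
    "<div>今天，天气很好。我们去公园，然后回家；晚上吃饭、休息。</div>"
    "<div>版权所有。</div>"
)
cluster_set, absorb_set = partition_paragraphs(html)
print(len(cluster_set), len(absorb_set))
```

Keeping the paragraph index alongside the text preserves document order, which the later absorption step relies on when it measures distances between paragraphs.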
S40, cluster body-text paragraphs;
In this embodiment, when clustering body-text paragraphs, advertising information, user comments and website statements are all defined as noise as long as they are not part of the web page body text. To remove noise paragraphs, the web page body text is generated using clustering.
First, some common paragraph-separating tags in HTML, such as the <form> tag, clearly separate the web page body text from advertising information; these tags are used to split the thread paragraph set into smaller paragraph sets.
Second, bottom-up clustering is applied: each paragraph in the body-text paragraph set is treated as an independent unit for clustering. The paragraph with the most punctuation marks is taken as the cluster center, and the tag and tag attributes of that paragraph are learned in an unsupervised manner;
For example, if the first tag in the center paragraph is <p> and, with its attribute, reads <p style=text-indent:2em>, the paragraph yields the vector (<p>, <p style=text-indent:2em>). Based on this feature, paragraphs containing the feature are clustered both before and after the center paragraph.
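The tag-feature clustering can be sketched as below. This is one simplified reading of the step: the sketch keeps every paragraph in the clustering set whose first tag and attributes equal those of the center paragraph, whereas the text leaves open how far the search extends before and after the center. The names `tag_feature` and `cluster_body` and the punctuation set are assumptions for illustration.

```python
import re

# First HTML start tag in a paragraph: group 1 is the tag name.
FIRST_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# Assumed subset of the Chinese punctuation character set.
CHINESE_PUNCTUATION = set("，。、；：？！“”‘’（）《》")


def count_punctuation(text: str) -> int:
    return sum(ch in CHINESE_PUNCTUATION for ch in text)


def tag_feature(paragraph_html: str):
    """Return (bare tag, tag with attributes) for the first tag, or None."""
    match = FIRST_TAG_RE.search(paragraph_html)
    if match is None:
        return None
    return (f"<{match.group(1).lower()}>", match.group(0))


def cluster_body(paragraphs):
    """Keep the (index, html) paragraphs that share the center's feature."""
    if not paragraphs:
        return []
    # The paragraph with the most punctuation marks is the cluster center.
    center = max(paragraphs, key=lambda p: count_punctuation(p[1]))
    feature = tag_feature(center[1])
    return [p for p in paragraphs if tag_feature(p[1]) == feature]


paragraphs = [
    (0, "<p style=text-indent:2em>第一段，内容。</p>"),
    (1, "<span class=ad>广告，广告。</span>"),
    (2, "<p style=text-indent:2em>第二段，内容，更多；结束。</p>"),
]
print([index for index, _ in cluster_body(paragraphs)])
```

In the sample data the third paragraph has the most punctuation and becomes the center, so the advertising paragraph wrapped in a different tag is left out of the cluster.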
S50, absorb pseudo-noise paragraphs;
In this embodiment, according to the set thresholds (there are two parameters here: the number of punctuation marks and the distance to the first paragraph of the absorbing body text), the noise paragraphs located before, within and after the body text are absorbed respectively. In detail:
First, obtain the start index A of the first paragraph and the end index B of the last paragraph of the clustered body text;
Second, push the noise paragraphs whose index is less than start index A onto a stack and pop them one by one; if a paragraph has 3 or more punctuation marks and its distance to index A is less than 5, add it to the clustered body-text thread and update start index A.
Then, enqueue the noise paragraphs whose index is greater than start index A and less than end index B and take them from the queue one by one; if a paragraph has 3 or more punctuation marks, add it to the clustered body-text thread.
Finally, enqueue the noise paragraphs whose index is greater than end index B and take them from the queue one by one; if a paragraph has 3 or more punctuation marks and its distance to index B is less than 5, add it to the clustered body-text thread and update end index B.
S60, generate the body-text thread paragraphs, completing extraction of the news body text.
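The three absorption passes of S50 can be sketched as follows. The thresholds 3 and 5 come from the text; the function name `absorb_pseudo_noise` and the data layout (paragraph indices for the clustered body, and a mapping from noise paragraph index to punctuation count) are assumptions chosen to keep the sketch short.

```python
from collections import deque


def absorb_pseudo_noise(body_indices, noise_punct, min_punct=3, max_distance=5):
    """Return the sorted paragraph indices of the body after absorption.

    body_indices: indices of the clustered body-text paragraphs.
    noise_punct:  {paragraph index: punctuation count} for noise paragraphs.
    """
    if not body_indices:
        return []
    kept = set(body_indices)
    start, end = min(body_indices), max(body_indices)  # indices A and B

    # Before the body: push in document order, pop nearest-to-A first.
    stack = [i for i in sorted(noise_punct) if i < start]
    while stack:
        i = stack.pop()
        if noise_punct[i] >= min_punct and start - i < max_distance:
            kept.add(i)
            start = i  # update start index A

    # Inside the body: only the punctuation threshold applies.
    middle = deque(i for i in sorted(noise_punct) if start < i < end and i not in kept)
    while middle:
        i = middle.popleft()
        if noise_punct[i] >= min_punct:
            kept.add(i)

    # After the body: dequeue in document order, nearest-to-B first.
    after = deque(i for i in sorted(noise_punct) if i > end)
    while after:
        i = after.popleft()
        if noise_punct[i] >= min_punct and i - end < max_distance:
            kept.add(i)
            end = i  # update end index B

    return sorted(kept)


body = [10, 11, 13]
noise = {3: 5, 8: 4, 9: 1, 12: 3, 14: 2, 15: 6, 25: 9}
print(absorb_pseudo_noise(body, noise))
```

Joining the paragraphs at the returned indices in order would then give the body-text thread of S60. Because A and B move outward each time a paragraph is absorbed, a run of short but genuine paragraphs next to the body can be recovered one after another, while distant paragraphs with many punctuation marks, such as the one at index 25 in the sample, stay excluded.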
The present invention also proposes an apparatus for extracting news web page content;
In a preferred embodiment of the present invention, as shown in Fig. 4, the apparatus comprises:
A processor;
A memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content, for example:
S00, obtain the web page HTML source code;
S10, linearly reconstruct the HTML source code of the target news web page;
S20, denoise the linearly reconstructed HTML source code;
S30, extract text fragments from the HTML source code and filter and partition the raw data set;
S40, cluster body-text paragraphs;
S50, absorb pseudo-noise paragraphs;
S60, generate the body-text thread paragraphs, completing extraction of the news body text.
The details of these steps have been set out above and are not repeated here;
In this embodiment, the processor inside the apparatus for extracting news web page content may consist of integrated circuits, for example a single packaged integrated circuit, or several packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor connects the components through various interfaces and lines, and performs the functions of news web page content extraction and processes data by running or executing the programs or units stored in the memory and calling the data stored in the memory;
The memory is installed in the apparatus for extracting news web page content, stores program code and various data, and provides high-speed, automatic access to programs and data during operation. The memory includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The present invention also proposes a computer-readable storage medium;
In a preferred embodiment of the present invention, as shown in Fig. 5;
The computer-readable storage medium stores an application program of the method for extracting news web page content, and the application program implements the steps of the method for extracting news web page content as described above, for example:
S00, obtain the web page HTML source code;
S10, linearly reconstruct the HTML source code of the target news web page;
S20, denoise the linearly reconstructed HTML source code;
S30, extract text fragments from the HTML source code and filter and partition the raw data set;
S40, cluster body-text paragraphs;
S50, absorb pseudo-noise paragraphs;
S60, generate the body-text thread paragraphs, completing extraction of the news body text.
The details of these steps have been set out above and are not repeated here;
In the description of the embodiments of the present invention, it should be noted that any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code comprising one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processing module, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise processing it in a suitable manner, and then stored in a computer memory.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structural transformation made using the contents of the description and drawings under the inventive concept of the present invention, or any direct or indirect application in other related technical fields, falls within the scope of patent protection of the present invention.
Claims (10)
1. A method for extracting news web page content, characterized by comprising the following steps:
Linearly reconstructing the HTML source code of a target news web page;
Extracting text fragments from the HTML source code and filtering and partitioning the raw data set;
Clustering body-text paragraphs;
Absorbing pseudo-noise paragraphs;
Generating body-text thread paragraphs.
2. The method for extracting news web page content according to claim 1, characterized in that, before the step of linearly reconstructing the HTML source code of the target news web page, the method further comprises:
Obtaining the web page HTML source code.
3. The method for extracting news web page content according to claim 1, characterized in that, between the step of linearly reconstructing the HTML source code of the target news web page and the step of extracting text fragments from the HTML source code and filtering and partitioning the raw data set, the method further comprises:
Denoising the linearly reconstructed HTML source code.
4. The method for extracting news web page content according to claim 1, characterized in that linearly reconstructing the HTML source code of the target news web page specifically comprises: removing the nesting formed by the <body> and <div> tags in the source code of the target news web page and performing linear reconstruction, to obtain linearly reconstructed web page HTML source code.
5. The method for extracting news web page content according to claim 1, characterized in that extracting text fragments from the HTML source code and filtering and partitioning the raw data set specifically comprises:
Extracting text fragments in paragraph order;
Determining the set to which each extracted text fragment belongs according to the number of punctuation marks it contains.
6. The method for extracting news web page content according to claim 5, characterized in that determining the set according to the number of punctuation marks in the extracted text fragment specifically comprises:
If the number of punctuation marks in the text fragment is greater than or equal to a threshold, assigning it to the clustering paragraph set;
If the number of punctuation marks in the text fragment is less than the threshold, assigning it to the absorption paragraph set.
7. The method for extracting news web page content according to claim 1, characterized in that clustering body-text paragraphs specifically comprises:
Treating each paragraph in the clustering paragraph set as an independent unit and clustering according to web page tags.
8. The method for extracting news web page content according to claim 1, characterized in that absorbing pseudo-noise paragraphs specifically comprises:
According to the set threshold on the number of punctuation marks and the set threshold on the distance to the first paragraph of the absorbing body text, absorbing the noise paragraphs located before, within and after the body text respectively.
9. An apparatus for extracting news web page content, characterized by comprising:
A processor;
A memory coupled to the processor and storing instructions which, when executed by the processor, implement the steps of the method for extracting news web page content according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an application program of the method for extracting news web page content, and the application program implements the steps of the method for extracting news web page content according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810863031.8A CN109271598B (en) | 2018-08-01 | 2018-08-01 | Method, device and storage medium for extracting news webpage content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810863031.8A CN109271598B (en) | 2018-08-01 | 2018-08-01 | Method, device and storage medium for extracting news webpage content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271598A true CN109271598A (en) | 2019-01-25 |
CN109271598B CN109271598B (en) | 2021-03-12 |
Family
ID=65153215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810863031.8A Active CN109271598B (en) | 2018-08-01 | 2018-08-01 | Method, device and storage medium for extracting news webpage content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271598B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101470728A (en) * | 2007-12-25 | 2009-07-01 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
US20090216739A1 (en) * | 2008-02-22 | 2009-08-27 | Yahoo! Inc. | Boosting extraction accuracy by handling training data bias |
CN102479181A (en) * | 2010-11-22 | 2012-05-30 | 中国电信股份有限公司 | Method and device for extracting webpage text based on DIV (Division) position |
US20130138655A1 (en) * | 2011-11-30 | 2013-05-30 | Microsoft Corporation | Web Knowledge Extraction for Search Task Simplification |
CN104077273A (en) * | 2013-03-27 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Method and device for extracting webpage contents |
CN105630941A (en) * | 2015-12-23 | 2016-06-01 | 成都电科心通捷信科技有限公司 | Statistics and webpage structure based Wen body text content extraction method |
CN106776561A (en) * | 2016-12-20 | 2017-05-31 | 四川长虹电器股份有限公司 | Car networking system body extracting method |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046302A (en) * | 2019-12-30 | 2020-04-21 | 珠海趣印科技有限公司 | Method and device for extracting webpage content |
CN112199499A (en) * | 2020-09-29 | 2021-01-08 | 京东方科技集团股份有限公司 | Text division method, text classification method, device, equipment and storage medium |
CN112328928A (en) * | 2020-11-27 | 2021-02-05 | 山东省计算中心(国家超级计算济南中心) | Text venation extraction method and system based on structure sequence |
CN116028618A (en) * | 2022-12-27 | 2023-04-28 | 百度国际科技(深圳)有限公司 | Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium |
CN116028618B (en) * | 2022-12-27 | 2023-10-27 | 百度国际科技(深圳)有限公司 | Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109271598B (en) | 2021-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271598A (en) | A kind of method, apparatus and storage medium extracting news web page content | |
WO2018000998A1 (en) | Interface generation method, apparatus and system | |
US8819028B2 (en) | System and method for web content extraction | |
US7941420B2 (en) | Method for organizing structurally similar web pages from a web site | |
US10824628B2 (en) | Method, terminal device and storage medium for mining entity description tag | |
WO2022078308A1 (en) | Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium | |
CN103425765A (en) | Method and device for extracting webpage text and method and system for webpage preview | |
CN103279457B (en) | A kind of method and device generating chart based on Excel | |
CN106547895B (en) | Webpage information extraction method and device | |
CN110347390B (en) | Method, storage medium, equipment and system for rapidly generating WEB page | |
CN111737623A (en) | Webpage information extraction method and related equipment | |
CN111797630A (en) | PDF-format-paper-oriented biomedical entity identification method | |
CN108874870A (en) | A kind of data pick-up method, equipment and computer can storage mediums | |
CN107590288A (en) | Method and apparatus for extracting webpage picture and text block | |
WO2015154680A1 (en) | File processing method, device, and network system | |
CN115391711B (en) | Webpage text information extraction method, device, equipment and medium | |
JP2023010805A (en) | Method for training document information extraction model and extracting document information, device, electronic apparatus, storage medium and computer program | |
CN114637505A (en) | Page content extraction method and device | |
CN113282218A (en) | Multi-dimensional report generation method, device, equipment and storage medium | |
CN110610001A (en) | Short text integrity identification method and device, storage medium and computer equipment | |
WO2022215433A1 (en) | Information representation structure analysis device, and information representation structure analysis method | |
US20070283258A1 (en) | User-implemented handwritten content in a recursive browser system | |
CN114386407B (en) | Word segmentation method and device for text | |
JP2004303097A (en) | Partial document extraction program and partial document extraction method of structured document | |
CN116758565B (en) | OCR text restoration method, equipment and storage medium based on decision tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |