CN103425765A - Method and device for extracting webpage text and method and system for webpage preview

Publication number: CN103425765A
Application number: CN2013103395554A
Authority: China (CN)
Language: Chinese (zh)
Inventors: 梁捷, 赵闯
Original and current assignee: Ucweb Inc
Legal status: Pending
Keywords: web page, character, webpage, character string, setting

Abstract

The invention provides a method for extracting webpage text. The method comprises the following steps: extracting the data of the webpage main body part; screening the characters related to the webpage text from that data; splitting those characters and removing HTML tags from them to obtain an array of line strings; scanning line by line from the first line of the string array according to a set line block size; and outputting the characters in the set line block when the number of characters scanned into it is greater than or equal to a set character-count threshold. The invention further provides a device for extracting webpage text, which can quickly extract the webpage text content and lowers system memory usage. The invention further provides a method and system for webpage preview, which can improve webpage display speed and reduce the waiting time when requesting a webpage.

Description

Method and device for extracting web page text, and method and system for web page preview
Technical field
The present invention relates to the field of mobile communication technology and, more specifically, to a method and device for extracting web page text and to a method and system for web page preview.
Background
With the development of the Internet, the number of web pages grows by the day, and more and more people obtain information by requesting web pages. However, web pages also carry a great deal of interfering data that hinders access, while the information the requester actually wants is in the web page text. A technique for extracting web page text is therefore urgently needed to help requesters obtain the text data of a web page efficiently.
Traditional web page text extraction requires the browser kernel to build a DOM (Document Object Model) tree and then uses a JavaScript engine to parse it and extract the text. A typical extraction process is roughly as follows: first locate specific tag items in the web page's HTML document and use them to represent the HTML document as a DOM tree; then have the JavaScript engine parse the DOM tree and extract the valid node data to obtain the text. As a result, traditional web page text extraction is very slow.
Summary of the invention
In view of the slowness of traditional web page text extraction described above, the present invention proposes a method and device for extracting web page text that can quickly extract the text content of a web page and reduce system memory usage.
According to one aspect of the present invention, a method for extracting web page text is provided, comprising the following steps:
Extracting the data of the web page main body block;
Screening the characters related to the web page text from the data of the web page main body block;
Removing HTML tags from, and splitting, the characters related to the web page text to obtain an array of line strings;
Scanning line by line, starting from the first line of the string array, according to a set line block size;
When the number of characters scanned into the set line block is greater than or equal to a set character-count threshold, outputting the characters in the set line block.
According to another aspect of the present invention, a device for extracting web page text is provided, comprising:
An extraction unit, configured to extract the data of the web page main body block;
A screening unit, configured to screen the characters related to the web page text from the data of the web page main body block;
A standardization unit, configured to remove HTML tags from, and split, the characters related to the web page text to obtain an array of line strings;
A scanning unit, configured to scan line by line, starting from the first line of the string array, according to a set line block size;
An output unit, configured to output the characters in the set line block when the number of characters scanned into it is greater than or equal to a set character-count threshold.
With the above method and device, after the data of the web page main body block is extracted, the characters are first screened: characters unrelated to the web page text are filtered out, leaving the characters related to the web page text. These characters are then split and stripped of HTML tags to obtain an array of line strings, which is scanned line by line from its first line according to the set line block size; when the number of characters in the line block is greater than or equal to the preset character-count threshold, the characters in the line block are output. No DOM tree needs to be built and no JavaScript engine is needed to parse and extract the text, so extraction is markedly faster and system memory usage is reduced.
In another aspect, the present invention proposes a web page preview method and system that can speed up web page display and reduce the waiting time when requesting a web page.
The present invention proposes a web page preview method, comprising the steps of:
When a web page preview request message is received, decoding the web page main document returned in response to the web page link request to obtain a main document string;
Extracting the web page text from the main document string using the method for extracting web page text as claimed in claim 1;
Displaying the web page text after display formatting.
The present invention also proposes a web page preview system, comprising:
A decoding device, configured to, when a web page preview request message is received, decode the web page main document returned in response to the web page link request to obtain a main document string;
The device for extracting web page text as claimed in claim 6, configured to extract the web page text from the main document string;
A display device, configured to display the web page text after display formatting.
With the above web page preview method and system, the web page main document returned in response to the preview request is decoded into a main document string, and the web page text extraction technique proposed by the present invention is then applied to it. Because text extraction is faster, web page display is faster and the waiting time when requesting a web page is reduced.
To achieve the above and related objects, one or more aspects of the present invention comprise the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set forth certain illustrative aspects of the present invention in detail. These aspects are, however, indicative of only a few of the various ways in which the principles of the present invention may be employed. Furthermore, the present invention is intended to include all such aspects and their equivalents.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description taken with reference to the accompanying drawings, in which:
Fig. 1 shows a flowchart of the method of the present invention;
Figs. 2A to 2C and Figs. 3A to 3B are schematic diagrams of line-by-line scanning with the set line block according to the method of the present invention;
Fig. 4 shows a block diagram of the device of the present invention;
Fig. 5A shows a block diagram of the standardization unit in the device of the present invention;
Fig. 5B and Fig. 5C show block diagrams of the scanning unit in the device of the present invention;
Fig. 6 shows another block diagram of the device of the present invention;
Fig. 7 shows a block diagram of the web page preview system proposed by the present invention;
Fig. 8 shows another block diagram of the web page preview system proposed by the present invention.
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions.
Detailed description
Various aspects of the present disclosure are described below. It should be understood that the teachings herein may be embodied in a wide variety of forms and that any specific structure, function, or both disclosed herein is merely representative. Based on the teachings herein, those skilled in the art should appreciate that an aspect disclosed herein may be implemented independently of any other aspect and that two or more of these aspects may be combined in various ways. For example, a device may be implemented or a method practiced using any number of the aspects set forth herein. In addition, such a device may be implemented or such a method practiced using other structure or functionality in addition to, or other than, one or more of the aspects set forth herein. Furthermore, any aspect described herein may comprise at least one element of a claim.
Embodiments of the present invention are described below with reference to the accompanying drawings.
The method for extracting web page text proposed by the present invention, whose flow is shown in Fig. 1, comprises the following steps:
S1. Extract the data of the web page main body block;
Specifically, this comprises: obtaining the web page according to the input web page link request and dividing the web page into two region blocks, head and body. The body region block is the web page main body block, and the corresponding data is extracted from it.
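Step S1 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `extract_body` and the regular expressions are hypothetical and not taken from the patent, and the body region is located by plain string matching rather than by building a DOM tree.

```python
import re

# Patterns for the opening and closing body tags (case-insensitive).
_BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def extract_body(html):
    """Return the markup between <body ...> and </body>.

    Falls back to the whole input when no opening body tag is found
    (a choice made for this sketch, not stated in the source).
    """
    opening = _BODY_OPEN.search(html)
    if opening is None:
        return html
    start = opening.end()
    closing = _BODY_CLOSE.search(html, start)
    end = closing.start() if closing else len(html)
    return html[start:end]
```

Everything before the opening body tag, including the head region, is simply skipped, which matches the head/body split described above.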
S2. Screen the characters related to the web page text from the data of the web page main body block;
HyperText Markup Language (HTML) is the basic language of web pages. The structure of a web page can usually be expressed as follows:
<html><head>
Web page title and other information unrelated to the web page title
</head><body>
Text title, text content, and other information unrelated to the text title and text content
</body></html>
After the data of the web page main body block is extracted, data screening is performed: data unrelated to the web page text is deleted, useless tags are removed, and characters unrelated to the text content, such as those representing JavaScript and CSS (Cascading Style Sheets), are removed. Because the information the user needs is in the text inside the web page main body block, this screening step yields the data related to the web page text.
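The screening in step S2 might look like the following sketch. The name `screen_body` and the specific patterns are assumptions for illustration; the patent only says that JavaScript, CSS and other useless tags are removed, so treating HTML comments as "useless" is a guess made here.

```python
import re

# Whole <script> and <style> elements, including their contents.
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_STYLE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
# HTML comments (assumption: counted as data unrelated to the text).
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def screen_body(body):
    """Drop script, style and comment blocks from the body markup."""
    for pattern in (_SCRIPT, _STYLE, _COMMENT):
        body = pattern.sub("", body)
    return body
```

The remaining markup still contains ordinary tags such as paragraphs and links; those are removed later, in step S3.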
S3. Split the characters related to the web page text and remove HTML tags to obtain an array of line strings;
Specifically, this step can be implemented in a first way:
Split the characters related to the web page text at carriage return characters to obtain a number of line strings; delete the blank characters before and after each line string;
Delete the HTML tags in each line string;
Obtain the array of line strings.
This step can also be implemented in a second way:
Delete the HTML tags in the characters related to the web page text;
Split the tag-free characters related to the web page text at carriage return characters to obtain a number of line strings; delete the blank characters before and after each line string;
Obtain the array of line strings.
By comparison, the first way yields more accurate text. If the program uses some other character as its line break, the split is performed at that character instead.
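The two orderings of step S3 can be sketched as below. The function names, the simple tag pattern, and the default line break `"\n"` are assumptions made for this illustration; the `line_break` parameter reflects the remark that another line break character may be in use.

```python
import re

# A deliberately simple tag pattern: "<", anything but ">", then ">".
_TAG = re.compile(r"<[^>]+>")


def normalize_split_first(text, line_break="\n"):
    """First way: split into lines, trim each line, then remove tags."""
    lines = [line.strip() for line in text.split(line_break)]
    return [_TAG.sub("", line) for line in lines]


def normalize_tags_first(text, line_break="\n"):
    """Second way: remove tags from the whole text, then split and trim."""
    stripped = _TAG.sub("", text)
    return [line.strip() for line in stripped.split(line_break)]
```

On simple input the two functions return the same array. They can differ when a tag spans a line break or when whitespace sits between a tag and its content, because trimming and tag removal happen in a different order.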
S4. Scan line by line, starting from the first line of the string array, according to the set line block size;
Specifically, step S4 can be implemented in either of the following modes:
Mode one:
Read the set line block. The set line block is used to store characters, and its attributes include the capacity of the line block. These attributes can be configured in the background before the method of the present invention is carried out, and the line block size can also be adjusted as needed. The line block size affects the accuracy of text extraction to some extent: a larger line block makes extraction faster, but less accurate than a smaller line block.
Starting from line n+1 of the string array, scan line by line according to the set line block size and store the scan result in the set line block; repeat this step until n reaches the last line of the string array. For the first scan, n is 0.
For ease of understanding, this line-by-line scanning is described below with reference to the drawings.
Refer to Figs. 2A to 2C. Starting with Fig. 2A, suppose step S3 yields the 8-line string array shown there. Those skilled in the art will understand that the 8-line array is chosen only for convenience of description; in practice the string array may have more or fewer lines.
Read the set line block; its size is assumed here to be as shown in Fig. 2A. Scanning starts from the first line of the string array, and the scanned characters are stored in the set line block until it can hold no more characters; in Fig. 2A this happens once line 4 has been scanned. The number of effective characters in the set line block is then compared with the set character-count threshold. Effective characters are the characters other than blank characters and space characters (a space character is not the same as a blank character). Because space characters also occupy storage, a full line block does not mean that every character stored in it is an effective character. The threshold can be set as needed. If the number of effective characters in the set line block is greater than or equal to the set character-count threshold, the characters stored in the line block are web page text and step S5 is performed. If it is less than the threshold, the characters stored in the line block are not web page text; they are not output, and the line block is emptied.
Scanning then starts from line 2 of the string array, as shown in Fig. 2B. The scanned characters are stored in the set line block until it can hold no more characters; in Fig. 2B this happens once line 5 has been scanned. The number of effective characters in the set line block is compared with the set character-count threshold. If it is greater than or equal to the threshold, the characters stored in the line block are web page text and step S5 is performed; if it is less than the threshold, the characters stored in the line block are not web page text, they are not output, and the line block is emptied.
Scanning then starts from line 3 of the string array and proceeds in the same way, and so on until the last line of the string array.
In a preferred embodiment, the scanning process ends as soon as a scan reaches the last line of the string array; see Fig. 2C.
Note that in the embodiment of Figs. 2A to 2C, the scan that starts from the first line of the string array already covers lines 2 to 4, yet the scan in Fig. 2B that starts from line 2 still scans lines 2 to 4 again, and so on until the last line of the string array. The purpose is to ensure that the characters stored in the line block are accurate web page text content.
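Mode one can be sketched as an overlapping window over the string array. This is a minimal sketch with assumptions that the patent does not fix: the line block capacity is modeled as a number of lines (`block_lines`), effective characters are counted as everything that is not whitespace, and the names `count_effective` and `scan_overlapping` are hypothetical.

```python
def count_effective(lines):
    """Count characters that are not whitespace (assumed meaning of 'effective')."""
    return sum(1 for line in lines for ch in line if not ch.isspace())


def scan_overlapping(lines, block_lines, threshold, stop_at_last_line=True):
    """Mode one: start a new block at every line of the array.

    Returns (start_index, block) pairs for blocks whose effective
    character count reaches the threshold. start_index is 0-based.
    """
    if block_lines < 1:
        raise ValueError("block_lines must be at least 1")
    hits = []
    for n in range(len(lines)):
        block = lines[n:n + block_lines]  # fill the block from line n+1 onward
        if count_effective(block) >= threshold:
            hits.append((n, block))  # treated as web page text (step S5)
        # Preferred embodiment: stop once a scan has reached the last line.
        if stop_at_last_line and n + block_lines >= len(lines):
            break
    return hits
```

Because consecutive blocks overlap, the same line can appear in several returned blocks; the deduplication step described later in the text deals with that.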
Mode two:
Read the set line block;
Starting from the first line of the string array, scan line by line according to the set line block size, store the scan result in the set line block, and record the scan position;
From the current scan position, scan line by line according to the set line block size, store the scan result in the set line block, and record the scan position; repeat this step until the whole string array has been scanned.
For ease of understanding, this scanning mode is described below with reference to Fig. 3A and Fig. 3B.
Starting with Fig. 3A, suppose step S3 yields the 8-line string array shown there. Those skilled in the art will understand that the 8-line array is chosen only for convenience of description; in practice the string array may have more or fewer lines.
Read the set line block; its size is assumed here to be as shown in Fig. 2A. Scanning starts from the first line of the string array, and the scanned characters are stored in the set line block until it can hold no more characters; in Fig. 3A this happens once line 4 has been scanned. The scan position at that point is recorded, and the number of effective characters in the set line block is compared with the set character-count threshold. Effective characters are the characters other than blank characters and space characters. Because space characters also occupy storage, a full line block does not mean that every character stored in it is an effective character. If the number of effective characters in the set line block is greater than or equal to the set character-count threshold, the characters stored in the line block are web page text and step S5 is performed. If it is less than the threshold, the characters stored in the line block are not web page text; they are not output, and the line block is emptied.
According to the recorded scan position, scanning resumes from line 5 of the string array with the set line block size. The scanned characters are stored in the set line block until it can hold no more characters; the scan position at that point is recorded, and the number of effective characters in the set line block is compared with the set character-count threshold. If it is greater than or equal to the threshold, the characters stored in the line block are web page text and step S5 is performed; if it is less than the threshold, the characters stored in the line block are not web page text, they are not output, and the line block is emptied.
Scanning continues line by line from the most recently recorded scan position until the whole string array has been scanned. As shown in Fig. 3B, once the last line of the string array has been scanned, no further scan is started.
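Mode two differs from mode one only in that each scan resumes at the recorded scan position, so blocks do not overlap. The sketch below makes the same assumptions as before: capacity is counted in lines, effective characters are non-whitespace characters, and all names are hypothetical.

```python
def count_effective(lines):
    """Count characters that are not whitespace (assumed meaning of 'effective')."""
    return sum(1 for line in lines for ch in line if not ch.isspace())


def scan_sequential(lines, block_lines, threshold):
    """Mode two: consecutive, non-overlapping blocks.

    Returns the blocks whose effective character count reaches the threshold.
    """
    if block_lines < 1:
        raise ValueError("block_lines must be at least 1")
    hits = []
    position = 0  # recorded scan position: index of the next unscanned line
    while position < len(lines):
        block = lines[position:position + block_lines]
        if count_effective(block) >= threshold:
            hits.append(block)  # treated as web page text (step S5)
        position += len(block)  # record where this scan stopped
    return hits
```

Each line is visited exactly once here, which is cheaper than mode one but means a run of text lines that straddles a block boundary is judged in two separate pieces.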
S5. When the number of characters in the set line block is greater than or equal to the set character-count threshold, output the characters in the set line block.
In the method for extracting web page text proposed by the present invention, after the data of the web page main body block is extracted, the characters are first screened: characters unrelated to the web page text are filtered out, leaving the characters related to the web page text. These characters are then split and stripped of HTML tags to obtain an array of line strings, which is scanned line by line from its first line according to the set line block size; when the number of characters in the line block is greater than or equal to the preset character-count threshold, the characters in the line block are output. With this processing, no DOM tree needs to be built and no JavaScript engine is needed to parse and extract the text, so extraction is markedly faster and system memory usage is reduced.
In a preferred embodiment, to save storage space and keep the display real-time, the following steps are performed after step S5 outputs the characters in the set line block:
Obtain the scan record;
According to the scan record, delete from the characters in each set line block the characters obtained by repeated scanning;
Save the remaining characters in each set line block.
During character scanning, characters at the same position may be scanned more than once. Deleting these repeatedly scanned characters saves storage space and also ensures that the displayed content is correct.
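One way to picture this deduplication is shown below. The patent does not describe the format of the scan record, so this sketch assumes it is the 0-based start line of each output block, with blocks given in scan order; `merge_blocks` is a hypothetical name.

```python
def merge_blocks(hits):
    """Merge overlapping output blocks so each source line is kept once.

    hits: list of (start_index, block_lines) pairs, where start_index is
    the assumed scan record for that block.
    """
    seen = set()  # indices of source lines already kept
    merged = []
    for start, block in hits:
        for offset, line in enumerate(block):
            index = start + offset
            if index not in seen:
                seen.add(index)
                merged.append(line)
    return merged
```

Deduplicating by line position rather than by line content keeps genuinely repeated sentences in the page intact while still dropping lines that were merely scanned twice.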
The present invention also proposes a device for extracting web page text, whose block diagram is shown in Fig. 4, comprising:
An extraction unit, configured to extract the data of the web page main body block;
A screening unit, configured to screen the characters related to the web page text from the data of the web page main body block;
A standardization unit, configured to split the characters related to the web page text and remove HTML tags from them to obtain an array of line strings;
A scanning unit, configured to scan line by line, starting from the first line of the standardized string array, according to the set line block size;
An output unit, configured to output the characters in the set line block when the number of characters scanned into it is greater than or equal to the set character-count threshold.
In the device for extracting web page text proposed by the present invention, after the data of the web page main body block is extracted, the characters are first screened: characters unrelated to the web page text are filtered out, leaving the characters related to the web page text. These characters are then split and stripped of HTML tags to obtain an array of line strings, which is scanned line by line from its first line according to the set line block size; when the number of characters in the line block is greater than or equal to the preset character-count threshold, the characters in the line block are output. No DOM tree needs to be built and no JavaScript engine is needed to parse and extract the text, so extraction is markedly faster and system memory usage is reduced.
In one embodiment, referring to Fig. 5A, the standardization unit comprises:
A splitting module, configured to split the characters related to the web page text at line feed characters to obtain the line strings;
A first deletion module, configured to delete the blank characters before and after each line string;
A second deletion module, configured to delete the HTML tags in each line string to obtain the string array.
In one embodiment, referring to Fig. 5B, the scanning unit comprises:
A first reading module, configured to read the set line block;
A first scanning module, configured to scan line by line according to the set line block size starting from line n+1 of the string array and store the scan result in the set line block, repeating this process until n reaches the last line of the string array; for the first scan, n is 0.
In one embodiment, referring to Fig. 5C, the scanning unit comprises:
A second reading module, configured to read the set line block;
A second scanning module, configured to scan line by line according to the set line block size starting from the first line of the string array, store the scan result in the set line block, and record the scan position;
The second scanning module is further configured to scan line by line according to the set line block size from the current scan position, store the scan result in the set line block, and record the scan position, repeating this process until the whole string array has been scanned.
In one embodiment, to save storage space and keep the display real-time, and referring to Fig. 6, the device for extracting web page text further comprises:
A first acquiring unit, configured to obtain the scan record;
An optimization unit, configured to delete, according to the scan record, the characters obtained by repeated scanning from the characters in each set line block;
A storage unit, configured to save the remaining characters in each set line block.
During character scanning, characters at the same position may be scanned more than once. Deleting these repeatedly scanned characters saves storage space and also ensures that the displayed content is correct.
In addition, the reading mode of existing browser programs uses the browser kernel's webview data structure when browsing a web page. This data structure is very large, slow to initialize, and memory-hungry in use, and the content extracted by JavaScript must still be laid out and rendered by the browser kernel. This wastes time and makes the user wait, and the page that is finally opened may not even interest the user. To address this problem, the present invention also proposes a web page preview method that gives the user a fast preview mode: without opening the web page link, the user can preview the page text in a preview window and skip pages that are not of interest, avoiding wasted time and data traffic; a page that has no text can be opened directly. Compared with existing implementations of browser reading modes on mobile platforms, this method previews faster, reduces user waiting time, and greatly improves the user experience.
The web page preview method provided by the present invention comprises the steps of:
When a web page preview request message is received, decoding the web page main document returned in response to the web page link request to obtain a main document string;
Extracting the web page text from the main document string using the method for extracting web page text described in the embodiments of Figs. 1 to 3;
Displaying the web page text after display formatting. The display format can be adjusted as needed.
During preview, the web page text is displayed in the preview window.
In one or more embodiments of the present invention, the step of decoding the web page main document returned in response to the preview request to obtain the main document string may comprise the following steps:
Receive a user-initiated web page preview request message;
Determine the display mode of the web page according to the web page preview request message;
When the display mode is determined to be normal display, request the web page according to the received web page link request message, and display it after parsing, layout and rendering;
When the display mode is determined, according to the web page preview request, to be preview mode, request the web page main document from the server according to the web page link request message, and decode the main document returned by the server to obtain the main document string.
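The decoding step could be sketched as follows. A later embodiment in this description states that the encoding is obtained by parsing the response header; this sketch assumes, beyond what the source says, that the relevant header is an HTTP `Content-Type` value carrying a `charset` parameter, and that UTF-8 is a reasonable fallback. Both function names are hypothetical.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip("\"'")
    return default


def decode_main_document(raw, content_type):
    """Decode the main document bytes into a string.

    Undecodable bytes are replaced rather than raising, and an unknown
    charset name falls back to UTF-8 (choices made for this sketch).
    """
    encoding = charset_from_content_type(content_type)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        return raw.decode("utf-8", errors="replace")
```

Working on the decoded string directly is what allows the later extraction steps to run without involving the browser kernel.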
In one or more embodiments of the present invention, the above web page preview method further comprises the step of:
After the web page text is extracted, determining whether the web page text is empty; when the web page text string is determined not to be empty, displaying the web page text after display formatting.
In a more complete embodiment, the webpage preview method proposed by the present invention can comprise the following steps:
1. The browser application layer unit receives a user-initiated webpage link request message;
2. The browser application layer unit determines whether a webpage preview request input by the user has been received. This can be determined from the message indicating whether the preview button built into the browser has been pressed; when the preview button is pressed, the preview request is deemed received. If the request is received, the browser application layer unit notifies the browser kernel unit to cancel the page loading flow, and the browser kernel unit cancels it after receiving the notification; meanwhile, the browser application layer unit initializes the preview window, requests the webpage main document from the server according to the webpage link request message, and proceeds to step 3. If no request is received, the browser application layer unit notifies the browser kernel unit to load the page; after receiving the notification, the browser kernel unit loads the page normally, parses, lays out, and renders the webpage, and displays it, while the browser application layer unit ends the flow;
3. The browser application layer unit receives the content returned by the server, such as the response header and the webpage main document; it obtains the encoding of the main document by parsing the response header, decodes the main document according to that encoding, and stores the decoded main document content in a character string;
4. The text extraction method described in the embodiments of Fig. 1 to Fig. 3B is called on this character string to extract the webpage text;
5. Determine whether the webpage text is empty. If it is empty, the webpage has no body text and is most likely another type of page, such as a navigation page or a page consisting mainly of pictures. In this case, the browser application layer unit notifies the browser kernel unit to load the page; after receiving the notification, the browser kernel unit loads the page normally, parses, lays out, and renders it, and the page is then displayed normally;
6. If the webpage text is not empty, text has been extracted; the extracted webpage text is formatted and then displayed in the preview window.
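The six steps above can be sketched as a single dispatch function. This is a minimal illustration in Python, not the patented implementation; the function name and the four callables passed in (`fetch_main_document`, `extract_text`, `show_preview`, `load_page_normally`) are hypothetical stand-ins for the browser application layer unit's server request and decoding, the extraction method, the preview window, and the browser kernel unit.

```python
def handle_link_request(preview_requested, fetch_main_document,
                        extract_text, show_preview, load_page_normally):
    """Return "preview" or "normal" depending on which path was taken."""
    if not preview_requested:
        # Step 2, no preview request: the kernel loads the page as usual.
        load_page_normally()
        return "normal"

    # Steps 2-3: request the main document and obtain it as a decoded string.
    document = fetch_main_document()

    # Step 4: run the text extraction on the decoded string.
    text = extract_text(document)

    if not text:
        # Step 5: no body text (e.g. a navigation page), fall back to
        # the normal load, parse, layout, and render path.
        load_page_normally()
        return "normal"

    # Step 6: show the extracted text (display formatting omitted here).
    show_preview(text)
    return "preview"
```

Passing the collaborators in as callables keeps the sketch independent of any particular browser architecture and makes the fallback in step 5 easy to exercise in isolation.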
It can be seen that the above processing provides the user with a fast webpage preview mode. The user can preview the page text in the preview window without opening the webpage link, and need not open pages found to be uninteresting, which avoids wasting time and data traffic and also skips the construction of the DOM tree by the browser kernel unit. According to testing, this can save 0.5 to 1 second and a large amount of memory. For example, in tests on links to the home pages of portal sites such as Sina and 163 and on links to Baidu search pages, the method of the present invention took less than 0.3 seconds on average from the user's click to the display of the previewed body content (if the webpage has body content), whereas an extraction method based on building the DOM tree generally took more than 1 second.
In addition, the reading mode of existing browser programs uses the webview data structure of the browser kernel when browsing a page. This data structure is very large, initializes slowly, and occupies considerable memory in use; content extracted by JavaScript must also be laid out and rendered by the browser kernel, which wastes time and keeps the user waiting, and the opened page may not be of interest to the user. To address this problem, the present invention also proposes a webpage preview system that provides the user with a fast webpage preview mode: the user can preview the page text in the preview window without opening the webpage link, and need not open pages found to be uninteresting, which avoids wasting time and data traffic, while pages without body text can be opened directly. Compared with existing implementations of browser reading modes on mobile platforms, this approach previews quickly, reduces user waiting time, and greatly improves the user experience.
The webpage preview system proposed by the present invention, shown in Fig. 7, comprises:
A decoding device, for decoding, when a webpage preview request message is received, the webpage main document returned in response to the webpage link request, to obtain the main-document character string;
The webpage text extraction device described in the embodiments of Fig. 4 to Fig. 6, for extracting the webpage text from the main-document character string;
A display device, for displaying the webpage text after display-format processing.
With the above webpage preview method and system, the webpage main document returned for the webpage preview request is decoded to obtain the main-document character string, and the webpage text extraction technique proposed by the present invention is then applied, which improves the extraction speed of the webpage text. The display speed of the webpage can therefore be improved, and the waiting time when requesting a webpage reduced.
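For orientation, the line-block scan on which the extraction technique relies can be sketched as follows: the tag-free lines are scanned in windows of a set number of lines, and a window is output when its character count reaches a set threshold. This Python sketch is an interpretation, not the patented implementation. The block size of 3 lines, the threshold of 80 characters, the `step` parameter, and the use of a set of line indexes as the scan record are assumptions chosen for illustration; the source does not state concrete values here.

```python
def extract_text(lines, block_size=3, threshold=80, step=1):
    """Scan `lines` in windows of `block_size` rows and keep dense windows.

    step=1 slides the window one row at a time, so windows overlap;
    step=block_size scans consecutive, non-overlapping windows.
    """
    if block_size < 1 or step < 1:
        raise ValueError("block_size and step must be positive")

    emitted = set()   # scan record: indexes of rows already output
    kept = []
    for start in range(0, len(lines), step):
        rows = range(start, min(start + block_size, len(lines)))
        # A window counts as body text when it holds enough characters.
        if sum(len(lines[i]) for i in rows) >= threshold:
            for i in rows:
                # Skip rows an earlier overlapping window already output.
                if i not in emitted:
                    emitted.add(i)
                    if lines[i]:
                        kept.append(lines[i])
    return "\n".join(kept)
```

With overlapping windows, a short line adjacent to dense text (such as a heading) is carried along with it, which is why the scan record is needed to avoid emitting the same row twice.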
In one or more embodiments of the present invention, with reference to Fig. 8, the decoding device comprises a receiving unit, a display mode judging unit, a central processing unit, and a second acquisition unit, wherein:
The receiving unit is used for receiving the webpage preview request message;
The display mode judging unit is used for determining the display mode of the webpage according to the webpage preview request;
When the display mode judging unit determines that the display mode of the webpage is normal display, the central processing unit requests the webpage according to the received webpage link request message and parses, lays out, and renders it; the display device displays the parsed, laid-out, and rendered webpage;
When the display mode judging unit determines that the display mode of the webpage is preview mode, the second acquisition unit requests the webpage main document according to the webpage link request message and decodes it to obtain the main-document character string.
The webpage link request can be received either before or after the webpage preview request.
In one or more embodiments of the present invention, with reference to Fig. 8, the webpage preview system further comprises a character judgment device. The character judgment device determines whether the webpage text is an empty string. When the character judgment device determines that the webpage text is not an empty string, the display device displays the webpage text after display-format processing; when the character judgment device determines that the webpage text is an empty string, the central processing unit requests the webpage according to the webpage link request message, and the webpage is displayed after being parsed, laid out, and rendered.
The present invention can be implemented in the Objective-C language. For better support, the open-source library RegexKitLite (written in Objective-C and built on the system ICU library) can be introduced to handle regular-expression parsing.
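The kind of regular-expression work delegated to such a library, namely isolating the body, discarding non-text elements, cutting at line feeds, and stripping tags, can be illustrated with the standard `re` module of Python. The patterns and the function name `to_line_array` below are assumptions; the source does not list the expressions it uses, and this sketch does not handle a tag that spans more than one line.

```python
import re

# Hypothetical patterns, for illustration only.
BODY = re.compile(r"<body\b[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
NOISE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def to_line_array(html):
    """Return one trimmed, tag-free string per source line of the body."""
    match = BODY.search(html)
    body = match.group(1) if match else html   # fall back to the whole input
    body = NOISE.sub("", body)                 # drop scripts, styles, comments
    lines = []
    for raw in body.split("\n"):               # cut at line feeds
        line = TAG.sub("", raw.strip())        # trim, then remove tags
        lines.append(line.strip())             # trim again after tag removal
    return lines
```

Empty strings are kept in the result so that line positions are preserved for a later line-by-line scan.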
The webpage text extraction method and device and the webpage preview method and system proposed by the present invention can be applied to mobile terminals.
Typically, the mobile terminal can be any of various handheld terminal devices with a Bluetooth function, for example a mobile phone or a personal digital assistant (PDA) with a Bluetooth function.
In addition, the method according to the present invention can also be implemented as a computer program that is executed by a processor (such as a CPU) in the mobile terminal and stored in the memory of the mobile terminal. When the computer program is executed by the processor, it performs the functions defined in the method of the present invention.
In addition, the method according to the present invention can also be implemented as a computer program product comprising a computer-readable medium on which is stored a computer program for performing the functions defined in the method of the present invention.
In addition, the above method steps and system units can also be implemented using a controller and a computer-readable storage device that stores a computer program causing the controller to implement the above steps or unit functions.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, the various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functions. Whether such functions are implemented as software or as hardware depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functions in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Although the foregoing disclosure shows exemplary embodiments of the present invention, it should be noted that various changes and modifications can be made without departing from the scope of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the embodiments described herein need not be performed in any particular order. Furthermore, although elements of the present invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Although the embodiments of the present invention have been described above with reference to the figures, those skilled in the art will understand that various improvements can be made to the embodiments proposed above without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.

Claims (16)

1. A method for extracting webpage text, comprising the following steps:
Extracting data of the webpage main body part;
Screening characters related to the webpage text from the data of the webpage main body part;
Performing cutting processing and HTML tag removal on the characters related to the webpage text to obtain a character string array of lines;
Scanning line by line from the first line of the character string array according to a set line block size;
When the number of characters scanned in the set line block is greater than or equal to a set character-count threshold, outputting the characters in the set line block.
2. The method for extracting webpage text according to claim 1, wherein the step of performing cutting processing and HTML tag removal on the characters related to the webpage text to obtain the character string array of lines comprises:
Cutting the characters related to the webpage text at line feed characters to obtain several lines of character strings; deleting the blank characters before and after each line of character string;
Deleting the HTML tags in each line of character string;
Obtaining the character string array of lines.
3. The method for extracting webpage text according to claim 1, wherein the step of scanning line by line from the first line of the character string array according to the set line block size comprises:
Reading the set line block;
Scanning line by line according to the set line block size starting from line n+1 of the character string array, and storing the scanning result in the set line block; repeating this step until n is the last line of the character string array; wherein n is 0 for the first scan.
4. The method for extracting webpage text according to claim 1, wherein the step of scanning line by line from the first line of the character string array according to the set line block size comprises:
Reading the set line block;
Scanning line by line according to the set line block size starting from the first line of the character string array, storing the scanning result in the set line block, and recording the scanning position;
Scanning line by line according to the set line block size from the current scanning position, storing the scanning result in the set line block, and recording the scanning position; repeating this step until the character string array has been completely scanned.
5. The method for extracting webpage text according to any one of claims 1 to 4, further comprising, after the step of outputting the characters in the set line block:
Obtaining scan records;
Deleting, according to the scan records, the characters in each set line block that were obtained by repeated scanning;
Saving the remaining characters in each set line block.
6. A device for extracting webpage text, comprising:
An extraction unit, for extracting data of the webpage main body part;
A screening unit, for screening characters related to the webpage text from the data of the webpage main body part;
A standardization unit, for performing cutting processing and HTML tag removal on the characters related to the webpage text to obtain a character string array of lines;
A scanning unit, for scanning line by line from the first line of the character string array according to a set line block size;
An output unit, for outputting the characters in the set line block when the number of characters scanned in the set line block is greater than or equal to a set character-count threshold.
7. The device for extracting webpage text according to claim 6, wherein the standardization unit comprises:
A cutting module, for cutting the characters related to the webpage text at line feed characters to obtain lines of character strings;
A first deletion module, for deleting the blank characters before and after each line of character string;
A second deletion module, for deleting the HTML tags in each line of character string to obtain the character string array.
8. The device for extracting webpage text according to claim 6, wherein the scanning unit comprises:
A first reading module, for reading the set line block;
A first scanning module, for scanning line by line according to the set line block size starting from line n+1 of the character string array, and storing the scanning result in the set line block; this process is repeated until n is the last line of the character string array; wherein n is 0 for the first scan.
9. The device for extracting webpage text according to claim 6, wherein the scanning unit comprises:
A second reading module, for reading the set line block;
A second scanning module, for scanning line by line according to the set line block size starting from the first line of the character string array, storing the scanning result in the set line block, and recording the scanning position;
The second scanning module further scans line by line according to the set line block size from the current scanning position, stores the scanning result in the set line block, and records the scanning position; this process is repeated until the character string array has been completely scanned.
10. The device for extracting webpage text according to any one of claims 6 to 9, further comprising:
A first acquisition unit, for obtaining scan records;
An optimization unit, for deleting, according to the scan records, the characters in each set line block that were obtained by repeated scanning;
A storage unit, for saving the remaining characters in each set line block.
11. A webpage preview method, comprising the steps of:
When a webpage preview request message is received, decoding the webpage main document returned in response to the webpage link request to obtain a main-document character string;
Extracting webpage text from the main-document character string using the method for extracting webpage text according to claim 1;
Displaying the webpage text after display-format processing.
12. The webpage preview method according to claim 11, wherein the step of decoding, when the webpage preview request message is received, the webpage main document returned in response to the webpage link request to obtain the main-document character string comprises:
Receiving the webpage preview request message;
Determining the display mode of the webpage according to the webpage preview request message;
When the display mode of the webpage is determined to be normal display, requesting the webpage according to the received webpage link request message, and parsing, laying out, and rendering the webpage before displaying it;
When the display mode of the webpage is determined to be preview mode, requesting the webpage main document according to the webpage link request message, and decoding the webpage main document to obtain the main-document character string.
13. The webpage preview method according to claim 12, further comprising the step of:
Determining whether the webpage text is an empty string; when the webpage text is determined not to be an empty string, displaying the webpage text; when the webpage text is determined to be an empty string, requesting the webpage according to the webpage link request message, and parsing, laying out, and rendering the webpage before displaying it.
14. A webpage preview system, comprising:
A decoding device, for decoding, when a webpage preview request message is received, the webpage main document returned in response to the webpage link request, to obtain a main-document character string;
The device for extracting webpage text according to claim 6, for extracting webpage text from the main-document character string;
A display device, for displaying the webpage text after display-format processing.
15. The webpage preview system according to claim 14, wherein the decoding device comprises a receiving unit, a display mode judging unit, a central processing unit, and a second acquisition unit, wherein:
The receiving unit is used for receiving the webpage preview request message;
The display mode judging unit is used for determining the display mode of the webpage according to the webpage preview request;
When the display mode judging unit determines that the display mode of the webpage is normal display, the central processing unit requests the webpage according to the received webpage link request message and parses, lays out, and renders it; the display device displays the parsed, laid-out, and rendered webpage;
When the display mode judging unit determines that the display mode of the webpage is preview mode, the second acquisition unit requests the webpage main document according to the webpage link request message and decodes the webpage main document to obtain the main-document character string.
16. The webpage preview system according to claim 15, further comprising a character judgment device, wherein the character judgment device determines whether the webpage text is an empty string; when the character judgment device determines that the webpage text is not an empty string, the display device displays the webpage text after display-format processing; when the character judgment device determines that the webpage text is an empty string, the central processing unit requests the webpage according to the webpage link request message, and the webpage is displayed after being parsed, laid out, and rendered.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103395554A CN103425765A (en) 2013-08-06 2013-08-06 Method and device for extracting webpage text and method and system for webpage preview

Publications (1)

Publication Number Publication Date
CN103425765A true CN103425765A (en) 2013-12-04

Family

ID=49650504

Country Status (1)

Country Link
CN (1) CN103425765A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085468A1 (en) * 2002-07-18 2006-04-20 Xerox Corporation Method for automatic wrapper repair
CN102200971A (en) * 2010-03-22 2011-09-28 腾讯科技(深圳)有限公司 Method and equipment for realizing webpage content previewing
CN102457817A (en) * 2010-10-15 2012-05-16 北大方正集团有限公司 Method and system for extracting news contents from mobile phone newspaper
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device
CN102779169A (en) * 2012-06-27 2012-11-14 江苏新瑞峰信息科技有限公司 Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label
CN102929871A (en) * 2011-08-08 2013-02-13 腾讯科技(深圳)有限公司 Webpage browsing method and device and mobile terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡金栋: ""网页正文提取及去重技术研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104836779A (en) * 2014-02-12 2015-08-12 携程计算机技术(上海)有限公司 XSS vulnerability detection method, system and Web server
CN104836779B (en) * 2014-02-12 2019-07-26 上海携程商务有限公司 XSS leak detection method, system and Web server
WO2015165324A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage text extraction method and device, and webpage advertisement handling method and device
CN105335382A (en) * 2014-06-27 2016-02-17 优视科技有限公司 Webpage text extraction method and device
CN105335382B (en) * 2014-06-27 2018-11-16 优视科技有限公司 The extracting method and device of Web page text
CN105574004B (en) * 2014-10-10 2019-06-21 阿里巴巴集团控股有限公司 A kind of removing duplicate webpages method and apparatus
CN105574004A (en) * 2014-10-10 2016-05-11 阿里巴巴集团控股有限公司 Webpage deduplication method and device
CN105022803B (en) * 2015-07-01 2018-05-15 广州市万隆证券咨询顾问有限公司 A kind of method and system for extracting Web page text content
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN105512225A (en) * 2015-11-30 2016-04-20 北大方正集团有限公司 Method and device extracting main content from webpage
CN105740355B (en) * 2016-01-26 2019-03-26 中国人民解放军国防科学技术大学 Webpage context extraction method and device based on aggregation text density
CN105740355A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Aggregated text density based webpage body text extraction method and apparatus
CN105868363A (en) * 2016-03-29 2016-08-17 中国农业银行股份有限公司 Webpage page text extraction method and system based on fuzzy logic
CN105868363B (en) * 2016-03-29 2018-12-14 中国农业银行股份有限公司 A kind of Webpage text extracting method and system based on fuzzy logic
CN106776886A (en) * 2016-11-29 2017-05-31 中国农业银行股份有限公司 A kind of Webpage body matter abstracting method and device
CN106776886B (en) * 2016-11-29 2019-09-24 中国农业银行股份有限公司 A kind of Webpage body matter abstracting method and device
CN108829648A (en) * 2018-05-30 2018-11-16 北京小度信息科技有限公司 The conversion method and device of Web markup language
CN108984692A (en) * 2018-07-04 2018-12-11 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of webpage, storage medium, electronic device
CN111931113A (en) * 2020-09-16 2020-11-13 深圳壹账通智能科技有限公司 Data cleaning method and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083, Haidian District, Beijing, Fu Cheng Road, No. 28 excellent building, block A, floor 12

Applicant after: Excelle View Technology Co., Ltd.

Address before: 100080 Beijing City, Haidian District Suzhou Street No. 29 building 16 room 10-20 Scandinavia

Applicant before: Excelle View Technology Co., Ltd.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20131204
