CN108021598A - Page extraction template matching process, device and server - Google Patents
- Publication number
- CN108021598A (CN201610977262.2A)
- Authority
- CN
- China
- Prior art keywords
- target pages
- page
- default
- extraction
- extraction template
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a page extraction template matching method, device and server. The method includes: obtaining the page data of a target page and extracting the feature field from the page data; generating a matching label for the target page according to the feature field; using the matching label to search a default page extraction template library for the default extraction template corresponding to the target page; and marking the target page with a template matching result according to the search result. Because the default extraction template is looked up by a matching label generated from the feature field, the range of default extraction templates that must be considered is greatly reduced, the target page is matched to a default extraction template more quickly and accurately, and the efficiency of the page transcoding service improves.
Description
Technical field
The present invention relates to the technical field of network information, and in particular to a page extraction template matching method, device and server.
Background
With the rapid spread of smart mobile terminals, users often browse website pages on a mobile terminal. At present, however, many of the web pages that major websites provide are still designed for browsers on a PC (personal computer), and relatively few are WAP (Wireless Application Protocol) pages suited to mobile browsers. Because of factors such as the screen size of mobile terminals and mobile data usage, browsing a traditional web page directly in a mobile browser gives a poor user experience. For this reason, when a user browses a website page, the search engine or browser often transcodes the page with a page transcoding technique so that it is adapted for display in the mobile terminal's browser.
In a transcoding service, a server generally uses an extraction template to extract the content of the page the user wants to browse, screens and filters that content, and then lays it out again for display to the user. This makes the page layout better suited to display on a mobile terminal, saves the user's data traffic, and improves page response speed.
The inventors found through research that websites are now very numerous. In a traditional page transcoding scheme, multiple default extraction templates have to be tried in turn on the pages of a newly added website, extracting page content with each one, in order to judge whether an extraction template suits the pages of that website. With this conventional method, matching an extraction template takes a long time and consumes a large amount of server computing resources.
Summary of the invention
In view of this, in order to match website pages to extraction templates more quickly and accurately, an object of the present invention is to provide a page extraction template matching method, applied to a server, the method including:
Obtaining the page data of a target page, and extracting the feature field from the page data.
Generating a matching label for the target page according to the feature field.
Using the matching label to search a default page extraction template library for the default extraction template corresponding to the target page.
Marking the target page with a template matching result according to the search result.
Another object of the present invention is to provide a page extraction template matching device, applied to a server, the device including:
A page feature acquisition module, configured to obtain the page data of a target page and extract the feature field from the page data.
A matching label generation module, configured to generate a matching label for the target page according to the feature field.
A search module, configured to use the matching label to search a default page extraction template library for the default extraction template corresponding to the target page.
A matching result marking module, configured to mark the target page with a template matching result according to the search result.
Another object of the present invention is to provide a server, the server including:
A memory.
A processor. And
A page extraction template matching device, which is installed in the memory and includes one or more software function modules executed by the processor, the device including:
A page feature acquisition module, configured to obtain the page data of a target page and extract the feature field from the page data.
A matching label generation module, configured to generate a matching label for the target page according to the feature field.
A search module, configured to use the matching label to search a default page extraction template library for the default extraction template corresponding to the target page.
A matching result marking module, configured to mark the target page with a template matching result according to the search result.
Compared with the prior art, the present invention has the following beneficial effects:
The page extraction template matching method, device and server provided by the present invention extract the feature field from the page file of a target page, generate a matching label according to the feature field, and use the matching label to look up the default extraction template corresponding to the target page. This greatly reduces the range of default extraction templates that must be considered, matches the target page to a default extraction template more quickly and accurately, and improves the efficiency of the page transcoding service.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and therefore should not be regarded as limiting its scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of the server provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the page extraction template matching method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the sub-steps of step S110 shown in Fig. 2;
Fig. 4 is a functional block diagram of the page extraction template matching device provided by an embodiment of the present invention;
Fig. 5 is a block diagram of the sub-modules included in the page feature acquisition module shown in Fig. 4;
Fig. 6 is a block diagram of the sub-modules included in the matching result marking module shown in Fig. 4.
Reference numerals: 100 - server; 110 - page extraction template matching device; 111 - page feature acquisition module; 1111 - head tag extraction sub-module; 1112 - non-feature field rejection sub-module; 112 - matching label generation module; 1121 - first identification sub-module; 1122 - second identification sub-module; 113 - search module; 114 - matching result marking module; 120 - memory; 130 - processor; 140 - communication unit.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. The components of the embodiments of the present invention, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments of the present invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
In the description of the present invention, it should be noted that the terms "first", "second", "third" and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "installed", "connected" and "coupled" should be understood broadly. For example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; and it may be a direct connection, an indirect connection through an intermediary, or an internal connection between two elements. A person of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the server 100 provided by an embodiment of the present invention. The server 100 includes a page extraction template matching device 110, a memory 120, a processor 130 and a communication unit 140.
The memory 120, the processor 130 and the communication unit 140 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements can be electrically connected to one another through one or more communication buses or signal lines. The page extraction template matching device 110 includes at least one software function module that can be stored in the memory 120 in the form of software or firmware, or built into the operating system (OS) of the server 100. The processor 130 is used to execute the executable modules stored in the memory 120, such as the software function modules and computer programs included in the page extraction template matching device 110.
The memory 120 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and so on. The memory 120 is used to store a program, and the processor 130 executes the program after receiving an execution instruction.
The communication unit 140 is used to establish a communication connection between the server 100 and other terminals for data transmission and interaction.
Referring to Fig. 2, Fig. 2 is a flow chart of the page extraction template matching method applied to the server 100 shown in Fig. 1. The method includes the following steps.
Step S110: obtain the page data of the target page and extract the feature field from the page data.
In general, a website page is an HTML file written according to the HTML specification. The inventors found through research that small and medium-sized websites of the same type and with a similar display structure often have the same or a similar HTML data structure. For websites with the same or a similar HTML data structure, page content can be extracted with the same extraction template.
In addition, HTML files with the same or a similar data structure generally share the same feature field. In this embodiment, therefore, the server 100 extracts the feature field from the HTML file of the target page and uses it to look up the corresponding page extraction template.
Specifically, referring to Fig. 3, step S110 in this embodiment may include sub-step S111 and sub-step S112, which are described in detail below.
Sub-step S111: obtain the HTML page file of the target page and extract the head tag from the HTML page file.
Specifically, an HTML file is generally divided into a head tag part and a body tag part.
The head tag serves to define the scripts that the HTML file references, tell the browser where to find style sheets, provide meta-information, and so on. The head tag describes the attributes and information of the page's HTML file, including the page's display title, its location on the network, and its relationship with other files. For websites with the same or a similar data structure, the head tags of their HTML files generally have the same structure and similar content.
In this embodiment, therefore, the server 100 extracts the head tag of the HTML file and obtains the feature field in the head tag through sub-step S112 below.
Sub-step S112: reject the non-feature fields from the head tag and take the remaining fields as the feature field of the target page.
Specifically, taking a novel website as an example, its head tag is as follows:
<head>
<meta charset="gbk"/>
<title>【Page title】</title>
<meta name="keywords" content="【Page keyword】"/>
<meta name="description" content="【Main contents of webpages】"/>
<meta name="MobileOptimized" content="240"/>
<meta name="applicable-device" content="mobile"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0"/>
<link rel="shortcut icon" href="/favicon.ico"/>
<meta http-equiv="Cache-Control" content="max-age=300"/>
<meta http-equiv="Cache-Control" content="no-transform"/>
<link rel="stylesheet" type="text/css" href="/style/style.css"/>
<script src="/wap/qijixs/wap.js"></script>
</head>
The head tag contains several different sub-tags, such as the title sub-tag and the metadata (meta) sub-tags.
The title sub-tag defines the title of the page; the text inside the title sub-tag is shown in the browser's title bar and in the taskbar (such as the Microsoft Windows taskbar). For different websites, the content of the title sub-tag usually differs.
The meta sub-tags provide meta-information about the page, such as the page description and keywords aimed at search engines, and the update frequency. Among the meta sub-tags used to describe the page, the keywords parameter gives the page's keywords and the description parameter gives a brief introduction to the page's main content (such as an abstract). The content values of the keywords and description parameters generally differ between the head tags of different websites.
The inventors found through research that, in the head tags of HTML files with the same or a similar structure, everything is the same apart from the text in the title sub-tag and the content of the keywords and description parameters in the meta sub-tags. In this embodiment, therefore, after extracting the content of the head tag, the server 100 rejects the text in the title sub-tag and the content of the keywords and description parameters in the meta sub-tags, and takes the remainder of the head tag as the feature field.
In this way, through step S110 the server 100 can extract the shared feature field from pages with the same or a similar HTML data structure, as the basis for the lookup and matching in the subsequent steps.
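Sub-steps S111 and S112 can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `HeadFeatureExtractor` and the function `extract_feature_field` are hypothetical names, Python's standard `html.parser` is an arbitrary choice of parser, and blanking the rejected values (rather than deleting the sub-tags) is one possible reading of "rejecting" the non-feature fields.

```python
from html.parser import HTMLParser

# Meta names whose content is page-specific and is therefore rejected
# (the description names the keywords and description parameters).
NON_FEATURE_META = {"keywords", "description"}


class HeadFeatureExtractor(HTMLParser):
    """Re-serializes the content of the head tag without page-specific values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_head = False
        self.in_title = False
        self.parts = []

    def _emit(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() in NON_FEATURE_META:
            attrs["content"] = ""  # reject the page-specific value
        rendered = "".join(f' {k}="{v or ""}"' for k, v in attrs.items())
        self.parts.append(f"<{tag}{rendered}>")

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            self.in_title = tag == "title"
            self._emit(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing sub-tags such as <meta .../> and <link .../>.
        if self.in_head:
            self._emit(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif self.in_head:
            self.in_title = False
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep text inside the head (for example inline script), but not the title text.
        if self.in_head and not self.in_title and data.strip():
            self.parts.append(data.strip())


def extract_feature_field(html: str) -> str:
    """Return the head tag's remaining fields as one string (the feature field)."""
    parser = HeadFeatureExtractor()
    parser.feed(html)
    return "".join(parser.parts)
```

With this sketch, two pages that differ only in title, keywords and description yield the same feature field, while a page that references a different style sheet yields a different one.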
Step S120: generate the matching label of the target page according to the feature field.
Specifically, the server 100 converts the feature field into a corresponding hash value and uses the hash value as the matching label of the target page. In this embodiment, an MD5 operation can be performed on the feature field, the result can be base64-encoded, and the encoded value can be used as the matching label.
In this way, through step S120 the server 100 converts the feature field, whose data volume is relatively large, into a matching label whose data volume is relatively small, which makes it easier for the server 100 to use the matching label in the subsequent steps to look up the default extraction template that matches the target page.
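A minimal sketch of step S120, using Python's standard `hashlib` and `base64` modules. The function name `make_matching_label` is hypothetical, and two details are assumptions because the text does not state them: the feature field is encoded as UTF-8 before hashing, and base64 is applied to the raw 16-byte MD5 digest rather than to its hexadecimal form.

```python
import base64
import hashlib


def make_matching_label(feature_field: str) -> str:
    """Hash the feature field with MD5 and base64-encode the digest."""
    # Assumption: UTF-8 bytes of the feature field are hashed.
    digest = hashlib.md5(feature_field.encode("utf-8")).digest()
    # Assumption: the raw digest is encoded, giving a 24-character label.
    return base64.b64encode(digest).decode("ascii")
```

Whatever the length of the feature field, the label has a fixed, short length, which is what makes the exact comparison in the next step cheap.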
Step S130: use the matching label to search the default page extraction template library for the default extraction template corresponding to the target page.
For the specific HTML data structures of different types of websites, corresponding default extraction templates are stored in the server 100. Each default extraction template carries a template label, which is obtained from the specific HTML data structure; for the specific steps, refer to steps S110 and S120 of this embodiment, which are not repeated here.
The server 100 searches the default page extraction template library in turn for a template label identical to the matching label of the target page, and takes the default extraction template corresponding to that template label as the default extraction template that matches the target page.
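The lookup in step S130 can be pictured with the sketch below. The `ExtractionTemplate` type, its fields and the function `find_template` are hypothetical; the text only says that each default extraction template carries a template label and that the library is searched in turn.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExtractionTemplate:
    template_label: str  # derived from a reference page as in steps S110 and S120
    # Hypothetical representation of the extraction items: item name -> rule.
    extraction_items: Dict[str, str] = field(default_factory=dict)


def find_template(
    library: List[ExtractionTemplate], matching_label: str
) -> Optional[ExtractionTemplate]:
    """Scan the library in turn for a template label equal to the matching label."""
    for template in library:
        if template.template_label == matching_label:
            return template
    return None  # no default extraction template matches the target page
```

Because the comparison is an exact string match, an implementation could presumably also index the library by template label in a dictionary; the description itself only mentions searching in turn.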
Step S140: mark the target page with a template matching result according to the search result.
In this embodiment, the server 100 marks the target page with a matching result according to the search result of step S130, as follows.
If no default extraction template matching the target page is found in the default page extraction template library, the target page is marked as a page with no adapted extraction template. This indicates that the target page has no matching default extraction template and that a new extraction template needs to be created according to the HTML data structure of the target page.
If a default extraction template matching the target page is found in the default page extraction template library, extraction verification is performed on the target page with the default extraction template that was found, and the target page is marked according to the verification result.
Specifically, the default extraction template includes multiple extraction items for extracting different page content. The server 100 extracts page content from the target page with the default extraction template that was found, and verifies whether the target page includes page data corresponding to the multiple extraction items.
If the target page includes page data corresponding to the multiple extraction items, verification passes. The server 100 then marks the target page as a page with an adapted extraction template, and records the correspondence between the domain name of the target page and the default extraction template, so that the next time the target page needs to be displayed, the corresponding default extraction template can be called directly to extract the page content of the target page, which is then rendered and displayed.
If the target page does not include page data corresponding to the multiple extraction items, verification fails, and the server 100 marks the target page as a page with no adapted extraction template.
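The three outcomes of step S140 can be sketched as follows. Everything concrete here is an assumption made for illustration: the function `mark_target_page`, the two result strings, the representation of a template as a dict with a `"label"` and an `"items"` mapping of item name to regular expression, and the rule that every extraction item must yield data for verification to pass.

```python
import re
from urllib.parse import urlparse

NO_ADAPTED_TEMPLATE = "no adapted extraction template"
HAS_ADAPTED_TEMPLATE = "has adapted extraction template"


def mark_target_page(url, html, template, domain_to_template):
    """Return the mark for a target page, following the branches of step S140.

    `template` is None when step S130 found nothing. Otherwise it is a dict
    with a "label" and an "items" mapping of item name -> regex (hypothetical).
    """
    if template is None:
        return NO_ADAPTED_TEMPLATE
    # Extraction verification: each extraction item must find page data.
    for pattern in template["items"].values():
        if re.search(pattern, html, re.DOTALL) is None:
            return NO_ADAPTED_TEMPLATE
    # Verification passed: record domain name -> template for later requests.
    domain_to_template[urlparse(url).netloc] = template["label"]
    return HAS_ADAPTED_TEMPLATE
```

Recording the correspondence by domain name means that later requests for pages of the same site can skip the lookup and go straight to extraction with the recorded template.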
Referring to Fig. 4, this embodiment also provides a page extraction template matching device 110, applied to the server 100. The page extraction template matching device 110 includes a page feature acquisition module 111, a matching label generation module 112, a search module 113 and a matching result marking module 114.
The page feature acquisition module 111 is configured to obtain the page data of the target page and extract the feature field from the page data. In this embodiment, the page feature acquisition module 111 can be used to perform step S110 shown in Fig. 2; for a detailed description of the page feature acquisition module 111, refer to the description of step S110.
Specifically, referring to Fig. 5, the page feature acquisition module 111 includes a head tag extraction sub-module 1111 and a non-feature field rejection sub-module 1112.
The head tag extraction sub-module 1111 is configured to obtain the HTML page file of the target page and extract the head tag from the HTML page file. In this embodiment, the head tag extraction sub-module 1111 can be used to perform sub-step S111 shown in Fig. 3; for a detailed description of the head tag extraction sub-module 1111, refer to the description of sub-step S111.
The non-feature field rejection sub-module 1112 is configured to reject the non-feature fields from the head tag and take the remaining fields as the feature field of the target page.
In this embodiment, the non-feature fields rejected by the non-feature field rejection sub-module 1112 include the text in the title sub-tag of the head tag and the content of the page keyword parameter and page description parameter in the metadata sub-tags. In this embodiment, the non-feature field rejection sub-module 1112 can be used to perform sub-step S112 shown in Fig. 3; for a detailed description of the non-feature field rejection sub-module 1112, refer to the description of sub-step S112.
The matching label generation module 112 is configured to generate the matching label of the target page according to the feature field.
Specifically, the matching label generation module 112 generates the matching label as follows:
The feature field is converted into a corresponding hash value, and the hash value is used as the matching label of the target page. In this embodiment, the matching label generation module 112 can be used to perform step S120 shown in Fig. 2; for a detailed description of the matching label generation module 112, refer to the description of step S120.
The search module 113 is configured to use the matching label to search the default page extraction template library for the default extraction template corresponding to the target page.
Specifically, in this embodiment, each default extraction template carries a template label corresponding to that default extraction template. The search module 113 looks up the default extraction template matching the target page as follows:
The default page extraction template library is searched in turn for a template label identical to the matching label of the target page, and the default extraction template corresponding to that template label is taken as the default extraction template that matches the target page. In this embodiment, the search module 113 can be used to perform step S130 shown in Fig. 2; for a detailed description of the search module 113, refer to the description of step S130.
The matching result marking module 114 is configured to mark the target page with a template matching result according to the search result. In this embodiment, the matching result marking module 114 can be used to perform step S140 shown in Fig. 2; for a detailed description of the matching result marking module 114, refer to the description of step S140.
Specifically, referring to Fig. 6, the matching result marking module 114 may include a first identification sub-module 1121 and a second identification sub-module 1122.
The first identification sub-module 1121 is configured to mark the target page as a page with no adapted extraction template when no default extraction template matching the target page is found in the default page extraction template library.
The second identification sub-module 1122 is configured, when a default extraction template matching the target page is found in the default page extraction template library, to perform extraction verification on the target page with the default extraction template that was found and to mark the target page according to the verification result.
In this embodiment, the default extraction template includes multiple extraction items for extracting different page content. The second identification sub-module 1122 performs the extraction verification and marks the target page according to the verification result as follows:
Page content is extracted from the target page with the default extraction template that was found, to verify whether the target page includes page data corresponding to the multiple extraction items;
If the target page includes page data corresponding to the multiple extraction items, the target page is marked as a page with an adapted extraction template, and the correspondence between the domain name of the target page and the default extraction template is recorded;
If the target page does not include page data corresponding to the multiple extraction items, the target page is marked as a page with no adapted extraction template.
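The way the four modules of the device 110 hand data to one another can be pictured with the sketch below. It is only an illustration of the data flow: the class `PageExtractionTemplateMatcher`, the callable signatures and the stand-in lambdas are hypothetical and deliberately trivial, not the behaviour of the real modules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PageExtractionTemplateMatcher:
    """Hypothetical wiring of modules 111 to 114; every name is illustrative."""

    acquire_features: Callable[[str], str]               # module 111: HTML -> feature field
    generate_label: Callable[[str], str]                 # module 112: feature field -> matching label
    find_template: Callable[[str], Optional[object]]     # module 113: label -> template or None
    mark_result: Callable[[str, Optional[object]], str]  # module 114: (HTML, template) -> mark

    def match(self, html: str) -> str:
        feature_field = self.acquire_features(html)  # step S110
        label = self.generate_label(feature_field)   # step S120
        template = self.find_template(label)         # step S130
        return self.mark_result(html, template)      # step S140


# Stand-in callables, used only to show how the modules hand data to each other.
matcher = PageExtractionTemplateMatcher(
    acquire_features=lambda html: html.split("<body", 1)[0],
    generate_label=lambda feature_field: str(len(feature_field)),
    find_template=lambda label: {"13": "novel-site-template"}.get(label),
    mark_result=lambda html, tpl: "adapted" if tpl else "no adapted template",
)
```

Keeping each module behind a plain callable mirrors the statement that the modules can exist separately or be integrated into a single part.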
In summary, the page extraction template matching method, device and server 100 provided by the present invention extract the feature field from the page file of a target page, generate a matching label according to the feature field, and use the matching label to look up the default extraction template corresponding to the target page. This greatly reduces the range of default extraction templates that must be considered, matches the target page to a default extraction template more quickly and accurately, and improves the efficiency of the page transcoding service.
In the embodiments provided in this application, it should be understood that the disclosed device and method can also be implemented in other ways. The device embodiments described above are only illustrative. For example, the flow charts and block diagrams in the drawings show the possible architectures, functions and operations of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flow chart or block diagram can represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two consecutive blocks can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the function modules in the embodiments of the present invention can be integrated to form an independent part, each module can exist alone, or two or more modules can be integrated to form an independent part.
If the functions are implemented in the form of software function modules and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention; for a person skilled in the art, the invention can have various modifications and changes. Any modification, equivalent replacement, improvement and so on made within the spirit and principles of the invention shall fall within the scope of protection of the present invention.
The above is merely a specific embodiment of the present invention, but the protection scope of the
present invention is not limited thereto. Any change or substitution that a person skilled in the
art could readily conceive within the technical scope disclosed by the present invention shall fall
within the protection scope of the present invention. Therefore, the protection scope of the present
invention shall be subject to the protection scope of the claims.
Claims (20)
- 1. A page extraction template matching method, applied to a server, characterized in that the method comprises: obtaining page data of a target page, and extracting a feature field from the page data; generating a matching label for the target page according to the feature field; searching a default page extraction template library, using the matching label, for a default extraction template corresponding to the target page; and marking the target page with a template matching result according to the search result.
- 2. The method according to claim 1, characterized in that the step of obtaining the page data of the target page and extracting the feature field from the page data comprises: obtaining an HTML page file of the target page, and extracting the head tag from the HTML page file; and removing non-feature fields from the head tag, and taking the remaining fields as the feature field of the target page.
- 3. The method according to claim 2, characterized in that the removed non-feature fields comprise: the text information in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the meta sub-tags.
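The head-tag filtering described in claims 2 and 3 can be sketched as follows. This is only an illustrative sketch built on Python's standard `html.parser`; the names `HeadFeatureParser`, `extract_feature_field`, and `NON_FEATURE_META`, and the choice to serialize the remaining fields with `repr`, are assumptions and do not come from the patent.

```python
from html.parser import HTMLParser

# Assumption: the "page keyword" and "page description" parameters are the
# <meta name="keywords"> and <meta name="description"> tags.
NON_FEATURE_META = {"keywords", "description"}


class HeadFeatureParser(HTMLParser):
    """Collects what is inside <head>, skipping page-specific content."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        if not self.in_head:
            return
        if tag == "title":
            self.in_title = True
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() in NON_FEATURE_META:
            # Keep the parameter name, drop its page-specific content.
            attr_map.pop("content", None)
        self.fields.append((tag, tuple(sorted(attr_map.items()))))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        # Text inside <title> is a non-feature field and is skipped.
        if self.in_head and not self.in_title and text:
            self.fields.append(("#text", text))


def extract_feature_field(html: str) -> str:
    """Return a string form of the head tag with non-feature fields removed."""
    parser = HeadFeatureParser()
    parser.feed(html)
    parser.close()
    return repr(parser.fields)
```

Under these assumptions, two pages that share the same head structure but differ in title text and description content produce the same feature field, while a page that loads a different stylesheet does not.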
- 4. The method according to claim 1, characterized in that the step of generating the matching label of the target page according to the feature field comprises: converting the feature field into a corresponding hash value, and taking the hash value as the matching label of the target page.
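A minimal sketch of the hashing step in claim 4 is given below. The claim does not name a hash function, so the use of SHA-256 here is an assumption, as is the function name `make_matching_label`.

```python
import hashlib


def make_matching_label(feature_field: str) -> str:
    """Convert a feature field into a fixed-length hash used as the matching label.

    SHA-256 is an illustrative choice; any stable hash function would serve.
    """
    return hashlib.sha256(feature_field.encode("utf-8")).hexdigest()
```

Because the hash is deterministic, pages with identical feature fields always receive identical labels, which is what allows the label to be compared directly against stored template labels.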
- 5. The method according to claim 1, characterized in that each default extraction template has a template label corresponding to that default extraction template; and the step of searching the default page extraction template library, using the matching label, for the default extraction template corresponding to the target page comprises: searching the default page extraction template library sequentially for a template label identical to the matching label of the target page, and taking the default extraction template corresponding to that template label as the default extraction template matching the target page.
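The sequential label lookup of claim 5 might look like the sketch below. The `ExtractionTemplate` record and `find_template` function are hypothetical names, and representing the template library as an in-memory sequence is an assumption; the patent does not specify how the library is stored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExtractionTemplate:
    name: str            # hypothetical identifier for the template
    template_label: str  # label stored alongside the template


def find_template(
    library: Sequence[ExtractionTemplate], matching_label: str
) -> Optional[ExtractionTemplate]:
    """Scan the library in order and return the first template whose label
    equals the page's matching label, or None if there is no such template."""
    for template in library:
        if template.template_label == matching_label:
            return template
    return None
```

If the library were large, the same comparison could be served by a dictionary keyed on the template label; the sequential scan is shown because it mirrors the wording of the claim.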
- 6. The method according to claim 1, characterized in that the step of marking the target page with a template matching result according to the search result comprises: if no default extraction template matching the target page is found in the default page extraction template library, marking the target page as a page with no adapted extraction template; and if a default extraction template matching the target page is found in the default page extraction template library, performing extraction verification on the target page with the found default extraction template, and marking the target page accordingly based on the verification result.
- 7. The method according to claim 6, characterized in that the default extraction template comprises a plurality of extraction items for extracting different page content; and the step of performing extraction verification on the target page with the found default extraction template comprises: extracting page content from the target page using the found default extraction template, and verifying whether the target page contains page data corresponding to the plurality of extraction items; if the target page contains page data corresponding to the plurality of extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the default extraction template; and if the target page does not contain page data corresponding to the plurality of extraction items, marking the target page as a page with no adapted extraction template.
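The marking and extraction-verification logic of claims 6 and 7 can be sketched as below. Several details are assumptions made only for illustration: extraction items are modeled as regular expressions with one capture group, verification requires every item to yield non-empty data, and the names `mark_page`, `NO_TEMPLATE`, `HAS_TEMPLATE`, and `domain_map` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlparse

NO_TEMPLATE = "no_adapted_template"
HAS_TEMPLATE = "has_adapted_template"


@dataclass
class ExtractionTemplate:
    name: str
    # Assumed format: item name -> regex with exactly one capture group.
    extraction_items: Dict[str, str]


def mark_page(
    url: str,
    html: str,
    template: Optional[ExtractionTemplate],
    domain_map: Dict[str, str],
) -> str:
    """Return the template matching result mark for one target page."""
    if template is None:
        # Nothing was found in the template library.
        return NO_TEMPLATE
    for pattern in template.extraction_items.values():
        match = re.search(pattern, html, re.DOTALL)
        if match is None or not match.group(1).strip():
            # An extraction item has no corresponding page data.
            return NO_TEMPLATE
    # Verification passed: record the domain name -> template correspondence.
    domain_map[urlparse(url).hostname] = template.name
    return HAS_TEMPLATE
```

Recording the domain-to-template correspondence only after verification succeeds keeps the map free of templates that share a label with the page but cannot actually extract its content.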
- 8. A page extraction template matching device, applied to a server, characterized in that the device comprises: a page feature acquisition module, configured to obtain page data of a target page and extract a feature field from the page data; a matching label generation module, configured to generate a matching label for the target page according to the feature field; a search module, configured to search a default page extraction template library, using the matching label, for a default extraction template corresponding to the target page; and a matching result marking module, configured to mark the target page with a template matching result according to the search result.
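One way to picture the four modules of claim 8 is as pluggable callables wired together in order. The sketch below is a hypothetical arrangement, not the structure disclosed in the patent; the class name `TemplateMatchingDevice` and its field names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TemplateMatchingDevice:
    # Each field stands in for one module named in the claim.
    acquire_features: Callable[[str], str]   # page data -> feature field
    generate_label: Callable[[str], str]     # feature field -> matching label
    search: Callable[[str], Optional[str]]   # matching label -> template name or None
    mark: Callable[[Optional[str]], str]     # search result -> result mark

    def match(self, page_data: str) -> str:
        """Run the four modules in the order given by the claim."""
        feature_field = self.acquire_features(page_data)
        label = self.generate_label(feature_field)
        template = self.search(label)
        return self.mark(template)
```

Keeping the modules as separate callables means each step (feature extraction, label generation, lookup, marking) can be replaced or tested on its own.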
- 9. The device according to claim 8, characterized in that the page feature acquisition module comprises: a head tag extraction submodule, configured to obtain an HTML page file of the target page and extract the head tag from the HTML page file; and a non-feature field removal submodule, configured to remove non-feature fields from the head tag and take the remaining fields as the feature field of the target page.
- 10. The device according to claim 9, characterized in that the non-feature fields removed by the non-feature field removal submodule comprise: the text information in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the meta sub-tags.
- 11. The device according to claim 8, characterized in that the manner in which the matching label generation module generates the matching label comprises: converting the feature field into a corresponding hash value, and taking the hash value as the matching label of the target page.
- 12. The device according to claim 8, characterized in that each default extraction template has a template label corresponding to that default extraction template; and the manner in which the search module searches for the default extraction template matching the target page comprises: searching the default page extraction template library sequentially for a template label identical to the matching label of the target page, and taking the default extraction template corresponding to that template label as the default extraction template matching the target page.
- 13. The device according to claim 8, characterized in that the matching result marking module comprises: a first marking submodule, configured to mark the target page as a page with no adapted extraction template when no default extraction template matching the target page is found in the default page extraction template library; and a second marking submodule, configured to, when a default extraction template matching the target page is found in the default page extraction template library, perform extraction verification on the target page with the found default extraction template and mark the target page accordingly based on the verification result.
- 14. The device according to claim 13, characterized in that the default extraction template comprises a plurality of extraction items for extracting different page content; and the manner in which the second marking submodule performs extraction verification on the target page with the found default extraction template and marks the target page accordingly based on the verification result comprises: extracting page content from the target page using the found default extraction template, and verifying whether the target page contains page data corresponding to the plurality of extraction items; if the target page contains page data corresponding to the plurality of extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the default extraction template; and if the target page does not contain page data corresponding to the plurality of extraction items, marking the target page as a page with no adapted extraction template.
- 15. A server, characterized in that the server comprises: a memory; a processor; and a page extraction template matching device, the device being installed in the memory and comprising one or more software function modules executed by the processor, the device comprising: a page feature acquisition module, configured to obtain page data of a target page and extract a feature field from the page data; a matching label generation module, configured to generate a matching label for the target page according to the feature field; a search module, configured to search a default page extraction template library, using the matching label, for a default extraction template corresponding to the target page; and a matching result marking module, configured to mark the target page with a template matching result according to the search result.
- 16. The server according to claim 15, characterized in that the page feature acquisition module comprises: a head tag extraction submodule, configured to obtain an HTML page file of the target page and extract the head tag from the HTML page file; and a non-feature field removal submodule, configured to remove non-feature fields from the head tag and take the remaining fields as the feature field of the target page.
- 17. The server according to claim 16, characterized in that the non-feature fields removed by the non-feature field removal submodule comprise: the text information in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the meta sub-tags.
- 18. The server according to claim 15, characterized in that the manner in which the matching label generation module generates the matching label comprises: converting the feature field into a corresponding hash value, and taking the hash value as the matching label of the target page.
- 19. The server according to claim 15, characterized in that the matching result marking module comprises: a first marking submodule, configured to mark the target page as a page with no adapted extraction template when no default extraction template matching the target page is found in the default page extraction template library; and a second marking submodule, configured to, when a default extraction template matching the target page is found in the default page extraction template library, perform extraction verification on the target page with the found default extraction template and mark the target page accordingly based on the verification result.
- 20. The server according to claim 19, characterized in that the default extraction template comprises a plurality of extraction items for extracting different page content; and the manner in which the second marking submodule performs extraction verification on the target page with the found default extraction template and marks the target page accordingly based on the verification result comprises: extracting page content from the target page using the found default extraction template, and verifying whether the target page contains page data corresponding to the plurality of extraction items; if the target page contains page data corresponding to the plurality of extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the default extraction template; and if the target page does not contain page data corresponding to the plurality of extraction items, marking the target page as a page with no adapted extraction template.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610977262.2A CN108021598B (en) | 2016-11-04 | 2016-11-04 | Page extraction template matching method and device and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108021598A | 2018-05-11 |
CN108021598B | 2022-05-03 |
Family
ID=62083683
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title
---|---|---|---|---|---
CN201610977262.2A | Active | CN108021598B (en) | 2016-11-04 | 2016-11-04 | Page extraction template matching method and device and server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021598B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062881A (en) * | 2018-07-11 | 2018-12-21 | 政采云有限公司 | Purchase bidding documenting method and system |
CN109582909A (en) * | 2018-12-19 | 2019-04-05 | 拉扎斯网络科技(上海)有限公司 | Webpage automatic generation method, device, electronic equipment and storage medium |
CN110059272A (en) * | 2018-11-02 | 2019-07-26 | 阿里巴巴集团控股有限公司 | Page feature recognition method and device |
CN110457302A (en) * | 2019-07-31 | 2019-11-15 | 河南开合软件技术有限公司 | Intelligent structured data cleaning method |
CN110895463A (en) * | 2018-09-13 | 2020-03-20 | 百度在线网络技术(北京)有限公司 | Label processing method, device, equipment and computer readable storage medium |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111858963A (en) * | 2020-07-28 | 2020-10-30 | 中国银行股份有限公司 | Webpage customer service knowledge extraction method and device |
CN113127766A (en) * | 2019-12-31 | 2021-07-16 | 飞书数字科技(上海)有限公司 | Method and device for acquiring advertisement interest words, storage medium and processor |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101950312A (en) * | 2010-08-18 | 2011-01-19 | 赵清政 | Method for analyzing webpage content of internet |
CN103678509A (en) * | 2013-11-25 | 2014-03-26 | 北京奇虎科技有限公司 | Method and device for generating webpage template |
CN104035753A (en) * | 2013-03-04 | 2014-09-10 | 优视科技有限公司 | Double-WebView customized page display method and system |
CN104866527A (en) * | 2015-04-24 | 2015-08-26 | 美通云动(北京)科技有限公司 | Dynamic webpage template matching method and device |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062881A (en) * | 2018-07-11 | 2018-12-21 | 政采云有限公司 | Purchase bidding documenting method and system |
CN110895463A (en) * | 2018-09-13 | 2020-03-20 | 百度在线网络技术(北京)有限公司 | Label processing method, device, equipment and computer readable storage medium |
CN110059272A (en) * | 2018-11-02 | 2019-07-26 | 阿里巴巴集团控股有限公司 | Page feature recognition method and device |
CN110059272B (en) * | 2018-11-02 | 2023-08-15 | 创新先进技术有限公司 | Page feature recognition method and device |
CN109582909B (en) * | 2018-12-19 | 2021-08-10 | 拉扎斯网络科技(上海)有限公司 | Webpage automatic generation method and device, electronic equipment and storage medium |
CN109582909A (en) * | 2018-12-19 | 2019-04-05 | 拉扎斯网络科技(上海)有限公司 | Webpage automatic generation method, device, electronic equipment and storage medium |
CN110457302B (en) * | 2019-07-31 | 2022-04-29 | 河南开合软件技术有限公司 | Intelligent structured data cleaning method |
CN110457302A (en) * | 2019-07-31 | 2019-11-15 | 河南开合软件技术有限公司 | Intelligent structured data cleaning method |
CN113127766A (en) * | 2019-12-31 | 2021-07-16 | 飞书数字科技(上海)有限公司 | Method and device for acquiring advertisement interest words, storage medium and processor |
CN113127766B (en) * | 2019-12-31 | 2023-04-14 | 飞书数字科技(上海)有限公司 | Method and device for acquiring advertisement interest words, storage medium and processor |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111639250B (en) * | 2020-06-05 | 2023-05-16 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111858963A (en) * | 2020-07-28 | 2020-10-30 | 中国银行股份有限公司 | Webpage customer service knowledge extraction method and device |
CN111858963B (en) * | 2020-07-28 | 2024-02-23 | 中国银行股份有限公司 | Webpage customer service knowledge extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108021598B (en) | 2022-05-03 |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20200526. Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 14 floor tower square. Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |