CN108021598A - Page extraction template matching method, device and server - Google Patents

Page extraction template matching method, device and server

Info

Publication number
CN108021598A
Authority
CN
China
Prior art keywords
target pages
page
default
extraction
extraction template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610977262.2A
Other languages
Chinese (zh)
Other versions
CN108021598B (en
Inventor
吴伟勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd filed Critical Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201610977262.2A priority Critical patent/CN108021598B/en
Publication of CN108021598A publication Critical patent/CN108021598A/en
Application granted granted Critical
Publication of CN108021598B publication Critical patent/CN108021598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a page extraction template matching method, device and server. The method includes: obtaining the page data of a target page and extracting a feature field from the page data; generating a matching label for the target page according to the feature field; using the matching label to look up, in a preset page extraction template library, the preset extraction template corresponding to the target page; and marking the target page with a template matching result according to the lookup result. Because the preset extraction template is looked up through a matching label generated from the feature field, the range of preset extraction templates to be matched is greatly narrowed, the target page is matched to a preset extraction template more quickly and accurately, and the efficiency of the page transcoding service is improved.

Description

Page extraction template matching method, device and server
Technical field
The present invention relates to the technical field of network information, and in particular to a page extraction template matching method, device and server.
Background
With the rapid spread of smart mobile terminals, users commonly browse website pages on a mobile terminal. However, many of the pages that major websites currently provide are still designed for display in PC (personal computer) browsers, and relatively few are WAP (Wireless Application Protocol) pages suited to mobile browsers. Because of factors such as the screen size of the mobile terminal and mobile data traffic, browsing traditional web pages directly in a mobile browser gives a poor user experience. For this reason, while a user browses website pages, a search engine or browser often applies page transcoding technology to the pages being browsed so that they are adapted for display in the mobile terminal's browser.
In a transcoding service, a server generally uses an extraction template to extract the content of the page the user wants to browse; the page content is screened and filtered, then laid out again and displayed to the user. The resulting layout is better suited to display on a mobile terminal, saves the user data traffic, and improves page response speed.
The inventor's research found that, because the number of websites is now very large, a traditional page transcoding scheme has to try multiple preset extraction templates in turn on a page of a newly added website, extracting page content with each one to judge whether that extraction template suits the page. This template matching process is time-consuming and occupies considerable server computing resources.
Summary of the invention
In view of this, in order to match website pages with extraction templates more quickly and accurately, an object of the present invention is to provide a page extraction template matching method, applied to a server, the method including:
Obtaining the page data of a target page, and extracting a feature field from the page data.
Generating a matching label for the target page according to the feature field.
Using the matching label to look up, in a preset page extraction template library, the preset extraction template corresponding to the target page.
Marking the target page with a template matching result according to the lookup result.
Another object of the present invention is to provide a page extraction template matching device, applied to a server, the device including:
A page feature acquisition module, configured to obtain the page data of a target page and extract a feature field from the page data.
A matching label generation module, configured to generate a matching label for the target page according to the feature field.
A lookup module, configured to use the matching label to look up, in a preset page extraction template library, the preset extraction template corresponding to the target page.
A matching result marking module, configured to mark the target page with a template matching result according to the lookup result.
Another object of the present invention is to provide a server, the server including:
A memory.
A processor. And
A page extraction template matching device, the device being installed in the memory and including one or more software function modules executed by the processor, the device including:
A page feature acquisition module, configured to obtain the page data of a target page and extract a feature field from the page data.
A matching label generation module, configured to generate a matching label for the target page according to the feature field.
A lookup module, configured to use the matching label to look up, in a preset page extraction template library, the preset extraction template corresponding to the target page.
A matching result marking module, configured to mark the target page with a template matching result according to the lookup result.
Compared with the prior art, the present invention has the following beneficial effects:
The page extraction template matching method, device and server provided by the present invention extract a feature field from the page file of a target page, generate a matching label from the feature field, and use the matching label to look up the preset extraction template corresponding to the target page. This greatly narrows the range of preset extraction templates to be matched, matches the target page to a preset extraction template more quickly and accurately, and improves the efficiency of the page transcoding service.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a schematic diagram of the server provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the page extraction template matching method provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of the sub-steps of step S110 shown in Fig. 2;
Fig. 4 is a functional block diagram of the page extraction template matching device provided by an embodiment of the present invention;
Fig. 5 is a block diagram of the submodules included in the page feature acquisition module shown in Fig. 4;
Fig. 6 is a block diagram of the submodules included in the matching result marking module shown in Fig. 4.
Reference numerals: 100 - server; 110 - page extraction template matching device; 111 - page feature acquisition module; 1111 - head tag extraction submodule; 1112 - non-feature field removal submodule; 112 - matching label generation module; 1121 - first identification submodule; 1122 - second identification submodule; 113 - lookup module; 114 - matching result marking module; 120 - memory; 130 - processor; 140 - communication unit.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
In the description of the present invention, it should be noted that the terms "first", "second", "third" and so on are used only to distinguish one description from another and are not to be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "installed", "connected" and "connection" are to be interpreted broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary, or it may be an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
Referring to Fig. 1, Fig. 1 is a schematic diagram of the server 100 provided by an embodiment of the present invention. The server 100 includes a page extraction template matching device 110, a memory 120, a processor 130 and a communication unit 140.
The memory 120, the processor 130 and the communication unit 140 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The page extraction template matching device 110 includes at least one software function module that can be stored in the memory 120 in the form of software or firmware, or built into the operating system (OS) of the server 100. The processor 130 is configured to execute the executable modules stored in the memory 120, such as the software function modules and computer programs included in the page extraction template matching device 110.
The memory 120 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on. The memory 120 is used to store a program, and the processor 130 executes the program after receiving an execution instruction.
The communication unit 140 is used to establish a communication connection between the server 100 and other terminals for data transmission and interaction.
Referring to Fig. 2, Fig. 2 is a flow diagram of the page extraction template matching method applied to the server 100 shown in Fig. 1. The method includes the following steps.
Step S110: obtain the page data of a target page and extract the feature field from the page data.
In general, website pages are HTML files written to the HTML language specification. The inventor's research found that small and medium-sized websites of the same type with similar display structures often have the same or similar HTML data structure. For websites with the same or similar HTML data structure, page content can be extracted with the same extraction template.
In addition, HTML files with the same or similar data structure generally share identical feature fields. Therefore, in this embodiment, the server 100 extracts the feature field of the target page's HTML file and uses it to look up the corresponding page extraction template.
Specifically, referring to Fig. 3, in this embodiment step S110 may include sub-step S111 and sub-step S112, which are described in detail below.
Sub-step S111: obtain the HTML page file of the target page and extract the head tag from the HTML page file.
Specifically, an HTML file is generally divided into a head section and a body section.
The head tag serves to define the scripts referenced by the HTML file, tell the browser where to find style sheets, provide meta-information, and so on. It describes the attributes and information of the HTML page file, including the title displayed for the page, the page's location on the network, and its relationship to other documents. The head tags of HTML files from websites with the same or similar data structure generally have the same structure and similar content.
Therefore, in this embodiment, the server 100 extracts the head tag of the HTML file and obtains the feature field from it through sub-step S112 below.
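Sub-step S111 can be illustrated with a minimal Python sketch. The patent does not say how the head section is located, so the regular expression and the function name `extract_head` below are assumptions made purely for illustration; a production system would more likely use a full HTML parser.

```python
import re

# Assumption: the head section is located with a simple regular expression.
# The patent does not prescribe a parsing technique.
_HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)


def extract_head(html: str) -> str:
    """Return the inner content of the first <head> element, or '' if absent."""
    match = _HEAD_RE.search(html)
    return match.group(1).strip() if match else ""
```

The `\b` after `head` keeps the pattern from matching a `<header>` element in the page body.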
Sub-step S112: remove the non-feature fields from the head tag, and take the remaining fields as the feature field of the target page.
Specifically, taking a novel-reading website as an example, its head tag is as follows:
<head>
<meta charset="gbk"/>
<title>【Page title】</title>
<meta name="keywords" content="【Page keyword】"/>
<meta name="description" content="【Main contents of webpages】"/>
<meta name="MobileOptimized" content="240"/>
<meta name="applicable-device" content="mobile"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0"/>
<link rel="shortcut icon" href="/favicon.ico"/>
<meta http-equiv="Cache-Control" content="max-age=300"/>
<meta http-equiv="Cache-Control" content="no-transform"/>
<link rel="stylesheet" type="text/css" href="/style/style.css"/>
<script src="/wap/qijixs/wap.js"></script>
</head>
The head tag contains several different child tags, such as the title child tag and the metadata (meta) child tag.
The title child tag defines the title of the page; the text inside it is displayed in the browser's title bar and in the taskbar (for example, the Microsoft Windows taskbar). The content of the title child tag generally differs from one website to another.
The meta child tag provides meta-information about the page, such as the description and keywords intended for search engines and the update frequency. Within the meta child tags, the keywords parameter gives the page's keywords, and the description parameter gives a brief summary of the page's main content. The content values of the keywords and description parameters generally differ between the head tags of different websites.
The inventor's research found that the head tags of HTML files with the same or similar structure are identical except for the text in the title child tag and the content of the keywords and description parameters in the meta child tags. Therefore, in this embodiment, after extracting the content of the head tag, the server 100 removes the text in the title child tag and the content of the keywords and description parameters in the meta child tags, and takes the remainder of the head tag as the feature field.
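The removal of non-feature fields in sub-step S112 might be sketched as follows. This is only an illustration under stated assumptions: the function name `feature_field`, the regular expressions, and the whitespace normalisation at the end are not taken from the patent, which does not describe how the removal is implemented.

```python
import re

# All patterns below are illustrative assumptions, not the patented implementation.
_TITLE_RE = re.compile(r"(<title\b[^>]*>).*?(</title\s*>)", re.IGNORECASE | re.DOTALL)
_META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
_NAME_RE = re.compile(r"""\bname\s*=\s*["']?(keywords|description)["']?""", re.IGNORECASE)
_CONTENT_RE = re.compile(r"""(\bcontent\s*=\s*)("[^"]*"|'[^']*')""", re.IGNORECASE)


def _blank_meta_content(match: "re.Match[str]") -> str:
    """Empty the content attribute of keywords/description meta tags only."""
    tag = match.group(0)
    if _NAME_RE.search(tag):
        tag = _CONTENT_RE.sub(r'\1""', tag)
    return tag


def feature_field(head: str) -> str:
    """Return the head content with the site-specific parts removed."""
    head = _TITLE_RE.sub(r"\1\2", head)          # drop the title text
    head = _META_RE.sub(_blank_meta_content, head)
    # Assumption: collapse whitespace so formatting differences do not matter.
    return re.sub(r"\s+", " ", head).strip()
```

With this sketch, two heads that differ only in title text, keywords and description reduce to the same string, which is the property the later hashing step relies on.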
In this way, through step S110 the server 100 can extract the shared feature field from pages with the same or similar HTML data structure, as the basis for the lookup and matching in the subsequent steps.
Step S120: generate the matching label of the target page according to the feature field.
Specifically, the server 100 converts the feature field into a corresponding hash value and uses the hash value as the matching label of the target page. In this embodiment, an MD5 operation may be applied to the feature field and the result Base64-encoded; the encoded value serves as the matching label.
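The MD5-then-Base64 conversion described in step S120 can be sketched with the Python standard library. Two details are assumptions, because the text does not fix them: the feature field is encoded as UTF-8 before hashing, and the raw 16-byte digest (rather than its hexadecimal form) is Base64-encoded. The function name `matching_label` is likewise illustrative.

```python
import base64
import hashlib


def matching_label(feature_field: str) -> str:
    """Hash a feature field with MD5 and Base64-encode the digest."""
    # Assumption: UTF-8 text encoding and the raw digest as Base64 input.
    digest = hashlib.md5(feature_field.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")
```

Under these assumptions every label is a fixed 24 characters long regardless of the size of the head section, which is what makes it cheap to store and compare.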
In this way, through step S120 the server 100 converts the relatively large feature field into a relatively small matching label, which makes it easier for the server 100 to use the matching label in the subsequent steps to look up the preset extraction template that matches the target page.
Step S130: use the matching label to look up, in the preset page extraction template library, the preset extraction template corresponding to the target page.
For the specific HTML data structures of different types of website, the server 100 stores corresponding preset extraction templates. Each preset extraction template carries a template label, which is derived from that specific HTML data structure in the same way as described in steps S110 and S120 of this embodiment; the details are not repeated here.
The server 100 searches the preset page extraction template library in turn for a template label identical to the matching label of the target page, and takes the preset extraction template corresponding to that template label as the preset extraction template matching the target page.
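The lookup in step S130 might look like the following sketch. The `ExtractionTemplate` class, its fields and the function `find_template` are hypothetical names introduced here; the patent only states that each preset template carries a template label and that the library is searched in turn for an identical label.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExtractionTemplate:
    """Hypothetical record for one preset extraction template."""
    template_label: str  # label derived as in steps S110 and S120
    name: str            # illustrative identifier, not from the patent


def find_template(library: Iterable[ExtractionTemplate],
                  label: str) -> Optional[ExtractionTemplate]:
    """Scan the library in turn and return the first template whose label matches."""
    for template in library:
        if template.template_label == label:
            return template
    return None
```

Because the labels are short fixed-length strings, the sequential scan could also be replaced by a dictionary keyed on the template label; that is a design option, not something the source specifies.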
Step S140: mark the target page with a template matching result according to the lookup result.
In this embodiment, the server 100 marks the target page with a matching result according to the lookup result of step S130, as follows.
If no preset extraction template matching the target page is found in the preset page extraction template library, the target page is marked as a page with no adapted extraction template. This indicates that no preset extraction template matches the target page, and a new extraction template needs to be created according to the target page's HTML data structure.
If a preset extraction template matching the target page is found in the preset page extraction template library, extraction verification is performed on the target page with the found preset extraction template, and the target page is marked according to the verification result.
Specifically, the preset extraction template includes multiple extraction items for extracting different page content. The server 100 uses the found preset extraction template to extract page content from the target page and verifies whether the target page contains page data corresponding to the multiple extraction items.
If the target page contains page data corresponding to the multiple extraction items, the verification passes. The server 100 then marks the target page as a page with an adapted extraction template and records the correspondence between the target page's domain name and the preset extraction template, so that the next time the target page needs to be displayed, the corresponding preset extraction template can be called directly to extract the page content, which is then rendered and displayed.
If the target page does not contain page data corresponding to the multiple extraction items, the verification fails, and the server 100 marks the target page as a page with no adapted extraction template.
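The verification and marking logic of step S140 can be sketched as below. Everything concrete here is an assumption for illustration: extraction items are modelled as functions built from regular expressions, verification is taken to pass only when every item yields non-empty data, and the names `mark_page`, `first_group`, `ADAPTED` and `NOT_ADAPTED` do not come from the patent.

```python
import re
from typing import Callable, Dict, Optional

ADAPTED = "adapted"
NOT_ADAPTED = "no_adapted_template"

# Assumption: an extraction item returns the extracted text, or None if absent.
ExtractionItem = Callable[[str], Optional[str]]


def first_group(pattern: str) -> ExtractionItem:
    """Build an illustrative extraction item from a regex with one capture group."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


def mark_page(domain: str, html: str, template_name: Optional[str],
              items: Dict[str, ExtractionItem],
              domain_to_template: Dict[str, str]) -> str:
    """Mark a page and, on success, record the domain-to-template mapping."""
    if template_name is None:                      # lookup in S130 found nothing
        return NOT_ADAPTED
    if not all(item(html) for item in items.values()):
        return NOT_ADAPTED                         # some item yielded no data
    domain_to_template[domain] = template_name     # reused on the next visit
    return ADAPTED
```

The recorded mapping is what lets a later request for the same domain skip the label computation and lookup entirely.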
Referring to Fig. 4, this embodiment also provides a page extraction template matching device 110, applied to the server 100. The page extraction template matching device 110 includes a page feature acquisition module 111, a matching label generation module 112, a lookup module 113 and a matching result marking module 114.
The page feature acquisition module 111 is configured to obtain the page data of a target page and extract the feature field from the page data. In this embodiment, the page feature acquisition module 111 may be used to perform step S110 shown in Fig. 2; for a detailed description of the module, refer to the description of step S110.
Specifically, referring to Fig. 5, the page feature acquisition module 111 includes a head tag extraction submodule 1111 and a non-feature field removal submodule 1112.
The head tag extraction submodule 1111 is configured to obtain the HTML page file of the target page and extract the head tag from the HTML page file. In this embodiment, the head tag extraction submodule 1111 may be used to perform sub-step S111 shown in Fig. 3; for a detailed description of the submodule, refer to the description of sub-step S111.
The non-feature field removal submodule 1112 is configured to remove the non-feature fields from the head tag and take the remaining fields as the feature field of the target page.
In this embodiment, the non-feature fields removed by the non-feature field removal submodule 1112 include the text in the title child tag of the head tag and the content of the page keyword parameter and the page description parameter in the metadata child tags. The non-feature field removal submodule 1112 may be used to perform sub-step S112 shown in Fig. 3; for a detailed description of the submodule, refer to the description of sub-step S112.
The matching label generation module 112 is configured to generate the matching label of the target page according to the feature field.
Specifically, the matching label generation module 112 generates the matching label as follows:
The feature field is converted into a corresponding hash value, and the hash value is used as the matching label of the target page. In this embodiment, the matching label generation module 112 may be used to perform step S120 shown in Fig. 2; for a detailed description of the module, refer to the description of step S120.
The lookup module 113 is configured to use the matching label to look up, in the preset page extraction template library, the preset extraction template corresponding to the target page.
Specifically, in this embodiment, each preset extraction template includes a template label corresponding to that preset extraction template. The lookup module 113 looks up the preset extraction template matching the target page as follows:
The preset page extraction template library is searched in turn for a template label identical to the matching label of the target page, and the preset extraction template corresponding to that template label is taken as the preset extraction template matching the target page. In this embodiment, the lookup module 113 may be used to perform step S130 shown in Fig. 2; for a detailed description of the module, refer to the description of step S130.
The matching result marking module 114 is configured to mark the target page with a template matching result according to the lookup result. In this embodiment, the matching result marking module 114 may be used to perform step S140 shown in Fig. 2; for a detailed description of the module, refer to the description of step S140.
Specifically, referring to Fig. 6, the matching result marking module 114 may include a first identification submodule 1121 and a second identification submodule 1122.
The first identification submodule 1121 is configured to mark the target page as a page with no adapted extraction template when no preset extraction template matching the target page is found in the preset page extraction template library.
The second identification submodule 1122 is configured to, when a preset extraction template matching the target page is found in the preset page extraction template library, perform extraction verification on the target page with the found preset extraction template and mark the target page according to the verification result.
In this embodiment, the preset extraction template includes multiple extraction items for extracting different page content. The second identification submodule 1122 performs extraction verification on the target page with the found preset extraction template and marks the target page according to the verification result as follows:
Page content is extracted from the target page using the found preset extraction template, and it is verified whether the target page contains page data corresponding to the multiple extraction items;
If the target page contains page data corresponding to the multiple extraction items, the target page is marked as a page with an adapted extraction template, and the correspondence between the target page's domain name and the preset extraction template is recorded;
If the target page does not contain page data corresponding to the multiple extraction items, the target page is marked as a page with no adapted extraction template.
In summary, the page extraction template matching method, device and server 100 provided by the present invention extract a feature field from the page file of a target page, generate a matching label from the feature field, and use the matching label to look up the preset extraction template corresponding to the target page. This greatly narrows the range of preset extraction templates to be matched, matches the target page to a preset extraction template more quickly and accurately, and improves the efficiency of the page transcoding service.
In the embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architecture, functions and operation of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the function modules in the embodiments of the present invention may be integrated to form one independent part, each module may exist separately, or two or more modules may be integrated to form one independent part.
If the functions are implemented in the form of software function modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The foregoing is only a preferred embodiment of the present invention, is not intended to limit the invention, for the skill of this area For art personnel, the invention may be variously modified and varied.Within the spirit and principles of the invention, that is made any repaiies Change, equivalent substitution, improvement etc., should all be included in the protection scope of the present invention.It should be noted that:Similar label and letter exists Similar terms is represented in following attached drawing, therefore, once being defined in a certain Xiang Yi attached drawing, is then not required in subsequent attached drawing It is further defined and is explained.
The above description is merely a specific embodiment, but protection scope of the present invention is not limited thereto, any Those familiar with the art the invention discloses technical scope in, change or replacement can be readily occurred in, should all be contained Cover within protection scope of the present invention.Therefore, protection scope of the present invention answers the scope of the claims of being subject to.

Claims (20)

  1. A page extraction template matching method, applied to a server, characterized in that the method comprises:
    obtaining page data of a target page and extracting feature fields from the page data;
    generating a matching label for the target page according to the feature fields;
    using the matching label to search a preset page extraction template library for a preset extraction template corresponding to the target page; and
    marking the target page with a template matching result according to the search result.
  2. The method according to claim 1, characterized in that the step of obtaining page data of a target page and extracting feature fields from the page data comprises:
    obtaining the HTML page file of the target page and extracting the head tag from the HTML page file; and
    removing non-feature fields from the head tag and using the remaining fields as the feature fields of the target page.
  3. The method according to claim 2, characterized in that the removed non-feature fields comprise:
    the text in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the metadata sub-tags.
  4. The method according to claim 1, characterized in that the step of generating a matching label for the target page according to the feature fields comprises:
    converting the feature fields into a corresponding hash value and using the hash value as the matching label of the target page.
  5. The method according to claim 1, characterized in that each preset extraction template has a template label corresponding to that preset extraction template; the step of using the matching label to search the preset page extraction template library for a preset extraction template corresponding to the target page comprises:
    searching the preset page extraction template library sequentially for a template label identical to the matching label of the target page, and taking the preset extraction template corresponding to that template label as the preset extraction template matching the target page.
  6. The method according to claim 1, characterized in that the step of marking the target page with a template matching result according to the search result comprises:
    if no preset extraction template matching the target page is found in the preset page extraction template library, marking the target page as a page with no adapted extraction template;
    if a preset extraction template matching the target page is found in the preset page extraction template library, performing extraction verification on the found preset extraction template against the target page, and marking the target page according to the verification result.
  7. The method according to claim 6, characterized in that the preset extraction template comprises multiple extraction items for extracting different page content; the step of performing extraction verification on the found preset extraction template against the target page comprises:
    extracting page content from the target page using the found preset extraction template, and verifying whether the target page contains page data corresponding to the multiple extraction items;
    if the target page contains page data corresponding to the multiple extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the preset extraction template;
    if the target page does not contain page data corresponding to the multiple extraction items, marking the target page as a page with no adapted extraction template.
  8. A page extraction template matching device, applied to a server, characterized in that the device comprises:
    a page feature acquisition module, configured to obtain page data of a target page and extract feature fields from the page data;
    a matching label generation module, configured to generate a matching label for the target page according to the feature fields;
    a search module, configured to use the matching label to search a preset page extraction template library for a preset extraction template corresponding to the target page; and
    a matching result marking module, configured to mark the target page with a template matching result according to the search result.
  9. The device according to claim 8, characterized in that the page feature acquisition module comprises:
    a head tag extraction submodule, configured to obtain the HTML page file of the target page and extract the head tag from the HTML page file; and
    a non-feature field removal submodule, configured to remove non-feature fields from the head tag and use the remaining fields as the feature fields of the target page.
  10. The device according to claim 9, characterized in that the non-feature fields removed by the non-feature field removal submodule comprise:
    the text in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the metadata sub-tags.
  11. The device according to claim 8, characterized in that the matching label generation module generates the matching label by:
    converting the feature fields into a corresponding hash value and using the hash value as the matching label of the target page.
  12. The device according to claim 8, characterized in that each preset extraction template has a template label corresponding to that preset extraction template; the search module searches for the preset extraction template matching the target page by:
    searching the preset page extraction template library sequentially for a template label identical to the matching label of the target page, and taking the preset extraction template corresponding to that template label as the preset extraction template matching the target page.
  13. The device according to claim 8, characterized in that the matching result marking module comprises:
    a first marking submodule, configured to mark the target page as a page with no adapted extraction template when no preset extraction template matching the target page is found in the preset page extraction template library; and
    a second marking submodule, configured to, when a preset extraction template matching the target page is found in the preset page extraction template library, perform extraction verification on the found preset extraction template against the target page, and mark the target page according to the verification result.
  14. The device according to claim 13, characterized in that the preset extraction template comprises multiple extraction items for extracting different page content; the second marking submodule performs extraction verification on the found preset extraction template against the target page, and marks the target page according to the verification result, by:
    extracting page content from the target page using the found preset extraction template, and verifying whether the target page contains page data corresponding to the multiple extraction items;
    if the target page contains page data corresponding to the multiple extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the preset extraction template;
    if the target page does not contain page data corresponding to the multiple extraction items, marking the target page as a page with no adapted extraction template.
  15. A server, characterized in that the server comprises:
    a memory;
    a processor; and
    a page extraction template matching device, the device being installed in the memory and comprising one or more software functional modules executed by the processor, the device comprising:
    a page feature acquisition module, configured to obtain page data of a target page and extract feature fields from the page data;
    a matching label generation module, configured to generate a matching label for the target page according to the feature fields;
    a search module, configured to use the matching label to search a preset page extraction template library for a preset extraction template corresponding to the target page; and
    a matching result marking module, configured to mark the target page with a template matching result according to the search result.
  16. The server according to claim 15, characterized in that the page feature acquisition module comprises:
    a head tag extraction submodule, configured to obtain the HTML page file of the target page and extract the head tag from the HTML page file; and
    a non-feature field removal submodule, configured to remove non-feature fields from the head tag and use the remaining fields as the feature fields of the target page.
  17. The server according to claim 16, characterized in that the non-feature fields removed by the non-feature field removal submodule comprise:
    the text in the title sub-tag of the head tag, and the content of the page keyword parameter and the page description parameter in the metadata sub-tags.
  18. The server according to claim 15, characterized in that the matching label generation module generates the matching label by:
    converting the feature fields into a corresponding hash value and using the hash value as the matching label of the target page.
  19. The server according to claim 15, characterized in that the matching result marking module comprises:
    a first marking submodule, configured to mark the target page as a page with no adapted extraction template when no preset extraction template matching the target page is found in the preset page extraction template library; and
    a second marking submodule, configured to, when a preset extraction template matching the target page is found in the preset page extraction template library, perform extraction verification on the found preset extraction template against the target page, and mark the target page according to the verification result.
  20. The server according to claim 19, characterized in that the preset extraction template comprises multiple extraction items for extracting different page content; the second marking submodule performs extraction verification on the found preset extraction template against the target page, and marks the target page according to the verification result, by:
    extracting page content from the target page using the found preset extraction template, and verifying whether the target page contains page data corresponding to the multiple extraction items;
    if the target page contains page data corresponding to the multiple extraction items, marking the target page as a page with an adapted extraction template, and recording the correspondence between the domain name of the target page and the preset extraction template;
    if the target page does not contain page data corresponding to the multiple extraction items, marking the target page as a page with no adapted extraction template.
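
The flow of method claims 1 to 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. The following are assumptions made purely for illustration: the SHA-256 hash, the dictionary lookup (used here in place of the sequential search in claim 5), the regular-expression extraction items, the result strings, and all names such as `HeadFeatureParser`, `matching_label`, and `match_template`.

```python
import hashlib
import re
from html.parser import HTMLParser

# Assumption: these meta names carry page-specific content (claim 3).
NON_FEATURE_META = {"keywords", "description"}


class HeadFeatureParser(HTMLParser):
    """Collects the tags inside <head>, dropping page-specific text."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        if not self.in_head:
            return
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() in NON_FEATURE_META:
            attrs.pop("content", None)  # keep the tag, drop its variable content
        # Title text is delivered to handle_data, which is not overridden,
        # so it never reaches the feature fields.
        self.fields.append((tag, tuple(sorted(attrs.items()))))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def matching_label(html):
    """Hashes the remaining head fields into a matching label (claim 4)."""
    parser = HeadFeatureParser()
    parser.feed(html)
    return hashlib.sha256(repr(parser.fields).encode("utf-8")).hexdigest()


def match_template(html, library, domain, records):
    """Looks up a template by label, then verifies it (claims 5 to 7).

    `library` maps template labels to hypothetical template dicts of the
    form {"name": str, "items": {item_name: regex}}.
    """
    template = library.get(matching_label(html))
    if template is None:
        return "no_adapted_template", None
    # Extraction verification: every extraction item must yield page data.
    if all(re.search(p, html, re.S) for p in template["items"].values()):
        records[domain] = template["name"]  # domain-to-template correspondence
        return "adapted_template", template["name"]
    return "no_adapted_template", None
```

Under these assumptions, two pages that share the same head structure but differ in title, keywords, or description produce the same label, so a template registered for one page is found for the other and then confirmed by the extraction check.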
CN201610977262.2A 2016-11-04 2016-11-04 Page extraction template matching method and device and server Active CN108021598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610977262.2A CN108021598B (en) 2016-11-04 2016-11-04 Page extraction template matching method and device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610977262.2A CN108021598B (en) 2016-11-04 2016-11-04 Page extraction template matching method and device and server

Publications (2)

Publication Number Publication Date
CN108021598A (en) 2018-05-11
CN108021598B (en) 2022-05-03

Family

ID=62083683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610977262.2A Active CN108021598B (en) 2016-11-04 2016-11-04 Page extraction template matching method and device and server

Country Status (1)

Country Link
CN (1) CN108021598B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950312A (en) * 2010-08-18 2011-01-19 赵清政 Method for analyzing webpage content of internet
CN104035753A (en) * 2013-03-04 2014-09-10 优视科技有限公司 Double-WebView customized page display method and system
CN103678509A (en) * 2013-11-25 2014-03-26 北京奇虎科技有限公司 Method and device for generating webpage template
CN104866527A (en) * 2015-04-24 2015-08-26 美通云动(北京)科技有限公司 Dynamic webpage template matching method and device

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062881A (en) * 2018-07-11 2018-12-21 政采云有限公司 Purchase bidding documenting method and system
CN110895463A (en) * 2018-09-13 2020-03-20 百度在线网络技术(北京)有限公司 Label processing method, device, equipment and computer readable storage medium
CN110059272A (en) * 2018-11-02 2019-07-26 阿里巴巴集团控股有限公司 A kind of page feature recognition methods and device
CN110059272B (en) * 2018-11-02 2023-08-15 创新先进技术有限公司 Page feature recognition method and device
CN109582909B (en) * 2018-12-19 2021-08-10 拉扎斯网络科技(上海)有限公司 Webpage automatic generation method and device, electronic equipment and storage medium
CN109582909A (en) * 2018-12-19 2019-04-05 拉扎斯网络科技(上海)有限公司 Webpage automatic generation method, device, electronic equipment and storage medium
CN110457302B (en) * 2019-07-31 2022-04-29 河南开合软件技术有限公司 Intelligent structured data cleaning method
CN110457302A (en) * 2019-07-31 2019-11-15 河南开合软件技术有限公司 A kind of structural data intelligence cleaning method
CN113127766A (en) * 2019-12-31 2021-07-16 飞书数字科技(上海)有限公司 Method and device for acquiring advertisement interest words, storage medium and processor
CN113127766B (en) * 2019-12-31 2023-04-14 飞书数字科技(上海)有限公司 Method and device for acquiring advertisement interest words, storage medium and processor
CN111639250A (en) * 2020-06-05 2020-09-08 深圳市小满科技有限公司 Enterprise description information acquisition method and device, electronic equipment and storage medium
CN111639250B (en) * 2020-06-05 2023-05-16 深圳市小满科技有限公司 Enterprise description information acquisition method and device, electronic equipment and storage medium
CN111858963A (en) * 2020-07-28 2020-10-30 中国银行股份有限公司 Webpage customer service knowledge extraction method and device
CN111858963B (en) * 2020-07-28 2024-02-23 中国银行股份有限公司 Webpage customer service knowledge extraction method and device

Also Published As

Publication number Publication date
CN108021598B (en) 2022-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200526

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 14 floor tower square

Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.

GR01 Patent grant