CN104679804A - Web page multiple attribute marking method and implementation thereof - Google Patents

Web page multiple attribute marking method and implementation thereof

Info

Publication number
CN104679804A
Authority
CN
China
Prior art keywords
attribute
webpage
information
web page
assignment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410176809.XA
Other languages
Chinese (zh)
Inventor
王建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo You Ce Information Technology Co Ltd
Original Assignee
Ningbo You Ce Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo You Ce Information Technology Co Ltd
Priority to CN201410176809.XA
Publication of CN104679804A
Current legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for marking multiple attributes of a web page, and an implementation thereof, and relates to the field of web page information processing. It provides an attribute recognition module, an attribute configuration module, and an attribute calling module for marking the multiple information attributes of web pages. Together these address, at the level of the whole system, how the selected information attributes of a page are recognized, organized, and marked by several means, and how the marking results and process can be called flexibly and repeatedly. Compared with the prior art, the definition of the attribute recognition module and the marking process give a unified method for marking multiple web page attributes. The method can improve the efficiency and accuracy of attribute marking and processing, provides a basis for calling the marking results and process conveniently and repeatedly during business processing, and can improve the processing efficiency of business systems that rely on multiple web page information attributes.

Description

Method for marking multiple attributes of a web page and implementation thereof
Technical field
The invention provides a method, and an implementation of it, for marking web pages that carry multiple information attributes; it belongs to information processing within the field of computer technology. Specifically, when web page information from the Internet (including the mobile Internet) is processed in real business workflows, pages often need to be marked with information attributes at multiple levels (several attribute categories) and in multiple dimensions (several attribute values within a given category). The resulting attribute set of a page then supports personalized search in search engines, accurate pushing in information push services, accurate classification in information classification, and topic clustering in information aggregation. To this end, the invention proposes a multiple-attribute marking method for web pages. It combines several techniques: direct attribute marking, attribute marking based on a keyword library, attribute marking that combines keywords with a rule base, and attribute marking with a classifier. Together these mark the multiple information attributes of a page effectively, so that attribute information can be marked, extracted, and applied conveniently and efficiently in practice.
Background
With the spread of the network, the web page has become the most common information carrier, and people obtain the knowledge and information they need by searching or by clicking directly on pages within a website. Beyond manual information processing, practical business applications built on search engines, information pushing, and information classification usually need to crawl pages scattered across different websites and process them automatically, for example by classification or clustering. When pages are classified or clustered, the application has to work from multiple levels and multiple dimensions, so web page information attributes are inherently multiple. Existing web page information processing methods usually start from a single attribute of the page. They lack a way to treat the various information attributes of a page as a whole, so that page attributes could be reused efficiently throughout the business process. In addition, information classification after existing attribute marking, particularly when a page is identified as belonging to several categories, is still inefficient and inaccurate. The method provided here marks web page information attributes multiply during early-stage processing; by combining the attribute set of a page, it supports efficient, flexible, and accurate information classification. The invention can be applied widely in search engines, information pushing, information classification, and personalized information display.
Summary of the invention
To address the above technical shortcomings, the invention starts from practical business needs and addresses how to derive multiple information attributes for a specific web page and assign their values, thereby providing a multiple-information-attribute marking method for web pages that is efficient, accurate, extensible, and reusable.
In practical information processing, the attribute of a web page (denoted P) is usually a vector of several attribute categories (denoted A), written as:
$$P = (A_1, A_2, \dots, A_n)$$
Here P is the web page information attribute matrix and $A_i$ ($i = 1, \dots, n$) is the i-th attribute category. A category usually has several attribute values, so $A_i$ is itself a vector. Because the number of specific attributes differs from category to category, we define the attribute domain of the web page attribute categories as V:
$$V = (v_1, v_2, \dots, v_n), \quad v_i \in R$$
where $v_i$ is the value for the i-th attribute category (the number of attributes in it); it can be any natural number, and R denotes the natural numbers.
Then $A_i$ can be expressed as:
$$A_i = (a_{i,1}, a_{i,2}, \dots, a_{i,v_i})$$
Therefore, the multiple attributes of a web page can be expressed as follows, where the i-th column lists the entries of $A_i$ and columns may differ in length because $v_i$ differs by category:
$$P = \begin{pmatrix} a_{1,1} & a_{2,1} & \cdots & a_{n,1} \\ a_{1,2} & a_{2,2} & \cdots & a_{n,2} \\ \vdots & \vdots & & \vdots \\ a_{1,v_1} & a_{2,v_2} & \cdots & a_{n,v_n} \end{pmatrix}$$
P is a matrix of 0s and 1s: each entry $a_{i,j}$ ($j = 1, \dots, v_i$) can only be 0 or 1. When $a_{i,j}$ is 0, the page does not have the j-th attribute of the i-th attribute category; when $a_{i,j}$ is 1, the page has that attribute.
Note that a web page can have several attribute categories, and a given category may take a single value but more often takes several. For example, when pages are classified, a page often belongs not to one class but to several at once. When attributes are assigned for that category, the corresponding attribute column vector may therefore have the value 1 in one position or in several positions.
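The model above can be sketched in code. The following Python sketch is an illustration only, not the patent's implementation: the class name `AttributeModel`, its methods, and the sample domains are assumptions introduced here. It keeps one 0/1 vector per attribute category, so vectors can differ in length as V requires.

```python
from dataclasses import dataclass


@dataclass
class AttributeModel:
    """Attribute categories and their domains (hypothetical helper class)."""

    # category name -> ordered list of attribute values in that category
    domains: dict[str, list[str]]

    def sizes(self) -> list[int]:
        """Return V = (v_1, ..., v_n): the number of attributes per category."""
        return [len(values) for values in self.domains.values()]

    def blank_matrix(self) -> dict[str, list[int]]:
        """Return P with every entry 0: one vector A_i per category."""
        return {name: [0] * len(values) for name, values in self.domains.items()}

    def mark(self, matrix: dict[str, list[int]], category: str, value: str) -> None:
        """Set the entry for (category, value) to 1."""
        matrix[category][self.domains[category].index(value)] = 1


# Sample domains: a subset of the values named in the text, for illustration.
model = AttributeModel(domains={
    "category": ["policy notice", "policy document", "policy knowledge"],
    "topic": ["technological transformation", "intellectual property"],
})

page = model.blank_matrix()
# A page may carry several values in one category at once.
model.mark(page, "category", "policy notice")
model.mark(page, "category", "policy document")

print(model.sizes())
print(page)
```

Adding a category or a value only extends `domains`, which is one way to read the extensibility the text claims for the model.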
The method for marking multiple attributes of a web page comprises the following steps:
1. Process the crawled web page data and information, determine the information attribute categories and the attribute domain of each category, and generate the web page attribute vector;
2. According to actual business needs, assign values (0 or 1) to the attribute value vectors. That is, following the requirements each web page attribute places on its values, assign attribute values either directly or by technical means, forming the web page information attribute matrix. To handle the different assignment situations of the various attributes, the invention assigns non-unique attribute values with techniques such as classifiers and keyword matching and, in particular, with a method that combines rules with keywords;
3. After the web page attribute vectors are marked, the computer stores the information attribute matrix and defines interfaces or calling methods for using the web page information attributes.
The technical implementation of the invention is a device for the multiple-attribute marking method, comprising an attribute recognition module, an attribute configuration module, and an attribute calling module, described below:
A. Attribute recognition and marking module: defines the web page attribute model at multiple levels (attribute categories) and in multiple dimensions (the number of specific attributes in each category). From the number of attribute categories and the number of attributes in each category, it defines and generates the web page attribute matrix and configures the storage mode and storage space of the attribute vectors. An attribute model defined this way is readily extensible;
B. Attribute configuration module: in practice, the multiple attributes of a web page (several categories, each with several attributes) are assigned by a variety of technical means. An attribute may be assigned directly when it is already determined, or assigned by techniques such as classifiers and keyword matching. In the former case the attribute value is usually determined and unique; in the latter it is often not unique. For non-unique attribute values, the invention specifically provides an assignment technique that combines rules with keywords. The attribute configuration module therefore integrates several technical assignment units: a unit based on keyword matching, a unit based on a classifier, and a unit that combines rules with keywords. The classifier unit and the keyword matching unit work the same way as in current practice; the technical difference is that the invention uses classifiers and keyword matching to mark attributes, which are then used for classification and other applications (such as information pushing and personalized search), rather than to classify directly. To improve the efficiency and accuracy of assignment, the unit that combines rules with keywords follows the principle "keyword matching first, rules afterwards";
C. Attribute calling module: sets the calling methods and technical interface specification for web page information attributes, so that they can be reused easily in each module of the business processing system and business can be processed efficiently and conveniently.
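One way to picture the attribute configuration module is as a registry that holds one assignment unit per attribute category. The Python sketch below is an assumption-laden illustration, not the patent's design: `AttributeConfigurator`, `keyword_unit`, `classifier_unit`, and the sample keywords are hypothetical names, and the "classifier" is a fixed stand-in rule rather than a trained model.

```python
from typing import Callable

# An assignment unit maps page text and a category's domain to a 0/1 vector.
AssignmentUnit = Callable[[str, list[str]], list[int]]


def keyword_unit(keywords: dict[str, list[str]]) -> AssignmentUnit:
    """Build a unit that sets 1 for each value with a (lowercase) keyword in the text."""
    def assign(text: str, domain: list[str]) -> list[int]:
        lowered = text.lower()
        return [int(any(k in lowered for k in keywords.get(value, []))) for value in domain]
    return assign


def classifier_unit(predict: Callable[[str], set[str]]) -> AssignmentUnit:
    """Wrap any classifier that returns a set of labels as an assignment unit."""
    def assign(text: str, domain: list[str]) -> list[int]:
        labels = predict(text)
        return [int(value in labels) for value in domain]
    return assign


class AttributeConfigurator:
    """Holds one assignment unit per attribute category (hypothetical class)."""

    def __init__(self, domains: dict[str, list[str]]):
        self.domains = domains
        self.units: dict[str, AssignmentUnit] = {}

    def register(self, category: str, unit: AssignmentUnit) -> None:
        self.units[category] = unit

    def configure(self, text: str) -> dict[str, list[int]]:
        """Build the attribute matrix; categories without a unit stay all 0."""
        matrix = {}
        for category, domain in self.domains.items():
            unit = self.units.get(category)
            matrix[category] = unit(text, domain) if unit else [0] * len(domain)
        return matrix


domains = {
    "category": ["policy notice", "policy document"],
    "industry": ["electronic information", "biomedicine"],
}
config = AttributeConfigurator(domains)
config.register("category", keyword_unit({
    "policy notice": ["notice"],
    "policy document": ["measures"],
}))
# Stand-in for a trained classifier: a fixed rule, used only for illustration.
config.register("industry", classifier_unit(
    lambda text: {"biomedicine"} if "drug" in text else set()
))

matrix = config.configure("Notice on issuing measures to support drug research")
print(matrix)
```

Because every unit returns a vector of marks rather than a single class, the output can feed classification, pushing, or search later, which matches the text's point that the units mark attributes instead of classifying directly.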
Beneficial effects of the invention:
1. The invention processes and marks web page information attributes in a unified way, from a systematic and overall perspective, so the multiple information attributes of a page can be recognized efficiently in a single pass. The marking scheme is also extensible: the attribute categories marked, and the number of marks within a category, can differ from page to page, which previous web page information processing methods did not offer.
2. By combining different technical means, including information attribute marking that combines keywords with a rule base, the invention can raise the efficiency of web page information attribute marking and improve its accuracy. It can replace manual attribute marking, and it can replace the conventional approach, widely used at present, of classifying information directly with a classifier.
3. Through the call settings defined after marking (that is, selecting certain attribute categories, and certain attributes within them, from the attribute set of a page), the multiple information attribute marking results can be called repeatedly. This improves system processing efficiency and provides a basis for business that needs to call attribute marks flexibly. Multiple-attribute marking can be applied to precise vertical search engines for policy information, personalized information push services, personalized information display, and policy topic generation services, improving system performance and accuracy and reducing the difficulty of implementing business logic.
The method for marking multiple attributes of a web page and its implementation are illustrated by three drawings:
Fig. 1 is the method flowchart;
Fig. 2 is the device diagram;
Fig. 3 is the composition diagram of the attribute configuration module.
Embodiment
The invention is further described below with reference to the drawings, taking the marking of multiple information attributes of web pages in the policy domain as an example. The following embodiment only illustrates the technical solution more clearly and does not limit the scope of protection of the invention.
Web pages in the policy information domain often carry several information attributes. For example, the information attribute vector of a page may be {category, region, industry, issuing department, topic, …}. The values of the category attribute may include {policy notice, Calendar of Petroleum Economic Events, policy document, policy knowledge, …}. The values of the region attribute may be composed as province + prefecture-level city + county, or as municipality directly under the central government + urban district. The values of the industry attribute are defined according to the needs of the specific application and may be organized as first-level, second-level, and third-level industries. The issuing department attribute may be divided by level into national, provincial, prefecture, and county or district; by department it may include administrative departments such as the people's government, development and reform, industry and information technology, commerce, taxation, and industry and commerce. The topic attribute may, according to business needs, include technological transformation, small and medium-sized enterprises, intellectual property, and so on.
Among these information attributes, the values of the category and industry attributes are often not unique. A page may be both a policy notice and a policy document, and a policy item may belong to several industries at once, such as electronic information, biomedicine, opto-mechatronics, and energy conservation and environmental protection. The values of the region and issuing department attributes may be unique, or a page may belong to several administrative regions and administrative departments at once.
As shown in Fig. 1, the method for marking multiple attributes of a web page comprises the following three steps:
1. According to business needs, define the multiple information attribute vector required for business processing for the policy pages fetched automatically by the crawler. For the policy domain pages above, the vector is {category, region, industry, issuing department, topic, …}. At the same time, determine the specific attribute domain of each category; for example, the domain of the category attribute is {policy notice, Calendar of Petroleum Economic Events, policy document, policy knowledge, …};
2. According to business needs, recognize and confirm the web page information attribute value vectors, assigning 0 or 1 to each entry of the matrix. Each attribute value vector can be assigned directly or by technical means. For the region and issuing department attributes of policy pages, the pre-defined attribute value vectors can be mapped when the page is crawled, so that the region and issuing department of the page are assigned directly, that is, taken directly from the web address. For the policy industry attribute, a classifier can be applied, which may set a single industry attribute or several. Besides a classifier, a specialized dictionary can also be used to recognize industry attributes, and one item can be marked with several industries. For the category attribute, keyword matching, and a rule base combined with keyword matching, quickly assign attributes such as policy notice, policy document, Calendar of Petroleum Economic Events, and policy knowledge. The result may be unique (only one value is 1) or several values may be 1; for example, a page that is both a policy notice and a policy document has both set to 1 and all other values set to 0;
3. Define the calling methods or interface specification for the information attributes, so that the attribute marks can be applied quickly in other business processing. In the policy information domain above, for example, calling methods can be defined for the industry and region attribute marks, and the industry attributes of pages can be matched against the industry attributes of users, providing personalized information push, personalized page display, and personalized precise search, and forming policy topics quickly from attribute sets.
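Step 3 can be pictured as a small store with query methods over the marked attributes. The Python sketch below is only one possible reading: the class `AttributeStore`, its method names, the `example.org` URLs, and the region names are all hypothetical and do not come from the patent.

```python
class AttributeStore:
    """Stores marked pages and answers queries over selected attributes."""

    def __init__(self, domains):
        self.domains = domains  # category -> ordered attribute values
        self.pages = {}         # url -> attribute matrix (category -> 0/1 vector)

    def save(self, url, matrix):
        self.pages[url] = matrix

    def values(self, url, category):
        """Return the attribute values marked 1 for one category of a page."""
        row = self.pages[url][category]
        return {v for v, bit in zip(self.domains[category], row) if bit == 1}

    def select(self, **wanted):
        """Return URLs whose marks include every requested category=value pair."""
        return sorted(
            url for url in self.pages
            if all(value in self.values(url, category)
                   for category, value in wanted.items())
        )

    def for_user(self, user_industries):
        """Return URLs that share at least one industry with the user."""
        return sorted(
            url for url in self.pages
            if self.values(url, "industry") & set(user_industries)
        )


domains = {
    "industry": ["electronic information", "biomedicine"],
    "region": ["Zhejiang", "Shanghai"],  # hypothetical region values
}
store = AttributeStore(domains)
store.save("https://example.org/a", {"industry": [1, 0], "region": [1, 0]})
store.save("https://example.org/b", {"industry": [1, 1], "region": [0, 1]})

print(store.select(industry="biomedicine"))
print(store.select(industry="electronic information", region="Zhejiang"))
print(store.for_user(["biomedicine"]))
```

Here `select` corresponds to choosing certain categories and values from the attribute set, and `for_user` to matching page industries against user industries for personalized pushing.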
As shown in Fig. 2, the device implementing the multiple-attribute marking method comprises three modules: the attribute recognition module, the attribute configuration module, and the attribute calling module. The attribute recognition module defines, recognizes, and stores the attributes of policy domain pages and pre-defines the attribute domain of each category. The attribute configuration module handles assignment: for attributes that can be assigned directly, it gives the page the corresponding values from the restrictions and definitions attached to the policy page source; for attributes whose values are not unique, it assigns values within the value vector space by technical means. The attribute calling module serves the information attribute marks that the business system needs to reuse: by defining the marking results according to the information attribute set, it reserves convenient interfaces for subsequent business processing.
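Direct assignment from the page source might look like the following Python sketch, which reads the host of the crawled URL and looks up pre-defined region and department values. The host names, the `SOURCE_ATTRIBUTES` table, and the function `assign_directly` are assumptions made for illustration; the patent says only that such values are taken directly from the web address.

```python
from urllib.parse import urlparse

# Hypothetical source table: crawl host -> directly assigned attribute values.
SOURCE_ATTRIBUTES = {
    "policy.province.example": {"region": "Province A", "department": "commerce"},
    "tax.city.example": {"region": "City B", "department": "taxation"},
}

# Hypothetical pre-defined attribute domains for the two direct attributes.
DOMAINS = {
    "region": ["Province A", "City B"],
    "department": ["commerce", "taxation"],
}


def assign_directly(url: str) -> dict[str, list[int]]:
    """Assign region and department from the page's host; unknown hosts stay 0."""
    host = urlparse(url).hostname or ""
    known = SOURCE_ATTRIBUTES.get(host, {})
    return {
        category: [int(known.get(category) == value) for value in domain]
        for category, domain in DOMAINS.items()
    }


print(assign_directly("https://tax.city.example/notice/1"))
print(assign_directly("https://other.example/page"))
```

A source that covers several regions or departments would need a list of values per host instead of a single value; the sketch keeps the unique-value case for brevity.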
As shown in Fig. 3, the attribute configuration module comprises a direct assignment unit and three classes of technical assignment unit that mark web page information attributes: a classifier assignment unit, a keyword matching assignment unit, and an assignment unit that combines rules with keyword matching. The classifier and keyword matching units are the same as current methods of their kind. For the unit that combines rules with keyword matching, in policy domain page processing the rules comprise two kinds, classification rules and rejection rules, and keywords are established mainly from key positions of the page (such as the title). When rules and keyword matching are combined for assignment, web page information attributes are marked on the principle "keyword matching first, rules afterwards".
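The patent does not spell out how keywords and rules interact beyond the ordering principle, so the Python sketch below rests on an assumed reading: title keywords produce candidate values first, then classification rules may add values and rejection rules may remove them. The keyword lists, the regular expressions, and the function `assign_category` are hypothetical.

```python
import re

DOMAIN = ["policy notice", "policy document", "policy knowledge"]

# Keywords looked up in the page title (a key position named in the text).
TITLE_KEYWORDS = {
    "policy notice": ["notice", "announcement"],
    "policy document": ["measures", "regulations"],
    "policy knowledge": ["guide", "faq"],
}

# Hypothetical rules: (pattern searched in the body, attribute value affected).
CLASSIFY_RULES = [(re.compile(r"hereby issued"), "policy document")]
REJECT_RULES = [(re.compile(r"interpretation of"), "policy document")]


def assign_category(title: str, body: str) -> list[int]:
    """Keyword matching first, then classification and rejection rules."""
    title_l, body_l = title.lower(), body.lower()
    # Step 1: keyword matching on the title produces candidate values.
    marked = {
        value for value, words in TITLE_KEYWORDS.items()
        if any(word in title_l for word in words)
    }
    # Step 2a: classification rules can add values the keywords missed.
    marked |= {value for pattern, value in CLASSIFY_RULES if pattern.search(body_l)}
    # Step 2b: rejection rules remove values that a rule excludes.
    marked -= {value for pattern, value in REJECT_RULES if pattern.search(body_l)}
    return [int(value in marked) for value in DOMAIN]


print(assign_category(
    "Notice on support for small enterprises",
    "The following measures are hereby issued.",
))
print(assign_category(
    "Guide: interpretation of the new measures",
    "An interpretation of the measures for readers.",
))
```

In the second call the title keyword "measures" first marks the page as a policy document and the rejection rule then clears that mark, which shows why the order of the two stages matters.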

Claims (2)

1. A method for marking multiple attributes of a web page and an implementation thereof, wherein the implementing device comprises three modules: A. an attribute recognition module, for defining the multiple information attribute model of web pages, recognizing the multiple information attribute vector of a web page, and defining the attribute domain vector according to business needs; B. an attribute configuration module, which, according to whether the attribute domain values of an attribute category are determined, assigns determined values to the multidimensional information attribute vector of the web page by direct assignment or by technical means; C. an attribute calling module, for setting the calling methods and technical interface specification for web page information attributes; and wherein the core of the method comprises: step A: processing the crawled web page data and information, determining the information attribute categories and the attribute domain of each category, and generating the web page attribute vector; step B: according to actual business needs, assigning values to the attribute value vectors; step C: defining calling methods and an interface specification for the marked web page information attributes.
2. The method for marking multiple attributes of a web page and implementation thereof according to claim 1, characterized by: a holistic, systematic, and extensible multiple information attribute model of web pages; a marking process that combines direct assignment with assignment by technical means; an information attribute marking technique combining rules with keyword matching that follows the principle "keyword matching first, rules afterwards"; and system calling methods and an interface specification for the multiple information attribute marking results or process.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410176809.XA CN104679804A (en) | 2014-04-30 | 2014-04-30 | Web page multiple attribute marking method and implementation thereof

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410176809.XA CN104679804A (en) | 2014-04-30 | 2014-04-30 | Web page multiple attribute marking method and implementation thereof

Publications (1)

Publication Number | Publication Date
CN104679804A (en) | 2015-06-03

Family

ID=53314853

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201410176809.XA Pending CN104679804A (en) | Web page multiple attribute marking method and implementation thereof | 2014-04-30 | 2014-04-30

Country Status (1)

Country | Link
CN (1) | CN104679804A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101963966A (en) * | 2009-07-24 | 2011-02-02 | 李占胜 | Method for sorting search results by adding labels into search results

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116304457A (en) * | 2023-02-27 | 2023-06-23 | 山东乾舜广告传媒有限公司 | Marking method for webpage multiple information attribute
CN116304457B (en) * | 2023-02-27 | 2024-03-29 | 山东乾舜广告传媒有限公司 | Marking method for webpage multiple information attribute

Legal Events

Code | Title | Description
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150603