CN108614825A - Web page feature extraction method and device - Google Patents

Web page feature extraction method and device

Publication number
CN108614825A
Authority
CN
China
Prior art keywords
feature
word
weighted value
value
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611137455.3A
Other languages
Chinese (zh)
Other versions
CN108614825B (en)
Inventor
吕颖韬
冯宜安
周璐
张贝金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd
Priority to CN201611137455.3A
Publication of CN108614825A
Application granted
Publication of CN108614825B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a web page feature extraction method and device. A target web page is divided into multiple document parts according to the positional structure of the web page information. Word segmentation is performed on each document part and the segmentation results are counted, yielding multiple sets that correspond to the document parts. A base position weight value is determined from the occurrence counts of the feature words in a first set, the first set being the set with the most data pairs among the multiple sets. The weight values of all sets other than the first set are determined from the base position weight value, preset weight ratio values, and those sets. The multiple sets and the weight values of all sets other than the first set are then integrated to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.

Description

Web page feature extraction method and device
Technical field
The present invention relates to feature extraction technology in the Internet field, and in particular to a web page feature extraction method and device.
Background art
Web page feature extraction is one of the key technologies for data analysis of web page content, and an important step in personality analysis of Internet users and in personalized service recommendation. The quality of the extracted features directly affects the quality of the personality analysis results, and in turn the quality of the personalized services provided to users. The extraction process is very sensitive to the structure of the web page, the richness of its vocabulary, and synonymy among words. An extraction algorithm needs to account for the influence of these factors on the result, avoid interference from other factors, and extract the feature words that best characterize the web page content.
In the prior art, web page feature extraction algorithms are mainly designed around the term frequency-inverse document frequency (TF-IDF) algorithm and around extraction techniques based on the Document Object Model (DOM) tree. TF-IDF is a weighting technique commonly used in information retrieval and data mining. It evaluates the importance of a word in a web page from the number of times the word occurs in the document and the number of documents across the whole network that contain the word, and it selects the feature words of the web page by that importance. DOM-tree-based extraction extracts data from Hypertext Markup Language (HTML) web pages according to their tree-like hierarchical structure, and extracts the feature words of a web page by optimizing its feature vector. Feature words obtained by DOM-tree-based extraction have relatively high precision and recall.
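As a point of reference for the prior art described above, the following is a minimal sketch of a TF-IDF score. The smoothing term in the denominator, the function name, and the toy corpus are assumptions made for illustration; they are not taken from the patent. Note that the score uses only counts, not the position of a word in the page.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term in doc_tokens by term frequency times inverse document frequency.

    corpus is a list of token lists. The +1 in the denominator is a common
    smoothing choice (an assumption, not specified by the patent).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        docs_with_term = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / (1 + docs_with_term))
        scores[term] = tf * idf
    return scores

# Hypothetical toy corpus of already-segmented documents.
corpus = [
    ["phone", "plan", "data"],
    ["phone", "news"],
    ["weather", "news"],
    ["sports", "score"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus, "plan" occurs in fewer documents than "phone", so it receives the higher score even though both occur once in the scored document.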
However, the calculation of term weights in TF-IDF has shortcomings. First, an HTML document differs greatly in structure from an ordinary document: it is semi-structured text, and a feature word that appears at different positions in the document characterizes the content to different degrees, so it should be assigned different weight values. Simply applying the IDF calculation is therefore neither scientific nor comprehensive. Second, TF-IDF discriminates poorly between classes: it can only distinguish a feature item within the text and within the class that the text belongs to, and cannot show well how the feature item differs across other classes. DOM-tree-based extraction, for its part, depends too heavily on page structure. Because it extracts data according to the tree-like hierarchy of HTML pages, the feature words it obtains have relatively high precision and recall, but it requires a number of corresponding sample pages. It can therefore be applied to different knowledge domains, but its heavy reliance on structure leaves it at a disadvantage when the page structure changes. In summary, each of the two basic approaches has its own limitation: insensitivity to the position of feature words, and excessive dependence on page structure, respectively.
Summary of the invention
To solve the above technical problems, embodiments of the present invention provide a web page feature extraction method and device that improve the quality of web page feature extraction results and help ensure the correctness of personality analysis data for Internet users.
The technical solution of the present invention is implemented as follows:
An embodiment of the present invention provides a web page feature extraction method, the method including:
obtaining a target web page, and dividing the target web page into multiple document parts according to the positional structure of the web page information;
performing word segmentation on each of the multiple document parts and counting the segmentation results to obtain multiple sets corresponding to the multiple document parts, where each document part corresponds to one set, each set includes at least one data pair, and each data pair includes a feature word and the occurrence count corresponding to that feature word;
determining a base position weight value according to the occurrence counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets;
determining the weight values of all sets other than the first set according to the base position weight value, preset weight ratio values, and all sets among the multiple sets other than the first set;
integrating the multiple sets with the weight values of all sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Optionally, dividing the target web page into multiple document parts according to the positional structure of the web page information includes:
dividing the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information.
Optionally, performing word segmentation on each of the multiple document parts, counting the segmentation results, and obtaining the multiple sets corresponding to the multiple document parts includes:
performing word segmentation on the title document part, merging synonyms in the segmentation result to obtain first feature words, counting the occurrences of each first feature word, and storing the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
performing word segmentation on the keyword document part, merging synonyms in the segmentation result to obtain second feature words, counting the occurrences of each second feature word, and storing the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
performing word segmentation on the body text document part, merging synonyms in the segmentation result to obtain third feature words, counting the occurrences of each third feature word, and storing the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
Optionally, determining the base position weight value according to the occurrence counts corresponding to the feature words in the first set includes:
determining the largest occurrence count among the counts of all feature words in the first set as the base position weight value.
Optionally, determining the weight values of all sets other than the first set according to the base position weight value, the preset weight ratio values, and all sets among the multiple sets other than the first set includes:
multiplying the product of the base position weight value and a first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position;
multiplying the product of the base position weight value and a second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
Optionally, integrating the multiple sets with the weight values of all sets other than the first set to obtain the feature vector of the target web page includes:
adding together the weight values that correspond to the same feature word across the multiple sets, sorting the summed weight values in descending order, and taking the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
An embodiment of the present invention provides a web page feature extraction device, the device including an acquiring unit, a processing unit, and a determination unit, where:
the acquiring unit is configured to obtain a target web page;
the processing unit is configured to divide the target web page into multiple document parts according to the positional structure of the web page information, and is further configured to perform word segmentation on each of the multiple document parts and count the segmentation results to obtain multiple sets corresponding to the multiple document parts, where each document part corresponds to one set, each set includes at least one data pair, and each data pair includes a feature word and the occurrence count corresponding to that feature word;
the determination unit is configured to determine a base position weight value according to the occurrence counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets, and is further configured to determine the weight values of all sets other than the first set according to the base position weight value, preset weight ratio values, and all sets among the multiple sets other than the first set;
the processing unit is further configured to integrate the multiple sets with the weight values of all sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Optionally, the processing unit is configured to divide the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information;
is further configured to perform word segmentation on the title document part, merge synonyms in the segmentation result to obtain first feature words, count the occurrences of each first feature word, and store the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
is further configured to perform word segmentation on the keyword document part, merge synonyms in the segmentation result to obtain second feature words, count the occurrences of each second feature word, and store the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
and is further configured to perform word segmentation on the body text document part, merge synonyms in the segmentation result to obtain third feature words, count the occurrences of each third feature word, and store the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
Optionally, the determination unit is configured to determine the largest occurrence count among the counts of all feature words in the first set as the base position weight value;
the processing unit is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
Optionally, the processing unit is further configured to add together the weight values that correspond to the same feature word across the multiple sets and to sort the summed weight values in descending order;
the determination unit is configured to take the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
Embodiments of the present invention provide a web page feature extraction method and device. The final position weight value of a feature word is determined from the preset weight ratio value of each basic part of the web page and from the highest occurrence count among the feature words counted in the page, which serves as the base position weight value that adjusts the weight of each basic part. This personalizes feature word extraction for each page and dynamically determines the position weight at each position of the page content. In this way, the quality of the feature extraction results is improved, the correctness of personality analysis data for Internet users is better ensured, and reasonable guidance is given for providing personalized services to Internet users.
Description of the drawings
Fig. 1 is a flow diagram of the web page feature extraction method provided by an embodiment of the present invention;
Fig. 2 is an example diagram of the web page feature extraction method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the web page feature extraction device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
The web page feature extraction method provided by the present invention uses position weights in web page feature extraction, combining the influence of two elements, position weight and frequency of occurrence, on the extracted feature vector. On the basis of extracting words that occur frequently and that distinguish the page from other pages across the network, the target web page is divided into multiple document parts according to the original positional structure of the web page information, and each document part is assigned a different weight ratio value. The occurrence count of the most frequent feature word in the page is taken as the base position weight value. The product of the two determines the position weight value for a word appearing at each position of the page, which dynamically adjusts how strongly a feature word characterizes the content of the target web page.
The present invention provides a web page feature extraction method. As shown in Fig. 1, the method may include:
Step 101: obtain a target web page, and divide the target web page into multiple document parts according to the positional structure of the web page information.
The method provided by the embodiment of the present invention is executed by a web page feature extraction device; that is, the device obtains the target web page and divides it into multiple document parts according to the positional structure of the web page information.
Specifically, as shown in Fig. 2, the web page feature extraction device may divide the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information.
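The split in step 101 can be sketched as follows, assuming the title comes from the `<title>` element, the keywords from a `<meta name="keywords">` tag, and the body text from the text inside `<body>`. The class name `PageSplitter`, the function `split_page`, and the sample page are hypothetical; the patent does not specify how the three parts are located.

```python
from html.parser import HTMLParser

class PageSplitter(HTMLParser):
    """Collect title text, meta keywords, and body text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.keywords = []
        self.body = []
        self._stack = []  # currently open tags that we track

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                self.keywords.append(attrs.get("content") or "")
        elif tag in ("title", "body", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to (and including) the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._stack:
            return
        current = self._stack[-1]
        if current == "title":
            self.title.append(data)
        elif current == "body":  # text inside script/style is skipped
            self.body.append(data)

def split_page(html):
    """Return the three document parts as plain strings."""
    parser = PageSplitter()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(" ".join(parser.title).split()),
        "meta": " ".join(parser.keywords),
        "content": " ".join(" ".join(parser.body).split()),
    }

page = """<html><head><title>Cheap data plans</title>
<meta name="keywords" content="data plan, mobile">
</head><body><h1>Plans</h1><p>Compare data plans.</p><script>var x = 1;</script></body></html>"""
parts = split_page(page)
```

The keys `title`, `meta`, and `content` mirror the set names used later in the description.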
Step 102: perform word segmentation on each of the multiple document parts and count the segmentation results to obtain multiple sets corresponding to the multiple document parts.
Each document part corresponds to one set, each set includes at least one data pair, and each data pair includes a feature word and the occurrence count corresponding to that feature word.
In one possible implementation, the web page feature extraction device performs word segmentation on the title document part, merges synonyms in the segmentation result to obtain first feature words, counts the occurrences of each first feature word, and stores the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
the device performs word segmentation on the keyword document part, merges synonyms in the segmentation result to obtain second feature words, counts the occurrences of each second feature word, and stores the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
the device performs word segmentation on the body text document part, merges synonyms in the segmentation result to obtain third feature words, counts the occurrences of each third feature word, and stores the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
Specifically, as shown in Fig. 2, the whole web page is structured by position into three document parts: the title (TITLE), the keywords (META), and the body text (CONTENT). Each of the three document parts is segmented with the ICTCLAS word segmenter, synonyms in the segmentation result are merged, and the number of occurrences of each word or phrase is counted. The results are stored as data pairs of the form $(p_{ij}, f_j)$ in the set vectors title, meta, and content respectively, where $p$ is a word or phrase, $f$ is the number of times the word or phrase occurs, $i$ is the code of the position where the phrase occurs, and $j$ is the order of appearance of the phrase at that position. title is the set corresponding to the title document part, meta is the set corresponding to the keyword document part, and content is the set corresponding to the body text document part.
Assuming that the total numbers of words in title, meta, and content are $l$, $m$, and $n$ respectively,
the content of the set title is: $\{(p_{t1}, f_1), (p_{t2}, f_2), \dots, (p_{tk}, f_k), \dots, (p_{tl}, f_l)\}$;
the content of the set meta is: $\{(p_{m1}, f_1), (p_{m2}, f_2), \dots, (p_{mk}, f_k), \dots, (p_{mm}, f_m)\}$;
the content of the set content is: $\{(p_{c1}, f_1), (p_{c2}, f_2), \dots, (p_{ck}, f_k), \dots, (p_{cn}, f_n)\}$.
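Building one such set can be sketched as below. This is a minimal illustration under stated assumptions: the whitespace tokenizer `segment` stands in for the ICTCLAS segmenter named in the text, and the `SYNONYMS` table and sample strings are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table mapping variants to one canonical word.
SYNONYMS = {"cellphone": "phone", "mobile": "phone"}

def segment(text):
    """Stand-in segmenter (assumption): lowercase and split on whitespace.
    The patent uses the ICTCLAS segmenter for this step."""
    return text.lower().split()

def count_features(text):
    """Return {feature word: occurrence count} after synonym merging."""
    merged = (SYNONYMS.get(tok, tok) for tok in segment(text))
    return Counter(merged)

title = count_features("Mobile phone plans")
meta = count_features("phone plans")
content = count_features(
    "Compare cellphone plans and phone prices before choosing plans"
)
```

Each resulting mapping plays the role of one set of (word, count) data pairs; here "mobile" and "cellphone" are both counted under "phone".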
Step 103: determine the base position weight value according to the occurrence counts corresponding to the feature words in the first set.
The first set is the set with the most data pairs among the multiple sets.
Specifically, the web page feature extraction device determines the largest occurrence count among the counts of all feature words in the first set as the base position weight value.
In the embodiment of the present invention, the weights of the different positions in the web page are distinguished, to reflect that words or phrases at different positions differ in how strongly they influence and characterize the main content of the page. Each word or phrase therefore needs to be assigned a weight according to its position.
Here, the weight of a word or phrase at the title position is defined as αB, the weight of a word or phrase at the keyword position is defined as βB, and the weight of a word or phrase at the body text position is defined as 1, where B is the base position weight value and α and β are the weight ratio values of the title position and the keyword position relative to the body text position. Under normal circumstances, […]; in this embodiment, α is 4 and β is 2, and α and β can be adjusted according to the actual situation.
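Here, the base position weight value B is calculated as follows, where $f_{ck}$ is the occurrence count of the k-th word or phrase in the set content:
$$B = \max\{f_{c1}, f_{c2}, \dots, f_{ck}, \dots, f_{cn}\} \tag{1}$$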
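Formula (1) can be sketched as follows, together with the rule that the first set is the one holding the most data pairs. The function names and the sample counts are hypothetical, chosen only to illustrate the calculation.

```python
def first_set_name(sets):
    """Pick the name of the set with the most data pairs (the 'first set')."""
    return max(sets, key=lambda name: len(sets[name]))

def base_position_weight(counts):
    """Formula (1): B is the largest occurrence count in the first set."""
    return max(counts.values())

# Hypothetical counts for the three document parts.
sets = {
    "title": {"phone": 1, "plan": 1},
    "meta": {"plan": 1, "mobile": 2},
    "content": {"phone": 5, "plan": 3, "price": 1},
}
name = first_set_name(sets)
B = base_position_weight(sets[name])
```

With these counts the body text set is the largest, so B is the count of its most frequent word.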
Step 104: determine the weight values of all sets other than the first set according to the base position weight value, the preset weight ratio values, and all sets among the multiple sets other than the first set.
Specifically, the web page feature extraction device multiplies the product of the base position weight value and the first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position. It also multiplies the product of the base position weight value and the second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
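After the base position weight value B is obtained from formula (1), the weight of each word or phrase in the sets title and meta is calculated:
$$W_t = \alpha B \cdot \{(p_{t1}, f_1), (p_{t2}, f_2), \dots, (p_{tk}, f_k), \dots, (p_{tl}, f_l)\} = \{(p_{t1}, \alpha B f_1), (p_{t2}, \alpha B f_2), \dots, (p_{tk}, \alpha B f_k), \dots, (p_{tl}, \alpha B f_l)\}$$
$$W_m = \beta B \cdot \{(p_{m1}, f_1), (p_{m2}, f_2), \dots, (p_{mk}, f_k), \dots, (p_{mm}, f_m)\} = \{(p_{m1}, \beta B f_1), (p_{m2}, \beta B f_2), \dots, (p_{mk}, \beta B f_k), \dots, (p_{mm}, \beta B f_m)\} \tag{2}$$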
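Formula (2) scales every count in the title and keyword sets by the ratio times B. A minimal sketch, using the embodiment's values α = 4 and β = 2; the function name and the sample counts are hypothetical.

```python
ALPHA = 4  # title-to-body weight ratio used in the embodiment
BETA = 2   # keyword-to-body weight ratio used in the embodiment

def position_weights(counts, ratio, base):
    """Formula (2): weight of each word = ratio * B * occurrence count."""
    return {word: ratio * base * f for word, f in counts.items()}

# Hypothetical counts and base position weight value.
title = {"phone": 1, "plan": 1}
meta = {"plan": 1, "mobile": 2}
B = 5

w_title = position_weights(title, ALPHA, B)
w_meta = position_weights(meta, BETA, B)
```

For example, a word occurring once in the title receives 4 × 5 × 1 = 20, while a word occurring twice in the keywords receives 2 × 5 × 2 = 20.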
Step 105: integrate the multiple sets with the weight values of all sets other than the first set to obtain the feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Specifically, the web page feature extraction device adds together the weight values that correspond to the same feature word across the multiple sets, sorts the summed weight values in descending order, and takes the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
For example, after the weight sets of the words or phrases in the three parts of the web page are obtained from formula (2), the feature item sets of the three parts and their weights are merged into a single feature item set. The merging rule is: the weights of identical feature items are added together, the items are sorted by feature weight in descending order, and the first n are chosen as the feature vector of the web page. The result has the form $T = \{t_1, \dots, t_i, \dots, t_n\}$, $w = \{w_1, \dots, w_i, \dots, w_n\}$, where $t_i$ is a feature word and $w_i$ is the weight value corresponding to feature word $t_i$. Here n can be adjusted dynamically according to the actual situation, T is the set of feature words of the web page, and w is the set of weight values of the feature words; the two correspond one to one.
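The merge in step 105 can be sketched as follows. One assumption is labeled in the code: since the body text position has weight 1, the body text set contributes its raw occurrence counts. The function name, the tie-breaking by word, and the sample values are also hypothetical.

```python
from collections import Counter

def feature_vector(w_title, w_meta, content_counts, n):
    """Sum weights of identical feature words, sort descending, keep the top n.

    Assumption: body-text words carry position weight 1, so their
    contribution is the raw occurrence count.
    """
    total = Counter()
    for weights in (w_title, w_meta, content_counts):
        total.update(weights)  # Counter.update adds values per key
    # Ties are broken alphabetically (an illustrative choice).
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    T = [word for word, _ in ranked]
    w = [weight for _, weight in ranked]
    return T, w

T, w = feature_vector(
    {"phone": 20, "plan": 20},
    {"plan": 10, "mobile": 20},
    {"phone": 5, "plan": 3, "price": 1},
    n=3,
)
```

Here "plan" sums to 20 + 10 + 3 = 33 and ranks first, ahead of "phone" at 25 and "mobile" at 20, while "price" falls outside the top three.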
The web page feature extraction method provided by the embodiment of the present invention is applicable to most Internet web page feature extraction processes. It does not require prior machine learning over a large number of web pages on the Internet, and it does not depend on the structure of the page. It dynamically determines the position weight at each position of the page content: the final position weight of a feature word is determined from the preset position weight ratio of each basic part of the page and from the highest word frequency counted in the page content, which serves as the position weight adjustment value for each basic part. This personalizes feature word extraction for each page. In this way, the quality of the feature extraction results is improved, the correctness of personality analysis data for Internet users is better ensured, and reasonable guidance is given for providing personalized services to Internet users.
In the prior art, if a word or phrase occurs frequently in one article and seldom in other articles, it is considered to have good class discrimination ability and to be suitable for classification. This has a limitation: the prior art treats all words or phrases in a web page alike, the only difference among them being the number of times they occur in the page, and uses that difference alone to decide whether a word is suitable as a feature word. However, a web page is semi-structured compared with an ordinary document, and the position where a feature word occurs carries its own weight. The position of a feature word can represent the characteristics of the page even better than its occurrence count. For example, a word or phrase that occurs in the page title generally summarizes the content and characterizes the page better than one that occurs in the body text, because the title is a feature already distilled by the author, whereas a word or phrase in the body text is only one of the many words that describe the page content in detail.
The web page feature extraction method provided by the embodiment of the present invention also presets position weight ratio values for some key positions of the web page, that is, the basic parts, such as the title position, the keyword position, and the body text position. These positions form the basic structure of a web page and can be said to be present in every page on the Internet, so no large-scale learning is needed for them. The occurrence count of the most frequent feature word in the whole page is counted and used as the base position weight value, which serves as the adjustment value. For each web page, the highest frequency of a feature word cannot be determined in advance. Finally, the product of the position weight ratio value and the base position weight value is used as the position weight value of the feature words at each position of the page, which achieves dynamic adjustment of the position weights of the page.
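The dynamic adjustment described above can be illustrated with two pages that share the same preset ratios but have different highest counts. This is a sketch under the embodiment's values α = 4 and β = 2; the function name and both sets of counts are hypothetical.

```python
ALPHA, BETA = 4, 2  # preset ratios from the embodiment

def position_weight_values(content_counts):
    """Per-position weight values for one page: ratio times that page's B."""
    B = max(content_counts.values())  # base position weight value
    return {"title": ALPHA * B, "meta": BETA * B, "content": 1}

# Hypothetical body-text counts for a short page and a long page.
short_page = position_weight_values({"phone": 3, "plan": 1})
long_page = position_weight_values({"phone": 40, "plan": 12})
```

The ratios are fixed in advance, but the resulting title and keyword weights scale with each page's own highest count, so a long page does not drown out its title words with body-text repetitions.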
The web page feature extraction method provided by the embodiment of the present invention can work independently of the page framework. In the early stage it screens the high-frequency words in the page content by word frequency, without extracting and processing them one by one according to the structural positions of the page, which to some extent remedies the excessive reliance of existing extraction algorithms on page structure. The weight value of each word is then adjusted dynamically according to the position of the word and the highest word frequency in the page, and this weight value is used as the criterion for extracting the feature words that best characterize the page content. This remedies the insensitivity of existing extraction algorithms to the position of feature words and balances the effects of frequency and position on feature words.
An embodiment of the present invention provides a web page feature extraction apparatus 30. As shown in Fig. 3, the apparatus includes an acquiring unit 301, a processing unit 302 and a determining unit 303, wherein:
the acquiring unit 301 is configured to obtain a target web page;
the processing unit 302 is configured to divide the target web page into multiple document parts according to the positional framework of the web page information, and is further configured to perform word segmentation on each of the document parts and count the segmentation results to obtain multiple sets corresponding to the document parts, wherein each document part corresponds to one set, each set includes at least one data pair, and each data pair includes a feature word and the number of occurrences of that feature word;
the determining unit 303 is configured to determine a base position weight value according to the occurrence counts of the feature words in a first set, the first set being the set with the most data pairs among the multiple sets; and is further configured to determine the weight values of all sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
the processing unit 302 is further configured to integrate the multiple sets with the weight values of all sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Further, the processing unit 302 is configured to divide the target web page into three document parts, namely title, keywords and body text, according to the positional framework of the web page information;
is further configured to perform word segmentation on the title document part, perform synonym merging on the segmentation results to obtain first feature words, count the number of occurrences of each first feature word, and store the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
is further configured to perform word segmentation on the keyword document part, perform synonym merging on the segmentation results to obtain second feature words, count the number of occurrences of each second feature word, and store the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
and is further configured to perform word segmentation on the body text document part, perform synonym merging on the segmentation results to obtain third feature words, count the number of occurrences of each third feature word, and store the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
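The segmentation, synonym merging and counting performed on each document part can be sketched as below. This is only a sketch under stated assumptions: the regular-expression tokenizer is a stand-in for a real word segmenter (Chinese text would need a dedicated one), and the `SYNONYMS` table, the function name `count_feature_words` and the sample page are hypothetical.

```python
import re
from collections import Counter

# Hypothetical synonym table mapping variants to one canonical feature word.
SYNONYMS = {"cellphone": "phone", "handset": "phone"}


def count_feature_words(text, synonyms=SYNONYMS):
    """Segment one document part, merge synonyms, and count occurrences."""
    # Stand-in segmenter (assumption): lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Synonym merging: replace each variant with its canonical form.
    merged = [synonyms.get(token, token) for token in tokens]
    # Each (word, count) item plays the role of one data pair.
    return Counter(merged)


page = {
    "title": "Cellphone plans",
    "keywords": "phone plans tariff",
    "body": "Compare phone plans. Each handset plan lists a tariff and the phone price.",
}
sets = {part: count_feature_words(text) for part, text in page.items()}
print(sets["body"]["phone"])
```

In the terminology of the text, `sets["body"]` corresponds to the first set, `sets["title"]` to the second set and `sets["keywords"]` to the third set.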
Further, the determining unit 303 is configured to determine the largest occurrence count among the counts of all feature words in the first set as the base position weight value;
the processing unit 302 is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
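The weight calculation for the title and keyword sets can be illustrated with a short sketch. The function name `weight_sets` and the default ratio values are assumptions made for the example; the text only states that the ratios are preset relative to the body text position.

```python
def weight_sets(title_counts, keyword_counts, body_counts,
                title_ratio=5.0, keyword_ratio=3.0):
    """Weight title and keyword counts by (base position weight * ratio)."""
    # Base position weight: largest count among body-text feature words.
    base = max(body_counts.values())
    title_weights = {
        word: base * title_ratio * count for word, count in title_counts.items()
    }
    keyword_weights = {
        word: base * keyword_ratio * count for word, count in keyword_counts.items()
    }
    return base, title_weights, keyword_weights


base, title_w, keyword_w = weight_sets(
    {"phone": 1, "plans": 1},               # second set (title)
    {"phone": 1, "tariff": 2},              # third set (keywords)
    {"phone": 3, "plans": 1, "tariff": 1},  # first set (body text)
)
print(base, title_w, keyword_w)
```

With a base value of 3, a single title occurrence is worth 15.0 here, so one mention in the title outweighs every body-text count in this small page.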
Further, the processing unit 302 is further configured to add together the weight values of identical feature words across the multiple sets and sort the summed weight values in descending order;
the determining unit 303 is configured to determine the first n weight values after sorting, together with their corresponding feature words, as the feature vector of the target web page, where n is a natural number.
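The integration step, summing the weights of identical feature words and keeping the top n, might look like the sketch below. Two points are assumptions rather than statements from the text: body-text feature words contribute their raw occurrence counts, and ties are broken alphabetically. The function name `feature_vector` is also hypothetical.

```python
from collections import Counter


def feature_vector(weighted_sets, n):
    """Sum weights of identical feature words and keep the n largest."""
    totals = Counter()
    for weights in weighted_sets:
        totals.update(weights)  # Counter.update adds values key by key
    # Sort by weight descending; alphabetical tie-break is an assumption.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


vector = feature_vector(
    [
        {"phone": 15.0, "plans": 15.0},         # title weights
        {"phone": 9.0, "tariff": 18.0},         # keyword weights
        {"phone": 3, "plans": 1, "tariff": 1},  # body set, raw counts (assumption)
    ],
    n=2,
)
print(vector)
```

The result is a list of (feature word, weight) pairs, which is one plausible representation of the feature vector that downstream feature analysis would consume.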
Specifically, for an understanding of the web page feature extraction apparatus provided by the embodiment of the present invention, reference may be made to the description of the web page feature extraction method embodiment above, which is not repeated here.
The web page feature extraction apparatus provided by the embodiment of the present invention can work independently of the page framework. In the early stage it screens the high-frequency words in the page content by word frequency, without extracting and processing them one by one according to the structural positions of the page, which to some extent remedies the excessive reliance of existing extraction algorithms on page structure. The weight value of each word is then adjusted dynamically according to the position of the word and the highest word frequency in the page, and this weight value is used as the criterion for extracting the feature words that best characterize the page content. This remedies the insensitivity of existing extraction algorithms to the position of feature words and balances the effects of frequency and position on feature words.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.

Claims (10)

1. A web page feature extraction method, characterized in that the method comprises:
obtaining a target web page, and dividing the target web page into multiple document parts according to the positional framework of the web page information;
performing word segmentation on each of the multiple document parts and counting the segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each of the multiple sets includes at least one data pair, and each data pair includes a feature word and the number of occurrences of that feature word;
determining a base position weight value according to the occurrence counts of the feature words in a first set, the first set being the set with the most data pairs among the multiple sets;
determining the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
integrating the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
2. The method according to claim 1, characterized in that dividing the target web page into multiple document parts according to the positional framework of the web page information comprises:
dividing the target web page into three document parts, namely title, keywords and body text, according to the positional framework of the web page information.
3. The method according to claim 2, characterized in that performing word segmentation on each of the multiple document parts and counting the segmentation results to obtain multiple sets corresponding to the multiple document parts comprises:
performing word segmentation on the title document part, performing synonym merging on the segmentation results to obtain first feature words, counting the number of occurrences of each first feature word, and storing the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
performing word segmentation on the keyword document part, performing synonym merging on the segmentation results to obtain second feature words, counting the number of occurrences of each second feature word, and storing the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
performing word segmentation on the body text document part, performing synonym merging on the segmentation results to obtain third feature words, counting the number of occurrences of each third feature word, and storing the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
4. The method according to claim 1 or 3, characterized in that determining the base position weight value according to the occurrence counts of the feature words in the first set comprises:
determining the largest occurrence count among the counts of all feature words in the first set as the base position weight value.
5. The method according to claim 3, characterized in that determining the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set comprises:
multiplying the product of the base position weight value and a first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position;
multiplying the product of the base position weight value and a second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
6. The method according to claim 1, characterized in that integrating the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain the feature vector of the target web page comprises:
adding together the weight values of identical feature words across the multiple sets, sorting the summed weight values in descending order, and determining the first n weight values after sorting, together with their corresponding feature words, as the feature vector of the target web page, where n is a natural number.
7. A web page feature extraction apparatus, characterized in that the apparatus comprises an acquiring unit, a processing unit and a determining unit, wherein:
the acquiring unit is configured to obtain a target web page;
the processing unit is configured to divide the target web page into multiple document parts according to the positional framework of the web page information, and is further configured to perform word segmentation on each of the multiple document parts and count the segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each of the multiple sets includes at least one data pair, and each data pair includes a feature word and the number of occurrences of that feature word;
the determining unit is configured to determine a base position weight value according to the occurrence counts of the feature words in a first set, the first set being the set with the most data pairs among the multiple sets; and is further configured to determine the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
the processing unit is further configured to integrate the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
8. The apparatus according to claim 7, characterized in that
the processing unit is configured to divide the target web page into three document parts, namely title, keywords and body text, according to the positional framework of the web page information;
is further configured to perform word segmentation on the title document part, perform synonym merging on the segmentation results to obtain first feature words, count the number of occurrences of each first feature word, and store the first feature words and their occurrence counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
is further configured to perform word segmentation on the keyword document part, perform synonym merging on the segmentation results to obtain second feature words, count the number of occurrences of each second feature word, and store the second feature words and their occurrence counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
and is further configured to perform word segmentation on the body text document part, perform synonym merging on the segmentation results to obtain third feature words, count the number of occurrences of each third feature word, and store the third feature words and their occurrence counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
9. The apparatus according to claim 7 or 8, characterized in that the determining unit is configured to determine the largest occurrence count among the counts of all feature words in the first set as the base position weight value;
the processing unit is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the occurrence count of each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the occurrence count of each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
10. The apparatus according to claim 7, characterized in that
the processing unit is further configured to add together the weight values of identical feature words across the multiple sets and sort the summed weight values in descending order;
the determining unit is configured to determine the first n weight values after sorting, together with their corresponding feature words, as the feature vector of the target web page, where n is a natural number.
CN201611137455.3A 2016-12-12 2016-12-12 Webpage feature extraction method and device Active CN108614825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611137455.3A CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611137455.3A CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Publications (2)

Publication Number Publication Date
CN108614825A true CN108614825A (en) 2018-10-02
CN108614825B CN108614825B (en) 2022-04-15

Family

ID=63657508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611137455.3A Active CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Country Status (1)

Country Link
CN (1) CN108614825B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079031A (en) * 2006-06-15 2007-11-28 腾讯科技(深圳)有限公司 Web page subject extraction system and method
CN101246498A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 News web page searching method
CN103064969A (en) * 2012-12-31 2013-04-24 武汉传神信息技术有限公司 Method for automatically creating keyword index table
US20130246184A1 (en) * 2012-03-13 2013-09-19 PowerLinks Media Limited Method and system for displaying a contextual advertisement on a webpage
CN103617213A (en) * 2013-11-19 2014-03-05 北京奇虎科技有限公司 Method and system for identifying newspage attributive characters
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection
US20140143225A1 (en) * 2012-11-21 2014-05-22 Hon Hai Precision Industry Co., Ltd. Web searching method, system, and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘琼琼等: "面向网页的主题概念挖掘", 《计算机科学》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989790A (en) * 2021-03-17 2021-06-18 中国科学院深圳先进技术研究院 Document characterization method and device based on deep learning, equipment and storage medium
CN112989790B (en) * 2021-03-17 2023-02-28 中国科学院深圳先进技术研究院 Document characterization method and device based on deep learning, equipment and storage medium
CN114528811A (en) * 2022-01-21 2022-05-24 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115858470A (en) * 2022-12-26 2023-03-28 深圳市中政汇智管理咨询有限公司 Policy and regulation file matching method, system, server and storage medium
CN115858470B (en) * 2022-12-26 2023-09-22 深圳市中政汇智管理咨询有限公司 Policy and regulation file matching method, system, server and storage medium

Also Published As

Publication number Publication date
CN108614825B (en) 2022-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310012 building A01, 1600 yuhangtang Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.

Applicant after: China Mobile Communications Corp.

Address before: 310012, No. 14, building three, Chang Torch Hotel, No. 259, Wensanlu Road, Xihu District, Zhejiang, Hangzhou

Applicant before: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: China Mobile Communications Corp.

GR01 Patent grant