CN108614825A - Web page feature extraction method and device - Google Patents
Web page feature extraction method and device
- Publication number
- CN108614825A, CN201611137455.3A
- Authority
- CN
- China
- Prior art keywords
- feature
- word
- weighted value
- value
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
An embodiment of the invention discloses a web page feature extraction method and device. A target web page is divided into multiple document parts according to the positional structure of the web page information. Word segmentation is performed on each document part and the segmentation results are counted, yielding multiple sets corresponding to the document parts. A base position weight value is determined from the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets. The weight values of all sets other than the first set are determined from the base position weight value, preset weight ratio values, and those sets. The multiple sets and the weight values of all sets other than the first set are then integrated to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Description
Technical Field
The present invention relates to feature extraction technology in the Internet field, and in particular to a web page feature extraction method and device.
Background
Web page feature extraction is one of the key technologies for data analysis of web page content, and an important step in personality analysis of Internet users and in personalized service recommendation. The quality of web page feature extraction directly affects the quality of the user personality analysis results, and in turn the quality of the personalized services provided to users. The extraction process is sensitive to the structure of the page, the richness of its vocabulary, and synonymy among words. An extraction algorithm must account for the influence of these factors on the result, avoid interference from other factors, and extract the feature words that best characterize the page content.
In the prior art, web page feature extraction algorithms are mainly built on the term frequency-inverse document frequency (TF-IDF) algorithm and on extraction techniques based on the Document Object Model (DOM) tree. TF-IDF is a weighting technique commonly used in information retrieval and data mining. It evaluates the importance of a word in a web page from the number of times the word occurs in the file and the number of files across the whole network that contain the word, and screens the feature words of the page using that importance as the criterion. DOM-tree-based extraction extracts data from Hypertext Markup Language (HTML) web pages according to the tree-like hierarchical structure of HTML pages, and extracts the feature words of the page by optimizing the page's feature vector. Feature words obtained with DOM-tree-based extraction have relatively high precision and recall.
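The TF-IDF weighting described above can be illustrated with a minimal sketch. The formulation used here (relative term frequency multiplied by the logarithm of the corpus size over the document frequency) is one common textbook variant, assumed for illustration; the patent text does not specify a formula, and the function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each term of one document by TF * IDF.

    doc_tokens: list of tokens of the document being scored.
    corpus: list of token lists, one per document (assumed to include doc_tokens).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total  # relative frequency of the term in this document
        df = sum(1 for other in corpus if term in other)  # documents containing the term
        idf = math.log(len(corpus) / df) if df else 0.0
        scores[term] = tf * idf
    return scores


# Hypothetical three-document corpus.
corpus = [["web", "page", "feature"], ["web", "news"], ["web", "sports"]]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus the word "web" occurs in every document and therefore scores zero, while "feature" occurs in only one and scores above zero. Note that the score does not depend on where in the page a word occurs.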
However, the calculation of term weights in TF-IDF is flawed. An HTML document differs greatly in structure from an ordinary document: it is semi-structured text, and feature words at different positions in the document characterize the article to different degrees, so the weights assigned to them should differ as well. Simply applying the IDF calculation is therefore neither scientific nor comprehensive. TF-IDF also discriminates poorly between classes: it can distinguish a feature item only between the text and the class the text belongs to, and cannot express well the difference between the feature item and other classes. DOM-tree-based extraction depends too heavily on page structure. DOM techniques extract data from HTML pages according to their tree-like hierarchical structure, and the feature words obtained have relatively high precision and recall. The technique needs several corresponding sample pages and is therefore applicable to different knowledge domains, but because of its excessive dependence on structure it is easily left at a disadvantage when the page structure changes. In summary, each of the two basic methods has its limitations: insensitivity to the position of feature words, and excessive dependence on page structure.
Summary of the Invention
To solve the above technical problems, embodiments of the present invention provide a web page feature extraction method and device that improve the quality of web page feature extraction results and ensure the correctness of personality analysis data for Internet users.
The technical solution of the present invention is implemented as follows:
An embodiment of the present invention provides a web page feature extraction method, the method comprising:
obtaining a target web page, and dividing the target web page into multiple document parts according to the positional structure of the web page information;
performing word segmentation on each of the multiple document parts and counting the segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each set contains at least one data pair, and each data pair comprises a feature word and the count corresponding to that feature word;
determining a base position weight value according to the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets;
determining the weight values of all sets other than the first set according to the base position weight value, preset weight ratio values, and all sets among the multiple sets other than the first set;
integrating the multiple sets with the weight values of all sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Optionally, dividing the target web page into multiple document parts according to the positional structure of the web page information comprises:
dividing the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information.
Optionally, performing word segmentation on each of the multiple document parts and counting the segmentation results to obtain multiple sets corresponding to the multiple document parts comprises:
performing word segmentation on the title document part, merging synonyms in the segmentation results to obtain first feature words, counting the count corresponding to each first feature word, and storing the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words comprising at least one feature word;
performing word segmentation on the keyword document part, merging synonyms in the segmentation results to obtain second feature words, counting the count corresponding to each second feature word, and storing the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words comprising at least one feature word;
performing word segmentation on the body text document part, merging synonyms in the segmentation results to obtain third feature words, counting the count corresponding to each third feature word, and storing the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words comprising at least one feature word.
Optionally, determining the base position weight value according to the counts corresponding to the feature words in the first set comprises:
determining the largest of the counts corresponding to all feature words in the first set as the base position weight value.
Optionally, determining the weight values of all sets other than the first set according to the base position weight value, the preset weight ratio values, and all sets among the multiple sets other than the first set comprises:
multiplying the product of the base position weight value and a first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position;
multiplying the product of the base position weight value and a second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
Optionally, integrating the multiple sets with the weight values of all sets other than the first set to obtain the feature vector of the target web page comprises:
adding together the weight values corresponding to the same feature word across the multiple sets, sorting the summed weight values in descending order, and determining the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
An embodiment of the present invention provides a web page feature extraction device, the device comprising an acquiring unit, a processing unit, and a determining unit, wherein:
the acquiring unit is configured to obtain a target web page;
the processing unit is configured to divide the target web page into multiple document parts according to the positional structure of the web page information, and is further configured to perform word segmentation on each of the multiple document parts and count the segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each set contains at least one data pair, and each data pair comprises a feature word and the count corresponding to that feature word;
the determining unit is configured to determine a base position weight value according to the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets, and is further configured to determine the weight values of all sets other than the first set according to the base position weight value, preset weight ratio values, and all sets among the multiple sets other than the first set;
the processing unit is further configured to integrate the multiple sets with the weight values of all sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Optionally, the processing unit is configured to divide the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information;
is further configured to perform word segmentation on the title document part, merge synonyms in the segmentation results to obtain first feature words, count the count corresponding to each first feature word, and store the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words comprising at least one feature word;
is further configured to perform word segmentation on the keyword document part, merge synonyms in the segmentation results to obtain second feature words, count the count corresponding to each second feature word, and store the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words comprising at least one feature word;
and is further configured to perform word segmentation on the body text document part, merge synonyms in the segmentation results to obtain third feature words, count the count corresponding to each third feature word, and store the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words comprising at least one feature word.
Optionally, the determining unit is configured to determine the largest of the counts corresponding to all feature words in the first set as the base position weight value;
the processing unit is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
Optionally, the processing unit is further configured to add together the weight values corresponding to the same feature word across the multiple sets and sort the summed weight values in descending order;
the determining unit is configured to determine the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
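The division of labour among the three units can be pictured with a rough sketch, offered only as one possible arrangement under stated assumptions. The class names, the in-memory page store, the whitespace tokenization standing in for word segmentation, and the ratio values 4 and 2 are all illustrative choices, and the page is assumed to arrive already divided into `title`, `meta`, and `content` text.

```python
from collections import Counter


class AcquiringUnit:
    """Obtains the target page; here from a hypothetical in-memory store."""

    def __init__(self, store):
        self.store = store

    def acquire(self, page_id):
        return self.store[page_id]


class ProcessingUnit:
    """Counts words per document part and integrates the weighted sets."""

    def count_parts(self, page):
        # Whitespace splitting stands in for real word segmentation.
        return {part: Counter(text.lower().split()) for part, text in page.items()}

    def integrate(self, content_counts, weighted_sets, n):
        total = Counter(content_counts)
        for weights in weighted_sets:
            total.update(weights)  # adds weights of identical feature words
        return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


class DeterminingUnit:
    """Determines the base position weight and the per-position weights."""

    def __init__(self, ratios):
        self.ratios = ratios  # illustrative preset weight ratio values

    def base_weight(self, content_counts):
        return max(content_counts.values())

    def weigh(self, sets, base):
        return [
            {word: ratio * base * count for word, count in sets[part].items()}
            for part, ratio in self.ratios.items()
        ]


def extract_features(page_id, store, n=3):
    page = AcquiringUnit(store).acquire(page_id)
    processing = ProcessingUnit()
    determining = DeterminingUnit({"title": 4, "meta": 2})
    sets = processing.count_parts(page)
    base = determining.base_weight(sets["content"])
    weighted = determining.weigh(sets, base)
    return processing.integrate(sets["content"], weighted, n)


store = {"p1": {"title": "cheap flight",
                "meta": "flight travel",
                "content": "flight hotel flight price flight"}}
features = extract_features("p1", store)
# features == [("flight", 21), ("cheap", 12), ("travel", 6)]
```

In this toy page the body text count of "flight" is 3, so the base value is 3; the title occurrence contributes 4 x 3 x 1 = 12 and the keyword occurrence 2 x 3 x 1 = 6, giving 21 in total.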
Embodiments of the present invention provide a web page feature extraction method and device. The position weight value of a feature word is finally determined from the predetermined preset weight ratio value of each basic position of the web page, adjusted by a base position weight value taken from the highest count among the page's feature words. This personalizes feature word extraction for each page and determines the position weight of each position of the page content dynamically. In this way, the quality of web page feature extraction results is improved, the correctness of personality analysis data for Internet users is ensured, and sound guidance is provided for delivering personalized services to Internet users.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the web page feature extraction method provided by an embodiment of the present invention;
Fig. 2 is an example diagram of the web page feature extraction method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the web page feature extraction device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
The web page feature extraction method provided by the present invention uses position weights in web page feature extraction, combining the influence of two factors, position weight and frequency of occurrence, on the extracted feature vector. On the basis of extracting vocabulary that is high-frequency and distinguishes the page from other pages across the network, the target web page is divided into multiple document parts according to the original positional structure of the web page information, and each document part is assigned a different weight ratio value. The count of the feature word that occurs most often in the page serves as the base position weight value, and the product of the two gives the position weight value of words at each position in the page, so that the ability of feature words to characterize the target page content is adjusted dynamically.
The present invention provides a web page feature extraction method. As shown in Fig. 1, the method may include:
Step 101: obtain a target web page, and divide the target web page into multiple document parts according to the positional structure of the web page information.
The method provided by the embodiment of the present invention is executed by a web page feature extraction device; that is, the device obtains the target web page and divides it into multiple document parts according to the positional structure of the web page information.
Specifically, as shown in Fig. 2, the web page feature extraction device may divide the target web page into three document parts, namely title, keywords, and body text, according to the positional structure of the web page information.
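One way to picture the division into title, keyword, and body text parts is the following minimal sketch based on Python's standard `html.parser` module. It assumes the title comes from the `<title>` element, the keywords from a `<meta name="keywords">` tag, and the body text from the text inside `<body>`; the source does not state how the parts are located, so these choices, along with the names `PageSplitter` and `split_page`, are illustrative assumptions.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collects title, keyword, and body text while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.parts = {"title": [], "meta": [], "content": []}
        self._current = None  # part that incoming text belongs to
        self._skip = 0        # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "body":
            self._current = "content"
        elif tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            self.parts["meta"].append(attr.get("content") or "")
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("title", "body"):
            self._current = None
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._current and not self._skip and data.strip():
            self.parts[self._current].append(data.strip())


def split_page(html):
    """Return the three document parts of a page as plain strings."""
    splitter = PageSplitter()
    splitter.feed(html)
    splitter.close()
    return {name: " ".join(chunks) for name, chunks in splitter.parts.items()}


page = ("<html><head><title>Cheap flights</title>"
        '<meta name="keywords" content="flights, travel"></head>'
        "<body><p>Book cheap flights today.</p><script>var x = 1;</script></body></html>")
parts = split_page(page)
```

Script and style content is skipped here so that it does not pollute the body text part; whether the patented method does the same is not stated.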
Step 102: perform word segmentation on each of the multiple document parts and count the segmentation results to obtain multiple sets corresponding to the multiple document parts.
Each document part corresponds to one set, each set contains at least one data pair, and each data pair comprises a feature word and the count corresponding to that feature word.
In one possible implementation, the web page feature extraction device performs word segmentation on the title document part, merges synonyms in the segmentation results to obtain first feature words, counts the count corresponding to each first feature word, and stores the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words comprising at least one feature word;
the web page feature extraction device performs word segmentation on the keyword document part, merges synonyms in the segmentation results to obtain second feature words, counts the count corresponding to each second feature word, and stores the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words comprising at least one feature word;
the web page feature extraction device performs word segmentation on the body text document part, merges synonyms in the segmentation results to obtain third feature words, counts the count corresponding to each third feature word, and stores the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words comprising at least one feature word.
Specifically, as shown in Fig. 2, the whole web page is structured: the target web page is divided by position into three document parts, the title (TITLE), the keywords (META), and the body text (CONTENT). Each of the three document parts is segmented with the ICTCLAS word segmenter, synonyms in the segmentation results are merged, and the number of occurrences of each word or phrase is counted. The results are stored as data pairs of the form $(p_{ij}, f_j)$ in the set vectors title, meta, and content respectively, where $p$ is a word or phrase, $f$ is the number of times the word or phrase occurs, $i$ is the code of the position at which the phrase occurs, and $j$ is the order of appearance of the phrase at that position. title is the set corresponding to the title document part, meta is the set corresponding to the keyword document part, and content is the set corresponding to the body text document part.
Assume that the total numbers of words in title, meta, and content are $l$, $m$, and $n$ respectively. Then:
the content of the set title is: $\{(p_{t1}, f_1), (p_{t2}, f_2), \ldots, (p_{tk}, f_k), \ldots, (p_{tl}, f_l)\}$;
the content of the set meta is: $\{(p_{m1}, f_1), (p_{m2}, f_2), \ldots, (p_{mk}, f_k), \ldots, (p_{mm}, f_m)\}$;
the content of the set content is: $\{(p_{c1}, f_1), (p_{c2}, f_2), \ldots, (p_{ck}, f_k), \ldots, (p_{cn}, f_n)\}$.
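The construction of one such set of (word, count) data pairs can be sketched as follows. ICTCLAS is a Chinese word segmenter and is not called here; a simple regular-expression tokenizer stands in for it, and the synonym table `SYNONYMS` is a hypothetical example, so the sketch shows only the merge-and-count step under those assumptions.

```python
import re
from collections import Counter

# Hypothetical synonym table: each variant maps to one canonical feature word.
SYNONYMS = {"airfare": "flight", "flights": "flight"}


def count_feature_words(text, synonyms=SYNONYMS):
    """Tokenize one document part, merge synonyms, and count occurrences."""
    tokens = re.findall(r"\w+", text.lower())  # stand-in for word segmentation
    merged = [synonyms.get(token, token) for token in tokens]
    return Counter(merged)  # maps feature word p to its count f


title_set = count_feature_words("Flights and airfare: cheap flights")
# title_set["flight"] == 3, title_set["cheap"] == 1
```

Applying the same function to the title, keyword, and body text parts would give the three sets title, meta, and content.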
Step 103: determine the base position weight value according to the counts corresponding to the feature words in the first set.
The first set is the set with the most data pairs among the multiple sets.
Specifically, the web page feature extraction device determines the largest of the counts corresponding to all feature words in the first set as the base position weight value.
The embodiment of the present invention distinguishes the weight carried by each position in the web page, reflecting that words or phrases at different positions differ in their influence on, and ability to characterize, the main content of the page. Weights therefore need to be assigned separately to the words or phrases at each position.
Here, the weight of a word or phrase at the title position is defined as $\alpha B$, the weight of a word or phrase at the keyword position is defined as $\beta B$, and the weight of a word or phrase at the body text position is defined as 1, where $B$ is the base weight value and $\alpha$ and $\beta$ are the weight ratio values of the title position and the keyword position relative to the body text position. Under normal circumstances […]; in this embodiment, $\alpha$ is 4 and $\beta$ is 2, and both can be adjusted according to the actual situation.
Here, the base position weight value $B$ is calculated as:
$$B = \max\{f_{c1}, f_{c2}, \ldots, f_{ck}, \ldots, f_{cn}\} \tag{1}$$
where $f_{ck}$ is the count of the $k$-th word or phrase in the set content.
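As a worked illustration of formula (1) with hypothetical counts (the words and numbers are invented for the example), suppose the set content holds three data pairs:

$$\text{content} = \{(\text{flight}, 5), (\text{hotel}, 3), (\text{price}, 2)\}$$

Then the base position weight value is the largest count:

$$B = \max\{5, 3, 2\} = 5$$

A page whose most frequent body text word occurred 40 times would instead have $B = 40$, so $B$ scales with each individual page.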
Step 104: determine the weight values of all sets other than the first set according to the base position weight value, the preset weight ratio values, and all sets among the multiple sets other than the first set.
Specifically, the web page feature extraction device multiplies the product of the base position weight value and the first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and multiplies the product of the base position weight value and the second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
With the base position weight value $B$ obtained from formula (1), the weight of each word or phrase in the sets title and meta is calculated as:
$$
\begin{aligned}
W_t &= \alpha B \cdot \{(p_{t1}, f_1), (p_{t2}, f_2), \ldots, (p_{tk}, f_k), \ldots, (p_{tl}, f_l)\} \\
    &= \{(p_{t1}, \alpha B f_1), (p_{t2}, \alpha B f_2), \ldots, (p_{tk}, \alpha B f_k), \ldots, (p_{tl}, \alpha B f_l)\} \\
W_m &= \beta B \cdot \{(p_{m1}, f_1), (p_{m2}, f_2), \ldots, (p_{mk}, f_k), \ldots, (p_{mm}, f_m)\} \\
    &= \{(p_{m1}, \beta B f_1), (p_{m2}, \beta B f_2), \ldots, (p_{mk}, \beta B f_k), \ldots, (p_{mm}, \beta B f_m)\}
\end{aligned} \tag{2}
$$
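Formula (2) amounts to scaling every count in the title and keyword sets by a constant. A minimal sketch follows, using the ratio values 4 and 2 given for this embodiment; the function name `position_weights` and the sample counts are hypothetical.

```python
ALPHA, BETA = 4, 2  # weight ratio values used in this embodiment


def position_weights(counts, ratio, base):
    """Weight of each word: ratio * base * count, as in formula (2)."""
    return {word: ratio * base * count for word, count in counts.items()}


# Hypothetical counted sets for one page.
content = {"flight": 5, "hotel": 3, "price": 2}
title = {"flight": 1, "deal": 1}
meta = {"flight": 1, "travel": 2}

B = max(content.values())                    # formula (1): B = 5
w_title = position_weights(title, ALPHA, B)  # {"flight": 20, "deal": 20}
w_meta = position_weights(meta, BETA, B)     # {"flight": 10, "travel": 20}
```

A single title occurrence here outweighs the most frequent body text word (20 against 5), which is the effect the position weighting is meant to produce.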
Step 105: integrate the multiple sets with the weight values of all sets other than the first set to obtain the feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
Specifically, the web page feature extraction device adds together the weight values corresponding to the same feature word across the multiple sets, sorts the summed weight values in descending order, and determines the first n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target web page, where n is a natural number.
For example, after the weight sets of the words or phrases in the three parts of the page have been obtained from formula (2), the feature item sets of the three parts and their weights are integrated into a single feature item set. The integration rule is: the weights of identical feature items are added together, the items are sorted in descending order of feature weight, and the first n are chosen as the feature vector of the web page. The result takes the form $T = \{t_1, \ldots, t_i, \ldots, t_n\}$, $w = \{w_1, \ldots, w_i, \ldots, w_n\}$, where $t_i$ is a feature word and $w_i$ is the weight value corresponding to feature word $t_i$. Here n can be adjusted dynamically according to the actual situation, $T$ is the set of feature words of the web page, and $w$ is the set of weight values of the feature words; the two correspond one to one.
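The integration rule can be sketched as follows. The sketch assumes that body text words enter the sum with their raw counts (position weight 1), which is how the definitions above are read here, and that ties are broken alphabetically, a choice the source does not specify. The function name and sample values are hypothetical.

```python
from collections import Counter


def feature_vector(content_counts, title_weights, meta_weights, n):
    """Sum the weights of identical feature words and keep the top n."""
    total = Counter(content_counts)  # body text words: weight 1 * count (assumption)
    total.update(title_weights)      # Counter.update adds values per key
    total.update(meta_weights)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    T = [word for word, _ in ranked]      # feature words
    w = [weight for _, weight in ranked]  # corresponding weight values
    return T, w


T, w = feature_vector(
    {"flight": 5, "hotel": 3, "price": 2},  # content counts
    {"flight": 20, "deal": 20},             # title weights
    {"flight": 10, "travel": 20},           # keyword weights
    n=3,
)
# T == ["flight", "deal", "travel"], w == [35, 20, 20]
```

The two lists are returned in matching order, mirroring the one-to-one correspondence between T and w.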
The web page feature extraction method provided by the embodiment of the present invention is applicable to most Internet web page feature extraction processes. It requires no prior machine learning over large numbers of web pages on the Internet and does not depend on the structure of the page. It determines the position weight of each position of the page content dynamically: the position weight of a feature word is finally determined from the predetermined position weight ratio of each basic position of the page, adjusted by the highest frequency among the page's content words. This personalizes feature word extraction for each page. In this way, the quality of web page feature extraction results is improved, the correctness of personality analysis data for Internet users is ensured, and sound guidance is provided for delivering personalized services to Internet users.
In the prior art, if a word or phrase occurs frequently in one article and rarely in other articles, it is considered to have good class discrimination ability and to be suitable for classification. This approach has limitations: the prior art treats all words or phrases in a web page alike, the only difference being the number of times each occurs in the page, and uses that difference to decide whether a word is suitable as a feature word of the page. A web page, however, unlike an ordinary document, is semi-structured. The position at which a feature word occurs carries its own weight, and position can represent the characteristics of the page even better than the count does. For example, words or phrases in the page title summarize the content and characterize the page better than words or phrases in the body text, because the title consists of features already refined by the author, whereas a word or phrase in the body text is only one of the many words describing the page content in detail.
The web page feature extraction method provided by the embodiment of the present invention also presets position weight ratio values for certain key positions of the page that serve as basic positions, such as the title position, the keyword position, and the body text position. These positions belong to the basic structure of a web page and can be said to be present in every web page on the Internet, so no large-scale learning is needed for them. The count of the feature word that occurs most often in the page is computed and used as the base position weight value for adjustment. For any given page the highest feature word frequency cannot be predetermined, so the product of the position weight ratio value and the base position weight value is finally used as the position weight value of the feature words at each position of the page, which achieves dynamic adjustment of the position weights.
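The dynamic adjustment described above can be made concrete with a small sketch: the same preset ratio yields different title weights on different pages because the base value is taken from each page's own counts. The function name and the two sample pages are hypothetical; the ratio 4 is the title ratio value given in this embodiment.

```python
def title_word_weight(content_counts, title_count, alpha=4):
    """Weight of a title word on a page: alpha * B * count, with B taken per page."""
    base = max(content_counts.values())  # highest count on this particular page
    return alpha * base * title_count


short_page = {"flight": 2, "hotel": 1}   # hypothetical short page
long_page = {"flight": 40, "hotel": 25}  # hypothetical long page

short_weight = title_word_weight(short_page, 1)  # 4 * 2 * 1 = 8
long_weight = title_word_weight(long_page, 1)    # 4 * 40 * 1 = 160
```

With a fixed title weight instead, one title occurrence on the long page would be swamped by body text counts of 40; scaling by the page's own maximum keeps the title word ahead on both pages.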
The web page feature extraction method provided by the embodiment of the present invention can be detached from the structure of the page. At an early stage, high-frequency words in the page content are screened by word frequency, without extracting and processing each structural position of the page one by one, which to some extent remedies the defect that existing extraction algorithms depend too heavily on page structure. The weight value of each word is adjusted dynamically according to the position of the word and the highest frequency in the page, and this weight value is the criterion for extracting the feature words that best characterize the page content. This remedies the insensitivity of existing extraction algorithms to the position of feature words and balances the effects of high frequency and position on feature words.
An embodiment of the present invention provides a web page feature extraction device 30. As shown in Fig. 3, the device includes an acquiring unit 301, a processing unit 302 and a determining unit 303, wherein:
The acquiring unit 301 is configured to obtain a target web page;
The processing unit 302 is configured to divide the target web page into multiple document parts according to the positional structure of the web page information, and is further configured to perform word segmentation on each of the multiple document parts and count the word segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each of the multiple sets includes at least one data pair, and each data pair includes a feature word and the count corresponding to that feature word;
The determining unit 303 is configured to determine a base position weight value according to the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets, and is further configured to determine the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
The processing unit 302 is further configured to integrate the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain a feature vector of the target web page, so that feature analysis can be performed on the web page according to the feature vector.
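As a minimal sketch of the data these units operate on, assume each set is held as a Python `Counter` that maps a feature word to its count, so each entry is one data pair. The names `sets` and `pick_first_set` and the sample words are hypothetical and are not taken from the patent. The first set can then be selected as the set with the most data pairs:

```python
from collections import Counter

# Hypothetical sets, one per document part; each entry is a
# (feature word, count) data pair.
sets = {
    "title": Counter({"patent": 1, "search": 1}),
    "keyword": Counter({"patent": 1, "web": 1, "search": 1}),
    "body": Counter({"patent": 7, "web": 4, "search": 3, "page": 2}),
}


def pick_first_set(sets):
    """Return the name of the set that holds the most data pairs."""
    return max(sets, key=lambda name: len(sets[name]))


first = pick_first_set(sets)
```

In practice the body text usually yields the most distinct feature words, which is consistent with the body text set being treated as the first set in this document.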
Further, the processing unit 302 is configured to divide the target web page into three document parts, namely a title, keywords and body text, according to the positional structure of the web page information;
It is further configured to perform word segmentation on the title document part, perform synonym merging on the word segmentation result to obtain first feature words, count the number of occurrences of each first feature word, and store the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
It is further configured to perform word segmentation on the keyword document part, perform synonym merging on the word segmentation result to obtain second feature words, count the number of occurrences of each second feature word, and store the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
It is further configured to perform word segmentation on the body text document part, perform synonym merging on the word segmentation result to obtain third feature words, count the number of occurrences of each third feature word, and store the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
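The segmentation, synonym merging and counting steps above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `segment` is a whitespace-based stand-in for a real word segmenter (Chinese text in particular would need a dedicated one), and the `SYNONYMS` table and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: maps a variant to its canonical feature word.
SYNONYMS = {"webpage": "page", "site": "page"}


def segment(text):
    """Stand-in word segmenter: lower-case the text and split on whitespace."""
    return text.lower().split()


def build_set(text, synonyms=SYNONYMS):
    """Segment one document part, merge synonyms, and count each feature word.

    The returned Counter holds the (feature word, count) data pairs.
    """
    merged = (synonyms.get(word, word) for word in segment(text))
    return Counter(merged)


title_set = build_set("Webpage feature extraction")
```

Calling `build_set` once each on the title, keyword and body text parts would produce the second, third and first sets respectively.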
Further, the determining unit 303 is configured to determine the maximum count among the counts corresponding to all feature words in the first set as the base position weight value;
The processing unit 302 is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
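A short worked sketch of this weighting step is given below. The counts and the two ratio values (`TITLE_RATIO`, `KEYWORD_RATIO`) are hypothetical values chosen only for illustration, since the text does not state concrete ratios, and the function names are likewise assumptions.

```python
def base_position_weight(first_set):
    """Base position weight value: the largest count in the first (body text) set."""
    return max(first_set.values())


def weight_set(counts, base, ratio):
    """Weight of each feature word = (base * ratio) * its count in this set."""
    position_weight = base * ratio
    return {word: position_weight * n for word, n in counts.items()}


# Hypothetical counts and preset ratios, for illustration only.
body = {"patent": 7, "web": 4}
title = {"patent": 1, "search": 1}
keyword = {"web": 2}
TITLE_RATIO, KEYWORD_RATIO = 3, 2

base = base_position_weight(body)
title_weights = weight_set(title, base, TITLE_RATIO)
keyword_weights = weight_set(keyword, base, KEYWORD_RATIO)
```

Here `base` is 7, so each title word receives 7 × 3 × 1 = 21 and the keyword "web" receives 7 × 2 × 2 = 28. Because `base` is recomputed per page, the same preset ratios scale with each page's highest body frequency.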
Further, the processing unit 302 is further configured to add together the weight values corresponding to the same feature word in the multiple sets, and to sort the summed weight values in descending order;
The determining unit 303 is configured to determine the first n weight values after sorting, together with the feature words corresponding to those n weight values, as the feature vector of the target web page, where n is a natural number.
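The integration step can be sketched as below. One assumption is made loudly here: the body text set is taken to contribute its raw counts as weight values, which the text implies but does not state explicitly. The function name `feature_vector` and the sample values are hypothetical.

```python
from collections import Counter


def feature_vector(weighted_sets, n):
    """Sum the weights of identical feature words across all sets, sort the
    sums in descending order, and keep the first n (word, weight) pairs."""
    total = Counter()
    for weights in weighted_sets:
        total.update(weights)  # Counter.update adds to existing keys
    return total.most_common(n)


# Hypothetical weighted sets; the body set uses raw counts (assumption).
title_weights = {"patent": 21, "search": 21}
keyword_weights = {"web": 28}
body_counts = {"patent": 7, "web": 4}

vector = feature_vector([title_weights, keyword_weights, body_counts], n=2)
```

With these inputs the summed weights are web = 32, patent = 28 and search = 21, so the two-element feature vector is `[("web", 32), ("patent", 28)]`.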
Specifically, the web page feature extraction device provided in the embodiments of the present invention can be understood with reference to the description of the above web page feature extraction method embodiments, and the details are not repeated here.
The web page feature extraction device provided in the embodiments of the present invention can work independently of the structure of the web page. In an early stage it screens the high-frequency words in the page content according to word frequency, without having to extract and process the content one structural position at a time, which to a certain extent overcomes the excessive reliance of existing extraction algorithms on web page structure. It then dynamically adjusts the weight value of each word according to the position of the word and the highest frequency in the web page, and uses this weight value as the criterion for extracting the feature words that characterize the page content to the greatest extent. This overcomes the insensitivity of existing extraction algorithms to the position of feature words and balances the effects of high frequency and position on the feature words.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.
Claims (10)
1. A web page feature extraction method, characterized in that the method comprises:
obtaining a target web page, and dividing the target web page into multiple document parts according to the positional structure of the web page information;
performing word segmentation on each of the multiple document parts and counting the word segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each of the multiple sets includes at least one data pair, and each data pair includes a feature word and the count corresponding to that feature word;
determining a base position weight value according to the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets;
determining the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
integrating the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain a feature vector of the target web page, so that feature analysis is performed on the web page according to the feature vector.
2. The method according to claim 1, characterized in that dividing the target web page into multiple document parts according to the positional structure of the web page information comprises:
dividing the target web page into three document parts, namely a title, keywords and body text, according to the positional structure of the web page information.
3. The method according to claim 2, characterized in that performing word segmentation on each of the multiple document parts and counting the word segmentation results to obtain multiple sets corresponding to the multiple document parts comprises:
performing word segmentation on the title document part, performing synonym merging on the word segmentation result to obtain first feature words, counting the number of occurrences of each first feature word, and storing the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
performing word segmentation on the keyword document part, performing synonym merging on the word segmentation result to obtain second feature words, counting the number of occurrences of each second feature word, and storing the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
performing word segmentation on the body text document part, performing synonym merging on the word segmentation result to obtain third feature words, counting the number of occurrences of each third feature word, and storing the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
4. The method according to claim 1 or 3, characterized in that determining the base position weight value according to the counts corresponding to the feature words in the first set comprises:
determining the maximum count among the counts corresponding to all feature words in the first set as the base position weight value.
5. The method according to claim 3, characterized in that determining the weight values of all sets among the multiple sets other than the first set according to the base position weight value, the preset weight ratio values and all sets among the multiple sets other than the first set comprises:
multiplying the product of the base position weight value and a first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position;
multiplying the product of the base position weight value and a second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
6. The method according to claim 1, characterized in that integrating the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain the feature vector of the target web page comprises:
adding together the weight values corresponding to the same feature word in the multiple sets, sorting the summed weight values in descending order, and determining the first n weight values after sorting, together with the feature words corresponding to those n weight values, as the feature vector of the target web page, where n is a natural number.
7. A web page feature extraction device, characterized in that the device comprises an acquiring unit, a processing unit and a determining unit, wherein:
the acquiring unit is configured to obtain a target web page;
the processing unit is configured to divide the target web page into multiple document parts according to the positional structure of the web page information, and is further configured to perform word segmentation on each of the multiple document parts and count the word segmentation results to obtain multiple sets corresponding to the multiple document parts, wherein each document part corresponds to one set, each of the multiple sets includes at least one data pair, and each data pair includes a feature word and the count corresponding to that feature word;
the determining unit is configured to determine a base position weight value according to the counts corresponding to the feature words in a first set, the first set being the set with the most data pairs among the multiple sets, and is further configured to determine the weight values of all sets among the multiple sets other than the first set according to the base position weight value, preset weight ratio values and all sets among the multiple sets other than the first set;
the processing unit is further configured to integrate the multiple sets with the weight values of all sets among the multiple sets other than the first set to obtain a feature vector of the target web page, so that feature analysis is performed on the web page according to the feature vector.
8. The device according to claim 7, characterized in that:
the processing unit is configured to divide the target web page into three document parts, namely a title, keywords and body text, according to the positional structure of the web page information;
is further configured to perform word segmentation on the title document part, perform synonym merging on the word segmentation result to obtain first feature words, count the number of occurrences of each first feature word, and store the first feature words and their corresponding counts as data pairs in a second set corresponding to the title document part, the first feature words including at least one feature word;
is further configured to perform word segmentation on the keyword document part, perform synonym merging on the word segmentation result to obtain second feature words, count the number of occurrences of each second feature word, and store the second feature words and their corresponding counts as data pairs in a third set corresponding to the keyword document part, the second feature words including at least one feature word;
is further configured to perform word segmentation on the body text document part, perform synonym merging on the word segmentation result to obtain third feature words, count the number of occurrences of each third feature word, and store the third feature words and their corresponding counts as data pairs in the first set corresponding to the body text document part, the third feature words including at least one feature word.
9. The device according to claim 7 or 8, characterized in that the determining unit is configured to determine the maximum count among the counts corresponding to all feature words in the first set as the base position weight value;
the processing unit is further configured to multiply the product of the base position weight value and a first preset weight ratio value by the count corresponding to each feature word in the second set to obtain the weight value of each feature word in the second set, the first preset weight ratio value being the weight ratio of the web page title position relative to the web page body text position; and is further configured to multiply the product of the base position weight value and a second preset weight ratio value by the count corresponding to each feature word in the third set to obtain the weight value of each feature word in the third set, the second preset weight ratio value being the weight ratio of the web page keyword position relative to the web page body text position.
10. The device according to claim 7, characterized in that:
the processing unit is further configured to add together the weight values corresponding to the same feature word in the multiple sets, and to sort the summed weight values in descending order;
the determining unit is configured to determine the first n weight values after sorting, together with the feature words corresponding to those n weight values, as the feature vector of the target web page, where n is a natural number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611137455.3A CN108614825B (en) | 2016-12-12 | 2016-12-12 | Webpage feature extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108614825A (en) | 2018-10-02 |
CN108614825B (en) | 2022-04-15 |
Family
ID=63657508
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611137455.3A Active CN108614825B (en) | 2016-12-12 | 2016-12-12 | Webpage feature extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108614825B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989790A (en) * | 2021-03-17 | 2021-06-18 | 中国科学院深圳先进技术研究院 | Document characterization method and device based on deep learning, equipment and storage medium |
CN114528811A (en) * | 2022-01-21 | 2022-05-24 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN115858470A (en) * | 2022-12-26 | 2023-03-28 | 深圳市中政汇智管理咨询有限公司 | Policy and regulation file matching method, system, server and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079031A (en) * | 2006-06-15 | 2007-11-28 | 腾讯科技(深圳)有限公司 | Web page subject extraction system and method |
CN101246498A (en) * | 2008-03-27 | 2008-08-20 | 腾讯科技(深圳)有限公司 | News web page searching method |
CN103064969A (en) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | Method for automatically creating keyword index table |
US20130246184A1 (en) * | 2012-03-13 | 2013-09-19 | PowerLinks Media Limited | Method and system for displaying a contextual advertisement on a webpage |
CN103617213A (en) * | 2013-11-19 | 2014-03-05 | 北京奇虎科技有限公司 | Method and system for identifying newspage attributive characters |
CN103810264A (en) * | 2014-01-27 | 2014-05-21 | 西安理工大学 | Webpage text classification method based on feature selection |
US20140143225A1 (en) * | 2012-11-21 | 2014-05-22 | Hon Hai Precision Industry Co., Ltd. | Web searching method, system, and apparatus |
Non-Patent Citations (1)
Title |
---|
刘琼琼等: "面向网页的主题概念挖掘", 《计算机科学》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989790A (en) * | 2021-03-17 | 2021-06-18 | 中国科学院深圳先进技术研究院 | Document characterization method and device based on deep learning, equipment and storage medium |
CN112989790B (en) * | 2021-03-17 | 2023-02-28 | 中国科学院深圳先进技术研究院 | Document characterization method and device based on deep learning, equipment and storage medium |
CN114528811A (en) * | 2022-01-21 | 2022-05-24 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN114528811B (en) * | 2022-01-21 | 2022-09-02 | 北京麦克斯泰科技有限公司 | Article content extraction method, device, equipment and storage medium |
CN115858470A (en) * | 2022-12-26 | 2023-03-28 | 深圳市中政汇智管理咨询有限公司 | Policy and regulation file matching method, system, server and storage medium |
CN115858470B (en) * | 2022-12-26 | 2023-09-22 | 深圳市中政汇智管理咨询有限公司 | Policy and regulation file matching method, system, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108614825B (en) | 2022-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105354333B (en) | A kind of method for extracting topic based on newsletter archive | |
CN104915448B (en) | A kind of entity based on level convolutional network and paragraph link method | |
CN105426762B (en) | A kind of static detection method that android application programs are malicious | |
CN106570144A (en) | Method and apparatus for recommending information | |
CN105956031A (en) | Text classification method and apparatus | |
CN105243152A (en) | Graph model-based automatic abstracting method | |
CN107862022A (en) | Cultural resource commending system | |
CN107180075A (en) | The label automatic generation method of text classification integrated level clustering | |
CN105760493A (en) | Automatic work order classification method for electricity marketing service hot spot 95598 | |
CN108199951A (en) | A kind of rubbish mail filtering method based on more algorithm fusion models | |
CN105205090A (en) | Web page text classification algorithm research based on web page link analysis and support vector machine | |
CN107943824A (en) | A kind of big data news category method, system and device based on LDA | |
CN105528422A (en) | Focused crawler processing method and apparatus | |
CN101833579B (en) | Method and system for automatically detecting academic misconduct literature | |
CN108614825A (en) | A kind of web page characteristics extracting method and device | |
CN105205163B (en) | A kind of multi-level two sorting technique of the incremental learning of science and technology news | |
CN105512333A (en) | Product comment theme searching method based on emotional tendency | |
CN107145476A (en) | One kind is based on improvement TF IDF keyword extraction algorithms | |
CN108536683A (en) | A kind of paper fragmentation information abstracting method based on machine learning | |
CN109710725A (en) | A kind of Chinese table column label restoration methods and system based on text classification | |
CN103514151A (en) | Dependency grammar analysis method and device and auxiliary classifier training method | |
CN102436512A (en) | Preference-based web page text content control method | |
CN103268346A (en) | Semi-supervised classification method and semi-supervised classification system | |
CN103257961B (en) | Bibliography disappear weight method, Apparatus and system | |
CN110019820A (en) | Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 310012 building A01, 1600 yuhangtang Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province. Applicant after: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.; China Mobile Communications Corp. Address before: 310012, No. 14, building three, Chang Torch Hotel, No. 259, Wensanlu Road, Xihu District, Zhejiang, Hangzhou. Applicant before: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.; China Mobile Communications Corp. |
GR01 | Patent grant | |