CN108614825B - Webpage feature extraction method and device - Google Patents

Webpage feature extraction method and device

Info

Publication number
CN108614825B
Authority
CN
China
Prior art keywords
word
feature
webpage
sets
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611137455.3A
Other languages
Chinese (zh)
Other versions
CN108614825A (en)
Inventor
吕颖韬
冯宜安
周璐
张贝金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN201611137455.3A
Publication of CN108614825A
Application granted
Publication of CN108614825B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a webpage feature extraction method and device. A target webpage is divided into a plurality of document parts according to the position architecture of the webpage information. Word segmentation is performed on each document part and the segmentation results are counted to obtain a plurality of sets, one per document part. A basic position weight value is determined from the number of times corresponding to the feature words in a first set, the first set being the set with the most data pairs among the plurality of sets. The weight values of all sets other than the first set are determined from the basic position weight value, preset weight proportion values, and those sets. The plurality of sets and the weight values of all sets other than the first set are then integrated to obtain the feature vector of the target webpage, so that feature analysis can be performed on the webpage according to the feature vector.

Description

Webpage feature extraction method and device
Technical Field
The invention relates to a feature extraction technology in the field of internet, in particular to a webpage feature extraction method and device.
Background
The extraction of webpage features is one of the key technologies for data analysis of webpage content and an important step in personalized analysis and personalized service recommendation for internet users. The quality of webpage feature extraction directly affects the quality of the personalized analysis results and, in turn, the quality of the personalized services provided to the user. The extraction process is very sensitive to the framework of the webpage, the richness of its content words, and synonymy among words. A feature extraction algorithm must account for these factors, avoid interference from other factors, and extract the feature words that best represent the webpage content.
In the prior art, algorithms for extracting webpage features are designed and optimized mainly on the basis of the term frequency-inverse document frequency (TF-IDF) algorithm and tree extraction techniques based on the Document Object Model (DOM). TF-IDF is a weighting technique commonly used in information retrieval and data mining. It evaluates the importance of a word in a webpage from the number of times the word appears in the document and the number of documents across the web that contain the word, and selects the feature words of the webpage by ranking on that importance. DOM-tree-based extraction extracts data from a HyperText Markup Language (HTML) webpage according to the tree-like hierarchical structure of the page, and obtains the feature words by optimizing the feature vector of the webpage. The feature words obtained by DOM-tree-based extraction have relatively high accuracy and recall.
However, the feature word weight calculation in TF-IDF is not reasonable for webpages. An HTML document differs in structure from an ordinary document: it is semi-structured text, so feature words at different positions represent the article to different degrees and should be assigned different weight values. Applying the IDF calculation alone is therefore neither rigorous nor comprehensive. TF-IDF also discriminates poorly between classes: it can distinguish a feature item within the class of the text, but it cannot express well how that feature item differs from other classes. DOM-tree-based extraction, for its part, depends too heavily on the webpage structure. Although the feature words it obtains have relatively high accuracy and recall, the technique requires a number of corresponding example webpages before it can be applied to different knowledge fields, and its dependence on structure leaves it vulnerable when the webpage structure changes. In summary, the two basic methods each have a limitation: insensitivity to the position of feature words, and over-dependence on webpage structure, respectively.
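To make the TF-IDF weighting described above concrete, the following is a minimal sketch. The smoothed logarithmic IDF variant, the function name `tf_idf`, and the toy corpus are assumptions chosen for illustration; the patent does not give a formula for the prior-art algorithm.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each word of doc_tokens against a corpus (a list of token lists).

    Assumed variant: tf = count / len(doc), idf = log(N / (1 + df)).
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    scores = {}
    for word, count in counts.items():
        # df: number of documents in the corpus that contain the word
        df = sum(1 for doc in corpus if word in doc)
        tf = count / len(doc_tokens)
        scores[word] = tf * math.log(n_docs / (1 + df))
    return scores

# Toy corpus (assumed data): "mobile" is common, "network" is rarer.
corpus = [
    ["mobile", "network", "speed"],
    ["mobile", "phone", "price"],
    ["football", "match", "score"],
    ["weather", "forecast", "rain"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "network" scores higher than "mobile" because it appears in fewer documents. The score ignores where in the page a word appears, which is the limitation the patent goes on to address.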
Disclosure of Invention
In order to solve the technical problem, embodiments of the present invention provide a method and an apparatus for extracting webpage features, so as to optimize the quality of webpage feature extraction results and ensure the correctness of personalized analysis data for internet users.
The technical scheme of the invention is realized as follows:
the embodiment of the invention provides a webpage feature extraction method, which comprises the following steps:
acquiring a target webpage, and dividing the target webpage into a plurality of document parts according to a position framework of webpage information;
performing word segmentation processing on the plurality of document parts respectively, performing statistics on word segmentation processing results, and obtaining a plurality of sets corresponding to the plurality of document parts, wherein each document part corresponds to one set, each set in the plurality of sets comprises at least one data pair, and each data pair comprises: the feature words and the times corresponding to the feature words;
determining a basic position weight value according to the times corresponding to the feature words in a first set, wherein the first set is a set with the most data pairs in the multiple sets;
determining the weight values of all the sets except the first set according to the basic position weight values, preset weight proportion values and all the sets except the first set in the sets;
and integrating the plurality of sets with the weight values of all sets except the first set to obtain the feature vector of the target webpage, so that feature analysis can be performed on the webpage according to the feature vector.
Optionally, the dividing the target webpage into a plurality of document parts according to the position architecture of the webpage information includes:
and dividing the target webpage into three document parts, namely a title, a keyword and a text according to the position architecture of the webpage information.
Optionally, the performing the word segmentation processing on the plurality of document parts respectively, and performing statistics on word segmentation processing results to obtain a plurality of sets corresponding to the plurality of document parts includes:
performing word segmentation on the title document part, performing synonym combination on the word segmentation processing result to obtain a first characteristic word, counting the times corresponding to the first characteristic word, and storing the first characteristic word and the times corresponding to the first characteristic word in a second set corresponding to the title document part in a data pair format, wherein the first characteristic word comprises at least one characteristic word;
performing word segmentation on the keyword document part, performing synonym combination on the word segmentation processing result to obtain a second characteristic word, counting the times corresponding to the second characteristic word, and storing the second characteristic word and the times corresponding to the second characteristic word in a third set corresponding to the keyword document part in a data pair format, wherein the second characteristic word comprises at least one characteristic word;
performing word segmentation on the text document part, performing synonym combination on the word segmentation processing result to obtain a third feature word, counting the times corresponding to the third feature word, and storing the third feature word and the times corresponding to the third feature word in a first set corresponding to the text document part in a data pair format, wherein the third feature word comprises at least one feature word.
Optionally, the determining a basic location weight value according to the number of times corresponding to the feature word in the first set includes:
and determining the maximum numerical value in the times corresponding to all the feature words in the first set as the basic position weight value.
Optionally, the determining the weight values of all the sets except the first set according to the basic location weight values, a preset weight proportion value and all the sets except the first set in the plurality of sets includes:
multiplying the product of the basic position weight value and a first preset weight proportion value by the number of times corresponding to each feature word in the second set, to obtain the weight value of each feature word in the second set, wherein the first preset weight proportion value is the weight proportion of the webpage title position relative to the webpage text position;
and multiplying the product of the basic position weight value and a second preset weight proportion value by the number of times corresponding to each feature word in the third set, to obtain the weight value of each feature word in the third set, wherein the second preset weight proportion value is the weight proportion of the webpage keyword position relative to the webpage text position.
Optionally, the integrating the weight values of the plurality of sets and all sets except the first set in the plurality of sets to obtain the feature vector of the target web page includes:
adding the weighted values corresponding to the same feature words in the multiple sets, sorting the added weighted values from large to small, and determining the top n weighted values after sorting and the feature words corresponding to the top n weighted values as the feature vectors of the target webpage, wherein n is a natural number.
The embodiment of the invention provides a webpage feature extraction device, which comprises: an acquisition unit, a processing unit, a determination unit, wherein,
the acquisition unit is used for acquiring a target webpage;
the processing unit is configured to divide the target webpage into a plurality of document parts according to a position structure of webpage information, and further configured to perform word segmentation on the plurality of document parts, perform statistics on word segmentation processing results, and obtain a plurality of sets corresponding to the plurality of document parts, where each document part corresponds to one set, each set in the plurality of sets includes at least one data pair, and each data pair includes: the feature words and the times corresponding to the feature words;
the determining unit is configured to determine a basic position weight value according to the number of times corresponding to the feature words in a first set, where the first set is the set with the most data pairs among the plurality of sets; and is further configured to determine the weight values of all sets except the first set according to the basic position weight value, preset weight proportion values and all sets except the first set in the plurality of sets;
the processing unit is further configured to integrate the plurality of sets with the weight values of all sets except the first set to obtain a feature vector of the target webpage, so that feature analysis can be performed on the webpage according to the feature vector.
Optionally, the processing unit is configured to divide the target webpage into three document parts, namely a title, a keyword and a text, according to a position structure of webpage information;
the processing unit is further configured to perform word segmentation on the title document part, perform synonym merging on the word segmentation result to obtain a first feature word, count the number of times corresponding to the first feature word, and store the first feature word and its number of times, in data pair format, in a second set corresponding to the title document part, where the first feature word comprises at least one feature word;
the processing unit is further configured to perform word segmentation on the keyword document part, perform synonym merging on the word segmentation result to obtain a second feature word, count the number of times corresponding to the second feature word, and store the second feature word and its number of times, in data pair format, in a third set corresponding to the keyword document part, where the second feature word comprises at least one feature word;
the processing unit is further configured to perform word segmentation on the text document part, perform synonym merging on the word segmentation result to obtain a third feature word, count the number of times corresponding to the third feature word, and store the third feature word and its number of times, in data pair format, in a first set corresponding to the text document part, where the third feature word comprises at least one feature word.
Optionally, the determining unit is configured to determine the maximum of the numbers of times corresponding to all feature words in the first set as the basic position weight value;
the processing unit is further configured to multiply the product of the basic position weight value and a first preset weight proportion value by the number of times corresponding to each feature word in the second set, to obtain the weight value of each feature word in the second set, where the first preset weight proportion value is the weight proportion of the webpage title position relative to the webpage text position; and to multiply the product of the basic position weight value and a second preset weight proportion value by the number of times corresponding to each feature word in the third set, to obtain the weight value of each feature word in the third set, where the second preset weight proportion value is the weight proportion of the webpage keyword position relative to the webpage text position.
Optionally, the processing unit is further configured to add weight values corresponding to the same feature words in the multiple sets, and sort the added weight values from large to small;
the determining unit is configured to determine the top n weight values after sorting, together with the feature words corresponding to them, as the feature vector of the target webpage, where n is a natural number.
Embodiments of the invention provide a webpage feature extraction method and device in which the preset weight proportion value of each basic part of a webpage is determined in advance, and the highest number of occurrences among the webpage feature words is used as the adjustment value, that is, the basic position weight value, for every basic part, to finally determine the position weight value of each feature word. This personalizes the extraction of webpage feature words and determines the position weight value of each position of the webpage content dynamically.
Drawings
Fig. 1 is a schematic flow chart of a method for extracting web page features according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating an exemplary method for extracting web page features according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a web page feature extraction device according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The webpage feature extraction method provided by the invention uses position weights in webpage feature extraction and combines the influence of two factors, position weight and number of occurrences, on the extracted feature vector. After high-frequency words that distinguish the target webpage from other webpages across the web are extracted, the target webpage is divided into a plurality of document parts according to the basic position architecture of the webpage information, and each document part is given a different weight proportion value. The number of occurrences of the most frequent feature word in the webpage is taken as the basic position weight value, and the position weight value of each position is determined from the words appearing in the webpage, so that the degree to which each feature word represents the target webpage content is adjusted dynamically.
The invention provides a webpage feature extraction method, as shown in fig. 1, the method may include:
Step 101, obtaining a target webpage, and dividing the target webpage into a plurality of document parts according to a position framework of webpage information.
The execution main body of the webpage feature extraction method provided by the embodiment of the invention is a webpage feature extraction device, namely the webpage feature extraction device acquires a target webpage and divides the target webpage into a plurality of document parts according to the position architecture of webpage information.
Specifically, as shown in fig. 2, the web page feature extraction device may divide the target web page into three document parts, namely, a title, a keyword, and a text, according to the location structure of the web page information.
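The division into title, keyword and text parts can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class `PagePartSplitter`, the function `split_page`, and the choice of `<title>`, `<meta name="keywords">` and `<body>` as the three positions are hypothetical, since the patent does not state how the parts are located in the HTML.

```python
from html.parser import HTMLParser

class PagePartSplitter(HTMLParser):
    """Collects title text, meta keywords and body text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = {"title": [], "meta": [], "content": []}
        self._stack = []  # open tags we track: title, body, script, style

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.parts["meta"].append(attrs.get("content") or "")
        elif tag in ("title", "body", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to (and including) the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not self._stack or not text:
            return
        current = self._stack[-1]
        if current == "title":
            self.parts["title"].append(text)
        elif current == "body":  # text inside script/style is skipped
            self.parts["content"].append(text)

def split_page(html):
    """Return a dict with the 'title', 'meta' and 'content' text of a page."""
    parser = PagePartSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.parts.items()}

page = """<html><head><title>Mobile network guide</title>
<meta name="keywords" content="mobile, network">
</head><body><p>The mobile network is fast.</p>
<script>var x = 1;</script><p>Network coverage grows.</p></body></html>"""
parts = split_page(page)
```

Each of the three strings in `parts` is then segmented and counted separately in step 102.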
Step 102, performing word segmentation processing on the plurality of document parts respectively, and counting word segmentation processing results to obtain a plurality of sets corresponding to the plurality of document parts.
Wherein each document part corresponds to a collection, each of the plurality of collections includes at least one data pair, and each data pair includes: the characteristic words and the times corresponding to the characteristic words.
In a possible implementation manner, the web page feature extraction device performs word segmentation on a title document part, performs synonym combination on word segmentation processing results to obtain a first feature word, counts the times corresponding to the first feature word, and stores the first feature word and the times corresponding to the first feature word in a second set corresponding to the title document part in a data pair format, where the first feature word includes at least one feature word;
the webpage feature extraction device carries out word segmentation on the keyword document part, synonym merging processing is carried out on the word segmentation processing result to obtain a second feature word, the times corresponding to the second feature word are counted, the second feature word and the times corresponding to the second feature word are stored in a third set corresponding to the keyword document part in a data pair format, and the second feature word comprises at least one feature word;
the webpage feature extraction device carries out word segmentation on the text document part, synonym merging processing is carried out on the word segmentation processing result to obtain a third feature word, the times corresponding to the third feature word are counted, the third feature word and the times corresponding to the third feature word are stored in a first set corresponding to the text document part in a data pair format, and the third feature word comprises at least one feature word.
Specifically, as shown in Fig. 2, the whole webpage is structured and the target webpage is divided by position into three document parts: the title (TITLE), the keywords (META) and the body text (CONTENT). ICTCLAS word segmentation is performed on each of the three parts, synonym merging is applied to the segmentation results, and the number of occurrences of each word or phrase is counted. The results are stored as data pairs of the form $(p_{ij}, f_j)$ in the sets title, meta and content, where p is a word or phrase, f is the number of times the word or phrase appears, i is the code of the position at which the word or phrase appears, j is the order in which the word or phrase appears at that position, title is the set corresponding to the title document part, meta is the set corresponding to the keyword document part, and content is the set corresponding to the body text document part.
Assuming that the total numbers of words in title, meta and content are l, m and n respectively,
the content of the set title is: $\{(p_{t1}, f_1), (p_{t2}, f_2), \ldots, (p_{tk}, f_k), \ldots, (p_{tl}, f_l)\}$;
the content of the set meta is: $\{(p_{m1}, f_1), (p_{m2}, f_2), \ldots, (p_{mk}, f_k), \ldots, (p_{mm}, f_m)\}$;
the content of the set content is: $\{(p_{c1}, f_1), (p_{c2}, f_2), \ldots, (p_{ck}, f_k), \ldots, (p_{cn}, f_n)\}$.
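The construction of one such set can be sketched as follows. This is a simplified illustration, not the patent's implementation: the regex tokenizer `segment` is a stand-in for ICTCLAS (which targets Chinese text), the `SYNONYMS` table is hypothetical, and no stop-word filtering is applied.

```python
import re
from collections import Counter

# Hypothetical synonym table: maps a variant to its canonical feature word.
SYNONYMS = {"cellphone": "mobile", "handset": "mobile"}

def segment(text):
    """Stand-in tokenizer; the patent uses ICTCLAS word segmentation."""
    return re.findall(r"\w+", text.lower())

def build_set(text, synonyms=SYNONYMS):
    """Return the (feature word -> number of times) pairs of one document part."""
    merged = (synonyms.get(token, token) for token in segment(text))
    return Counter(merged)

title = build_set("Mobile network guide")
content = build_set("The cellphone network is fast. Mobile handset sales grow.")
```

Here "cellphone", "mobile" and "handset" in the body text are merged into the single feature word "mobile" before counting, so its count in `content` is 3.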
Step 103, determining a basic position weight value according to the times corresponding to the feature words in the first set.
Wherein the first set is a set with the most data pairs in the plurality of sets.
Specifically, the web page feature extraction device determines the maximum value of times corresponding to all feature words in the first set as the basic position weight value.
Embodiments of the invention distinguish the weight carried by each position in the webpage, because words or phrases at different positions differ in how strongly they influence and signal the main content of the webpage. A weight therefore needs to be assigned separately to the words or phrases at each position.
Here, the weight of a word or phrase at the title position is defined as $\alpha B$, the weight of a word or phrase at the keyword position is defined as $\beta B$, and the weight of a word or phrase at the text position of the webpage is defined as 1, where $B$ is the basic position weight value, and $\alpha$ and $\beta$ are the weight proportion values of the title position and the keyword position relative to the text position of the webpage. In general, $\alpha$ and $\beta$ satisfy the relation given in equation image BDA0001177177740000081.
In this embodiment, $\alpha = 4$ and $\beta = 2$; both may be adjusted according to actual conditions.
Here, the basic position weight value B is calculated as:
$$B = \max\{f_{c1}, f_{c2}, \ldots, f_{ck}, \ldots, f_{cn}\} \qquad (1)$$
Step 104, determining the weight values of all the sets except the first set in the plurality of sets according to the basic position weight value, preset weight proportion values and all the sets except the first set in the plurality of sets.
Specifically, the webpage feature extraction device multiplies the product of the basic position weight value and a first preset weight proportion value by the number of times corresponding to each feature word in the second set, to obtain the weight value of each feature word in the second set, where the first preset weight proportion value is the weight proportion of the webpage title position relative to the webpage text position. It likewise multiplies the product of the basic position weight value and a second preset weight proportion value by the number of times corresponding to each feature word in the third set, to obtain the weight value of each feature word in the third set, where the second preset weight proportion value is the weight proportion of the webpage keyword position relative to the webpage text position.
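The weight of each word or phrase in the sets title and meta is calculated from the basic position weight value B obtained by formula (1):
$$W_t = \alpha B \cdot \{(p_{t1}, f_1), (p_{t2}, f_2), \ldots, (p_{tk}, f_k), \ldots, (p_{tl}, f_l)\} = \{(p_{t1}, \alpha B f_1), (p_{t2}, \alpha B f_2), \ldots, (p_{tk}, \alpha B f_k), \ldots, (p_{tl}, \alpha B f_l)\}$$
$$W_m = \beta B \cdot \{(p_{m1}, f_1), (p_{m2}, f_2), \ldots, (p_{mk}, f_k), \ldots, (p_{mm}, f_m)\} = \{(p_{m1}, \beta B f_1), (p_{m2}, \beta B f_2), \ldots, (p_{mk}, \beta B f_k), \ldots, (p_{mm}, \beta B f_m)\} \qquad (2)$$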
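Formulas (1) and (2) can be worked through on a small example. The sketch below uses the embodiment's values α = 4 and β = 2; the function names and the three toy sets are assumptions made for illustration.

```python
ALPHA = 4  # title-to-text weight proportion used in this embodiment
BETA = 2   # keyword-to-text weight proportion used in this embodiment

def base_position_weight(content):
    """Formula (1): B is the largest number of times in the body text set."""
    return max(content.values())

def weight_set(counts, ratio, base):
    """Formula (2): scale every number of times in a set by ratio * base."""
    return {word: ratio * base * count for word, count in counts.items()}

# Toy sets (assumed data), mapping feature word -> number of times.
title = {"mobile": 1, "network": 1}
meta = {"mobile": 1, "tariff": 1}
content = {"mobile": 3, "network": 2, "speed": 1}

B = base_position_weight(content)   # largest count in content
W_t = weight_set(title, ALPHA, B)   # each title count times ALPHA * B
W_m = weight_set(meta, BETA, B)     # each keyword count times BETA * B
```

With these toy sets B is 3, so a single occurrence in the title weighs 12 and a single occurrence in the keywords weighs 6, against 1 per occurrence in the body text. Because B is taken from the page itself, the title and keyword weights scale with how repetitive the body text is.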
Step 105, integrating the plurality of sets with the weight values of all sets except the first set to obtain a feature vector of the target webpage, so that feature analysis can be performed on the webpage according to the feature vector.
Specifically, the webpage feature extraction device adds the weight values corresponding to the same feature word across the multiple sets, sorts the summed weight values from largest to smallest, and determines the top n weight values after sorting, together with their feature words, as the feature vector of the target webpage, where n is a natural number.
Illustratively, after the weight sets of the words or phrases in the three parts of the webpage are obtained according to formula (2), the feature item sets of the three parts and their weights are integrated into a single feature item set. The integration principle is as follows: the weights of identical feature items are added, the feature items are sorted from largest to smallest by weight, and the top n feature items are selected as the feature vector of the webpage. The result is expressed as $T = \{t_1, \ldots, t_i, \ldots, t_n\}$ and $w = \{w_1, \ldots, w_i, \ldots, w_n\}$, where $t_i$ is a feature word and $w_i$ is the weight value corresponding to $t_i$. The value of n can be adjusted dynamically according to actual conditions; T is the feature word set of the webpage, w is the set of weight values of that feature word set, and the two sets correspond one to one.
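The integration step can be sketched as a merge followed by a top-n selection. Two points in this sketch are assumptions rather than statements from the patent: the body text set enters the sum with its raw numbers of times (reading the text-position weight of 1 literally), and ties are broken alphabetically. The function name `integrate` and the toy inputs are hypothetical.

```python
from collections import Counter

def integrate(weighted_title, weighted_meta, content, n):
    """Add the weights of identical feature words and keep the top n."""
    total = Counter()
    for part in (weighted_title, weighted_meta, content):
        total.update(part)  # Counter.update adds values for repeated keys
    # Sort by weight, largest first; alphabetical order breaks ties (assumed).
    top = sorted(total.items(), key=lambda item: (-item[1], item[0]))[:n]
    T = [word for word, _ in top]
    w = [weight for _, weight in top]
    return T, w

# Toy inputs (assumed data): weighted title and keyword sets, raw body counts.
W_t = {"mobile": 12, "network": 12}
W_m = {"mobile": 6, "tariff": 6}
content = {"mobile": 3, "network": 2, "speed": 1}

T, w = integrate(W_t, W_m, content, n=3)
```

For these inputs the summed weights are 21 for "mobile", 14 for "network", 6 for "tariff" and 1 for "speed", so with n = 3 the feature vector keeps the first three.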
The webpage feature extraction method provided by the embodiment of the invention can be suitable for the most of the internet webpage feature extraction processes; machine learning on a large number of webpages on the Internet in advance is not needed, and the method is independent of the structure of the webpages; the dynamic determination of the position weight of each position of the webpage content is realized; the position weight proportion of each basic part of the webpage determined in advance and the highest frequency of the statistical webpage content words are used as the position weight adjusting value of each basic part to finally determine the position weight of the feature words, thereby realizing the individuation in the extraction of the webpage feature words, optimizing the quality of the webpage feature extraction result, ensuring the correctness of the individualized analysis data of the internet user, and providing reasonable guidance for the individualized service of the internet user.
In the prior art, if a word or phrase appears frequently in an article and rarely appears in other articles, the word or phrase is considered to have good category distinguishing capability and is suitable for classification, so that certain limitation exists, that is, the prior art considers all words or phrases in a webpage to be identical, the only difference is the number of times of appearance in the webpage, and judges whether the word or phrase is suitable as a feature word of the webpage or not according to the difference, but the position of the appearance of the feature word has different weight meanings relative to the specificity of the semi-structure of the document, and the position meaning of the appearance of the feature word can even represent the characteristics of the webpage more than the number of times of the feature word. For example, words or phrases appearing in the title of a web page are relatively more general about the content of the web page and characterize the web page than words or phrases appearing in the body of the web page, because the title of a web page is a feature that has been self-refined by the author, and words or phrases appearing in the body of a web page are one of many words that describe the content of a web page in detail.
The webpage feature extraction method provided by the embodiment of the invention is characterized in that some key parts of the webpage are basic parts in advance, such as position weight proportion values of a title position, a keyword position and a text position. The parts are basic framework parts of the web pages, namely, the parts covered by all the web pages on the Internet, so that a large amount of learning is not needed; counting the frequency of the most characteristic words in the whole webpage, taking the frequency as an adjustment value of a basic position weight value, wherein the highest frequency of the characteristic words is unpredictable for each webpage, and finally taking the product of the position weight proportion value and the basic position weight value as the position weight value of the characteristic words on each position of the webpage, thereby realizing the dynamic adjustment of the webpage position weight.
The webpage feature extraction method provided by the embodiment of the invention does not depend on the framework of the webpage: high-frequency words in the webpage content are screened at an early stage according to word frequency and need not be extracted one by one according to their structural position in the webpage and processed separately, which to a certain extent overcomes the excessive dependence of existing extraction algorithms on the webpage structure. The weight value of each word is dynamically adjusted according to the position of the word and the highest word frequency of the webpage, and the feature words that best represent the webpage content are extracted using that weight value as the criterion. This overcomes the insensitivity of existing extraction algorithms to the position of feature words and balances the influence of high frequency and position on the feature words.
An embodiment of the present invention provides a webpage feature extraction apparatus 30. As shown in fig. 3, the apparatus includes an acquisition unit 301, a processing unit 302 and a determination unit 303, wherein:
the acquisition unit 301 is configured to acquire a target webpage;
the processing unit 302 is configured to divide the target webpage into a plurality of document parts according to a position structure of webpage information, and is further configured to perform word segmentation on the plurality of document parts respectively and perform statistics on the word segmentation results to obtain a plurality of sets corresponding to the plurality of document parts, where each document part corresponds to one set, each set in the plurality of sets includes at least one data pair, and each data pair includes a feature word and the number of times corresponding to the feature word;
the determination unit 303 is configured to determine a basic position weight value according to the numbers of times corresponding to the feature words in a first set, where the first set is the set with the most data pairs among the plurality of sets; and is further configured to determine the weight values of all sets other than the first set according to the basic position weight value, preset weight proportion values and all sets other than the first set in the plurality of sets;
the processing unit 302 is further configured to perform integration processing on the weight values of the plurality of sets and all sets other than the first set in the plurality of sets to obtain a feature vector of the target webpage, so that feature analysis is performed on the webpage according to the feature vector.
Further, the processing unit 302 is configured to divide the target webpage into three document parts, namely a title, keywords and a text, according to the position structure of the webpage information;
the processing unit 302 is further configured to perform word segmentation on the title document part, perform synonym merging on the word segmentation result to obtain first feature words, count the number of times corresponding to each first feature word, and store the first feature words and their corresponding numbers of times in data pair format in a second set corresponding to the title document part, where the first feature words include at least one feature word;
the processing unit 302 is further configured to perform word segmentation on the keyword document part, perform synonym merging on the word segmentation result to obtain second feature words, count the number of times corresponding to each second feature word, and store the second feature words and their corresponding numbers of times in data pair format in a third set corresponding to the keyword document part, where the second feature words include at least one feature word;
the processing unit 302 is further configured to perform word segmentation on the text document part, perform synonym merging on the word segmentation result to obtain third feature words, count the number of times corresponding to each third feature word, and store the third feature words and their corresponding numbers of times in data pair format in the first set corresponding to the text document part, where the third feature words include at least one feature word.
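The per-part processing above (word segmentation, synonym merging, counting, storage as data pairs) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `segment` is a whitespace splitter standing in for a real word segmenter, and the `SYNONYMS` table and the name `build_set` are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: maps each variant to one canonical feature word.
SYNONYMS = {"mobile": "phone", "cellphone": "phone"}

def segment(text):
    """Stand-in segmenter (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def build_set(text, synonyms=SYNONYMS):
    """Return {feature word: number of times} for one document part."""
    # Merge synonyms by replacing each token with its canonical form.
    merged = (synonyms.get(token, token) for token in segment(text))
    # Count occurrences; each (word, count) item is one "data pair".
    return dict(Counter(merged))

title_set = build_set("Mobile phone review")
print(title_set)
```

The same function would be applied to the title, keyword and text parts to produce the second, third and first sets respectively.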
Further, the determination unit 303 is configured to determine the maximum of the numbers of times corresponding to all feature words in the first set as the basic position weight value;
the processing unit 302 is further configured to multiply the product of the basic position weight value and a first preset weight proportion value by the number of times corresponding to each feature word in the second set, to obtain the weight value of each feature word in the second set, where the first preset weight proportion value is the weight proportion of the webpage title position relative to the webpage text position; and to multiply the product of the basic position weight value and a second preset weight proportion value by the number of times corresponding to each feature word in the third set, to obtain the weight value of each feature word in the third set, where the second preset weight proportion value is the weight proportion of the webpage keyword position relative to the webpage text position.
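As a worked illustration of the two steps above, the sketch below takes the largest count in the first (text) set as the basic position weight value and then scales the counts of the title and keyword sets. The function names, the sample counts and the proportion values 3 and 2 are assumptions chosen for the example; the source does not state concrete values.

```python
def base_position_weight(text_set):
    """Basic position weight value: the largest count in the first (text) set."""
    return max(text_set.values())

def weight_set(counts, base, ratio):
    """Weight of each feature word = (base * ratio) * its count in this set."""
    position_weight = base * ratio
    return {word: position_weight * n for word, n in counts.items()}

# Sample data pairs (assumed for illustration).
text_set = {"phone": 5, "battery": 3}      # first set
title_set = {"phone": 1, "review": 1}      # second set
keyword_set = {"phone": 1, "battery": 2}   # third set

TITLE_RATIO = 3    # assumed first preset weight proportion value
KEYWORD_RATIO = 2  # assumed second preset weight proportion value

base = base_position_weight(text_set)
title_weights = weight_set(title_set, base, TITLE_RATIO)
keyword_weights = weight_set(keyword_set, base, KEYWORD_RATIO)
print(base, title_weights, keyword_weights)
```

With these sample values the basic position weight value is 5, so a single occurrence in the title is worth 15 and a single occurrence among the keywords is worth 10.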
Further, the processing unit 302 is further configured to add together the weight values corresponding to the same feature word in the plurality of sets, and to sort the summed weight values from large to small;
the determination unit 303 is configured to determine the top n weight values after sorting, together with the feature words corresponding to those n weight values, as the feature vector of the target webpage, where n is a natural number.
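The integration step can be sketched as a merge of per-set weight maps followed by a descending sort and a cut at n. The name `feature_vector` and the sample values are hypothetical. The sketch also assumes that the first (text) set contributes its raw counts as weights, which this passage does not state explicitly.

```python
from collections import Counter

def feature_vector(weight_sets, n):
    """Sum weights of identical feature words across sets and keep the top n."""
    total = Counter()
    for weights in weight_sets:
        # Counter.update adds the values of a mapping to the running totals.
        total.update(weights)
    # Sort (word, weight) pairs from largest to smallest weight.
    ranked = sorted(total.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Sample weight maps (assumed): text counts, title weights, keyword weights.
text_weights = {"phone": 5, "battery": 3}
title_weights = {"phone": 15, "review": 15}
keyword_weights = {"phone": 10, "battery": 20}

vector = feature_vector([text_weights, title_weights, keyword_weights], n=2)
print(vector)
```

Here "phone" accumulates 30, "battery" 23 and "review" 15, so with n = 2 the feature vector keeps "phone" and "battery".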
Specifically, for an understanding of the webpage feature extraction apparatus provided in the embodiment of the present invention, reference may be made to the description of the foregoing webpage feature extraction method embodiment; the details are not repeated here.
The webpage feature extraction apparatus provided by the embodiment of the invention does not depend on the framework of the webpage: high-frequency words in the webpage content are screened at an early stage according to word frequency and need not be extracted one by one according to their structural position in the webpage and processed separately, which to a certain extent overcomes the excessive dependence of existing extraction algorithms on the webpage structure. The weight value of each word is dynamically adjusted according to the position of the word and the highest word frequency of the webpage, and the feature words that best represent the webpage content are extracted using that weight value as the criterion. This overcomes the insensitivity of existing extraction algorithms to the position of feature words and balances the influence of high frequency and position on the feature words.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method for extracting web page features, the method comprising:
acquiring a target webpage, and dividing the target webpage into a plurality of document parts according to a position framework of webpage information;
performing word segmentation processing on the plurality of document parts respectively, performing statistics on word segmentation processing results, and obtaining a plurality of sets corresponding to the plurality of document parts, wherein each document part corresponds to one set, each set in the plurality of sets comprises at least one data pair, and each data pair comprises: the feature words and the times corresponding to the feature words;
determining a basic position weight value according to the times corresponding to the feature words in a first set, wherein the first set is a set with the most data pairs in the multiple sets;
determining the weight values of all the sets except the first set according to the basic position weight values, preset weight proportion values and all the sets except the first set in the sets so as to distinguish the weight occupied by each position in the target webpage;
and integrating the weighted values of the plurality of sets and all sets except the first set in the plurality of sets to obtain the feature vector of the target webpage, so that feature analysis is performed on the webpage according to the feature vector.
2. The method of claim 1, wherein the dividing the target web page into a plurality of document parts according to the location architecture of the web page information comprises:
and dividing the target webpage into three document parts, namely a title, a keyword and a text according to the position architecture of the webpage information.
3. The method according to claim 2, wherein the performing the word segmentation processing on the plurality of document parts respectively, performing statistics on word segmentation processing results, and obtaining a plurality of sets corresponding to the plurality of document parts comprises:
performing word segmentation on the title document part, performing synonym combination on the word segmentation processing result to obtain a first characteristic word, counting the times corresponding to the first characteristic word, and storing the first characteristic word and the times corresponding to the first characteristic word in a second set corresponding to the title document part in a data pair format, wherein the first characteristic word comprises at least one characteristic word;
performing word segmentation on the keyword document part, performing synonym combination on the word segmentation processing result to obtain a second characteristic word, counting the times corresponding to the second characteristic word, and storing the second characteristic word and the times corresponding to the second characteristic word in a third set corresponding to the keyword document part in a data pair format, wherein the second characteristic word comprises at least one characteristic word;
performing word segmentation on the text document part, performing synonym combination on the word segmentation processing result to obtain a third feature word, counting the times corresponding to the third feature word, and storing the third feature word and the times corresponding to the third feature word in a first set corresponding to the text document part in a data pair format, wherein the third feature word comprises at least one feature word.
4. The method according to claim 1 or 3, wherein determining the base position weight value according to the number of times corresponding to the feature word in the first set comprises:
and determining the maximum numerical value in the times corresponding to all the feature words in the first set as the basic position weight value.
5. The method of claim 3, wherein determining the weight values of all but the first set of the plurality of sets according to the base location weight value, a preset weight proportion value, and all but the first set of the plurality of sets comprises:
multiplying the value obtained by multiplying the basic position weight value by a first preset weight proportion value respectively by the corresponding times of each feature word in the second set to obtain the weight value of each feature word in the second set, wherein the first preset weight proportion value is the weight proportion of the title position of the webpage relative to the text position of the webpage;
and multiplying the value obtained by multiplying the basic position weight value by a second preset weight proportion value respectively by the corresponding times of each feature word in the third set to obtain the weight value of each feature word in the third set, wherein the second preset weight proportion value is the weight proportion of the position of the webpage keyword relative to the position of the webpage text.
6. The method according to claim 1, wherein the integrating the weight values of the plurality of sets and all sets except the first set to obtain the feature vector of the target web page comprises:
adding the weighted values corresponding to the same feature words in the multiple sets, sorting the added weighted values from large to small, and determining the top n weighted values after sorting and the feature words corresponding to the top n weighted values as the feature vectors of the target webpage, wherein n is a natural number.
7. An apparatus for extracting web page features, the apparatus comprising: an acquisition unit, a processing unit, a determination unit, wherein,
the acquisition unit is used for acquiring a target webpage;
the processing unit is configured to divide the target webpage into a plurality of document parts according to a position structure of webpage information, and further configured to perform word segmentation on the plurality of document parts, perform statistics on word segmentation processing results, and obtain a plurality of sets corresponding to the plurality of document parts, where each document part corresponds to one set, each set in the plurality of sets includes at least one data pair, and each data pair includes: the feature words and the times corresponding to the feature words;
the determination unit is configured to determine a basic position weight value according to the numbers of times corresponding to the feature words in a first set, where the first set is the set with the most data pairs among the plurality of sets; and is further configured to determine the weight values of all sets other than the first set in the plurality of sets according to the basic position weight value, preset weight proportion values and all sets other than the first set in the plurality of sets, so as to distinguish the weight occupied by each position in the target webpage;
the processing unit is further configured to perform integration processing on the weighted values of the plurality of sets and all sets except the first set in the plurality of sets to obtain a feature vector of the target webpage, so that feature analysis is performed on the webpage according to the feature vector.
8. The apparatus of claim 7,
the processing unit is used for dividing the target webpage into three document parts, namely a title, a keyword and a text according to the position architecture of the webpage information;
the processing unit is further configured to perform word segmentation on the title document part, perform synonym merging on the word segmentation result to obtain first feature words, count the number of times corresponding to each first feature word, and store the first feature words and their corresponding numbers of times in data pair format in a second set corresponding to the title document part, where the first feature words include at least one feature word;
the processing unit is further configured to perform word segmentation on the keyword document part, perform synonym merging on the word segmentation result to obtain second feature words, count the number of times corresponding to each second feature word, and store the second feature words and their corresponding numbers of times in data pair format in a third set corresponding to the keyword document part, where the second feature words include at least one feature word;
the processing unit is further configured to perform word segmentation on the text document part, perform synonym merging on the word segmentation result to obtain third feature words, count the number of times corresponding to each third feature word, and store the third feature words and their corresponding numbers of times in data pair format in the first set corresponding to the text document part, where the third feature words include at least one feature word.
9. The apparatus according to claim 7 or 8, wherein the determination unit is configured to determine the maximum of the numbers of times corresponding to all feature words in the first set as the basic position weight value;
the processing unit is further configured to multiply the product of the basic position weight value and a first preset weight proportion value by the number of times corresponding to each feature word in the second set, to obtain the weight value of each feature word in the second set, where the first preset weight proportion value is the weight proportion of the webpage title position relative to the webpage text position; and to multiply the product of the basic position weight value and a second preset weight proportion value by the number of times corresponding to each feature word in the third set, to obtain the weight value of each feature word in the third set, where the second preset weight proportion value is the weight proportion of the webpage keyword position relative to the webpage text position.
10. The apparatus of claim 7,
the processing unit is further configured to add weight values corresponding to the same feature words in the multiple sets, and sort the added weight values from large to small;
the determination unit is configured to determine the top n weight values after sorting, together with the feature words corresponding to those n weight values, as the feature vector of the target webpage, where n is a natural number.
CN201611137455.3A 2016-12-12 2016-12-12 Webpage feature extraction method and device Active CN108614825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611137455.3A CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611137455.3A CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Publications (2)

Publication Number Publication Date
CN108614825A CN108614825A (en) 2018-10-02
CN108614825B true CN108614825B (en) 2022-04-15

Family

ID=63657508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611137455.3A Active CN108614825B (en) 2016-12-12 2016-12-12 Webpage feature extraction method and device

Country Status (1)

Country Link
CN (1) CN108614825B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989790B (en) * 2021-03-17 2023-02-28 中国科学院深圳先进技术研究院 Document characterization method and device based on deep learning, equipment and storage medium
CN114528811B (en) * 2022-01-21 2022-09-02 北京麦克斯泰科技有限公司 Article content extraction method, device, equipment and storage medium
CN115858470B (en) * 2022-12-26 2023-09-22 深圳市中政汇智管理咨询有限公司 Policy and regulation file matching method, system, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079031A (en) * 2006-06-15 2007-11-28 腾讯科技(深圳)有限公司 Web page subject extraction system and method
CN101246498A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 News web page searching method
CN103064969A (en) * 2012-12-31 2013-04-24 武汉传神信息技术有限公司 Method for automatically creating keyword index table
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130246184A1 (en) * 2012-03-13 2013-09-19 PowerLinks Media Limited Method and system for displaying a contextual advertisement on a webpage
TW201421267A (en) * 2012-11-21 2014-06-01 Hon Hai Prec Ind Co Ltd Searching system and method
CN103617213B (en) * 2013-11-19 2017-04-19 北京奇虎科技有限公司 Method and system for identifying newspage attributive characters

Also Published As

Publication number Publication date
CN108614825A (en) 2018-10-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310012 building A01, 1600 yuhangtang Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.

Applicant after: China Mobile Communications Corp.

Address before: 310012, No. 14, building three, Chang Torch Hotel, No. 259, Wensanlu Road, Xihu District, Zhejiang, Hangzhou

Applicant before: CHINA MOBILE (HANGZHOU) INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: China Mobile Communications Corp.

GR01 Patent grant