CN105426379A

CN105426379A - Keyword weight calculation method based on position of word

Info

Publication number: CN105426379A
Application number: CN201410563853.6A
Authority: CN
Inventors: 刘永坚; 白立华; 杨朝阳; 李文忠; 杨慧; 朱驰风
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2014-10-22
Filing date: 2014-10-22
Publication date: 2016-03-23

Abstract

The present invention provides a keyword weight calculation method based on word position. The method comprises the following steps: document preprocessing, in which the provided document is preprocessed to obtain text information; keyword extraction, in which keywords are extracted from the preprocessed text information; influence factor acquisition, in which weight factors are acquired for the extracted keywords, on the one hand by obtaining a basic weight and on the other hand by identifying words in the abstract and in the first sentence of each paragraph of the article; weighted calculation, in which the acquired influence factors are used as weight calculation factors to perform the final weight calculation; and keyword weight table output, in which the final keyword weight table is output. The method can accurately calculate the weight parameter of a word, which facilitates keyword analysis and helps readers understand and remember the content of articles.

Description

Keyword weight calculation method based on word position

Technical field

The present invention relates to the technical field of digital publishing, and more particularly to a keyword weight calculation method based on word position.

Background art

A keyword is the word or phrase that a user enters when searching. It summarizes, to the greatest possible extent, the information the user is looking for, and is a generalization and concentration of that information. In the publishing industry, keywords usually refer to the core and main content of an article.

In published articles, the position at which a sentence appears can reflect the importance of that sentence. Likewise, the position at which a word appears can reflect the importance of that word in the article. In many cases, important words appear in the abstract or in the first sentence of a paragraph, so word position can serve as one of the weight calculation factors.

Current keyword weight calculation methods are mostly based on word frequency and do not include word position among the influence factors of the keyword weight calculation.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the deficiency described above, to provide a keyword weight calculation method based on word position that can accurately calculate the weight parameter of a word, facilitates keyword analysis, and helps readers understand and remember the content of an article.

The technical solution adopted for the present invention to solve the technical problems is:

A keyword weight calculation method based on word position, characterized in that it comprises the following steps:

Document preprocessing: in a computer system, the provided document is converted to PDF format with a tool and preprocessed to obtain text information. A parsing tool parses the pages of the PDF document; after parsing, all page data of the PDF document are available. Table-of-contents pages and page paragraphs are identified by table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them.

Keyword extraction: keywords are extracted from the text information obtained by document preprocessing. Against an existing keyword table, keyword extraction is performed paragraph by paragraph on every page of the document, and the results are stored in the computer system.

Influence factor acquisition: weight factors are acquired for the extracted keywords. On the one hand, a basic weight is obtained; on the other hand, words in the abstract and in the first sentence of each paragraph of the article are identified.

Weighted calculation: the acquired influence factors are used as weight calculation factors to perform the final weight calculation.

Keyword weight table output: the final keyword weight list is output.
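The keyword extraction step above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: it assumes the page text has already been extracted from the PDF, treats blank lines as the paragraph feature, and replaces word segmentation with plain substring matching against the keyword table. The names `split_paragraphs`, `extract_keywords`, `pages` and `table` are hypothetical.

```python
from collections import Counter


def split_paragraphs(page_text):
    """Split one page of extracted text into paragraphs on blank lines (assumed paragraph feature)."""
    blocks = page_text.split("\n\n")
    return [" ".join(block.split()) for block in blocks if block.strip()]


def extract_keywords(pages, keyword_table):
    """Return {(page_no, paragraph_no): Counter(keyword -> occurrences)} for each paragraph."""
    result = {}
    for page_no, page_text in enumerate(pages, start=1):
        for para_no, para in enumerate(split_paragraphs(page_text), start=1):
            lowered = para.lower()
            counts = Counter()
            for kw in keyword_table:
                # Substring matching stands in for a real word segmenter here.
                n = lowered.count(kw.lower())
                if n:
                    counts[kw] = n
            if counts:
                result[(page_no, para_no)] = counts
    return result


# Hypothetical sample data: two pages, the first with two paragraphs.
pages = [
    "Keyword weights depend on word position.\n\nWord frequency alone ignores position.",
    "The abstract often contains the most important keyword.",
]
table = ["keyword", "position", "word frequency"]
found = extract_keywords(pages, table)
```

Keying the result by page and paragraph number keeps the paragraph as the unit of extraction, so later steps can look up where each keyword occurred.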

In the above scheme, keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the paragraph keywords.

In the above scheme, the detailed steps for acquiring the weight are:

Obtaining the basic weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword, giving the basic weight f.

Identifying words in the abstract and in the first sentence of a paragraph: the abstract and the paragraphs are identified by the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
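The identification step above can be illustrated with a short Python sketch. It assumes the abstract and the paragraph have already been separated, that a sentence ends at the first Western or Chinese terminator, and that keywords are matched as substrings; the function names and sample text are hypothetical. A simple terminator rule like this one would also split at a decimal point, so real text would need a more careful sentence splitter.

```python
import re

# Assumed sentence terminators: Western and Chinese full stop, question and exclamation marks.
SENTENCE_END = re.compile(r"[.!?。！？]")


def first_sentence(paragraph):
    """Return the text up to and including the first sentence terminator."""
    match = SENTENCE_END.search(paragraph)
    return paragraph[: match.end()] if match else paragraph


def position_counts(abstract, paragraph, keywords):
    """Count each keyword in the abstract and in the first sentence of one paragraph."""
    abstract_l = abstract.lower()
    head = first_sentence(paragraph).lower()
    in_abstract = {kw: abstract_l.count(kw.lower()) for kw in keywords}
    in_first_sentence = {kw: head.count(kw.lower()) for kw in keywords}
    return in_abstract, in_first_sentence


# Hypothetical sample text.
abstract = "This method weights each keyword by position. Keyword position matters."
paragraph = "Position is a weighting factor. Frequency is the usual factor."
t_a, h_a = position_counts(abstract, paragraph, ["keyword", "position", "frequency"])
```

The two dictionaries returned here correspond to the per-word counts that the weighted calculation uses for the abstract and for the paragraph.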

In the above scheme, the weighted calculation is specifically: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight.

Weight calculation formula:

W_{k}(f_{k}, a_{k}, h_{k}) = f_{k} + f(a_{k}) + g(h_{k})

f(a_{k}) = \frac{t_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \alpha

g(h_{k}) = \frac{h_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \sigma

tf(t_{k}, d_{j}): the frequency of word k in document j.

t_{a}: the frequency of word k in the abstract.

h_{a}: the frequency of word k in this paragraph.

α, σ: adjustment factors, whose values are obtained from extensive statistical testing.

t_{k}: the number of times word k occurs in the article.

d_{j}: the total number of words in document j.

N: the total number of documents.

n_{k}: the number of documents containing the word.
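As a rough sketch, the weight formula above can be evaluated in Python as follows. The sketch assumes the natural logarithm (the source does not state the base), takes the sum in the denominator over the documents j exactly as the formula is written, and returns 0 for a position term when the denominator is 0. The function names, parameter names and sample values are illustrative only; in particular, the values used for α and σ are made up, since the source gives none.

```python
import math


def position_term(pos_tf, tf_by_doc, j, n_docs, n_k, factor):
    """Evaluate f(a_k) or g(h_k): pos_tf is t_a or h_a, factor is alpha or sigma."""
    idf = math.log(n_docs / n_k)
    # Denominator: square root of the sum over documents j of (tf * log(N / n_k))^2.
    norm = math.sqrt(sum((tf * idf) ** 2 for tf in tf_by_doc))
    if norm == 0:
        # Assumption: a zero denominator (for example n_k == N) contributes no position weight.
        return 0.0
    return pos_tf * tf_by_doc[j] * math.log(n_docs / n_k + 0.01) / norm * factor


def keyword_weight(f_k, t_a, h_a, tf_by_doc, j, n_docs, n_k, alpha, sigma):
    """W_k = f_k + f(a_k) + g(h_k)."""
    return (
        f_k
        + position_term(t_a, tf_by_doc, j, n_docs, n_k, alpha)
        + position_term(h_a, tf_by_doc, j, n_docs, n_k, sigma)
    )


# Hypothetical values: basic weight 1.0, word appears twice in the abstract and
# once in the paragraph, tf of 3 and 4 in two documents, 10 documents in total,
# 2 of which contain the word.
w = keyword_weight(1.0, t_a=2, h_a=1, tf_by_doc=[3, 4], j=0,
                   n_docs=10, n_k=2, alpha=0.5, sigma=0.25)
```

Because both position terms are added to the basic weight, a word that appears in neither the abstract nor the paragraph keeps exactly its basic weight.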

The principle of the present invention is as follows: in a computer system, a tool parses the PDF document; the ansj component extracts keywords paragraph by paragraph from the parsed information; the xsimilarity component compares the extracted keywords pairwise and merges synonyms; the ansj component interface calculates the keyword weights, which are stored in a database; finally, the weight annotation information of each paragraph can be viewed in the electronic document.
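The pairwise comparison and synonym merging mentioned above might look like the following Python sketch. The xsimilarity component is not reproduced here. As a stand-in, the sketch uses the string similarity from Python's `difflib`, which measures character overlap rather than meaning, together with an assumed threshold of 0.8 and a simple greedy grouping; all names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; a semantic measure would be used in practice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_synonyms(keywords, threshold=0.8):
    """Greedily group each keyword with the first group whose representative is similar enough."""
    groups = []
    for kw in keywords:
        for group in groups:
            # The first member of each group acts as its representative.
            if similarity(kw, group[0]) >= threshold:
                group.append(kw)
                break
        else:
            groups.append([kw])
    return groups


groups = merge_synonyms(["keyword", "keywords", "weight", "position"])
```

Comparing against one representative per group keeps the number of comparisons small, at the cost of depending on the order in which keywords arrive.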

The beneficial effects of the present invention are:

The method of the present invention can accurately calculate the weight parameter of a word, which facilitates keyword analysis and helps readers understand and remember the content of an article.

Brief description of the drawings

Fig. 1 is a flowchart of an embodiment of the present invention.

Embodiment

The present invention is further described below with reference to an embodiment:

The keyword weight calculation method based on word position, as shown in Fig. 1, comprises the following steps:

Document preprocessing: if the provided document is a PDF or similar document, it is preprocessed to obtain text information, specifically with a corresponding text parsing tool. A parsing tool parses the pages of the PDF document; after parsing, all page data of the PDF document are available. Table-of-contents pages and page paragraphs are identified by table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them.

Keyword extraction: keywords are extracted from the text information obtained by document preprocessing. Against an existing keyword table, keyword extraction is performed paragraph by paragraph on every page of the document, and the results are stored in the computer system.

Keyword weight table output: the final keyword weight list is output.

In the present embodiment, keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the paragraph keywords.

In the above scheme, the detailed steps for acquiring the weight are:

In the present embodiment, the weighted calculation is specifically: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight.

Weight calculation formula:

W_{k}(f_{k}, a_{k}, h_{k}) = f_{k} + f(a_{k}) + g(h_{k})

f(a_{k}) = \frac{t_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \alpha

g(h_{k}) = \frac{h_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \sigma

tf(t_{k}, d_{j}): the frequency of word k in document j.

t_{a}: the frequency of word k in the abstract.

h_{a}: the frequency of word k in this paragraph.

t_{k}: the number of times word k occurs in the article.

d_{j}: the total number of words in document j.

N: the total number of documents.

n_{k}: the number of documents containing the word.

The protection scope of the present invention is not limited to the above embodiment. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its scope and spirit. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A keyword weight calculation method based on word position, characterized in that it comprises the following steps:

Document preprocessing: in a computer system, the provided document is converted to PDF format with a tool and preprocessed to obtain text information, specifically with a corresponding text parsing tool; a parsing tool parses the pages of the PDF document, and after parsing, all page data of the PDF document are available; table-of-contents pages and page paragraphs are identified by table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them;

Keyword extraction: keywords are extracted from the text information obtained by document preprocessing; against an existing keyword table, keyword extraction is performed paragraph by paragraph on every page of the document, and the results are stored in the computer system;

Influence factor acquisition: weight factors are acquired for the extracted keywords; on the one hand, a basic weight is obtained; on the other hand, words in the abstract and in the first sentence of each paragraph of the article are identified;

Weighted calculation: the acquired influence factors are used as weight calculation factors to perform the final weight calculation;

Keyword weight table output: the final keyword weight list is output.

2. The keyword weight calculation method based on word position according to claim 1, characterized in that keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the paragraph keywords.

3. The keyword weight calculation method based on word position according to claim 1, characterized in that the detailed steps for acquiring the weight are:

Obtaining the basic weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword, giving the basic weight f;

4. The keyword weight calculation method based on word position according to claim 1, characterized in that the weighted calculation is specifically: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight;

Weight calculation formula:

W_{k}(f_{k}, a_{k}, h_{k}) = f_{k} + f(a_{k}) + g(h_{k})

f(a_{k}) = \frac{t_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \alpha

g(h_{k}) = \frac{h_{a} \times tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}} + 0.01)}{\sqrt{\sum_{j} (tf(t_{k}, d_{j}) \times \log(\frac{N}{n_{k}}))^{2}}} \times \sigma

tf(t_{k}, d_{j}): the frequency of word k in document j.

t_{a}: the frequency of word k in the abstract.

h_{a}: the frequency of word k in this paragraph.

t_{k}: the number of times word k occurs in the article.

d_{j}: the total number of words in document j.

N: the total number of documents.

n_{k}: the number of documents containing the word.