CN105426379A - Keyword weight calculation method based on position of word - Google Patents

Info

Publication number
CN105426379A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410563853.6A
Other languages
Chinese (zh)
Inventor
刘永坚
白立华
杨朝阳
李文忠
杨慧
朱驰风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN201410563853.6A priority Critical patent/CN105426379A/en
Publication of CN105426379A publication Critical patent/CN105426379A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a keyword weight calculation method based on word position. The method comprises the following steps: document pre-processing: pre-processing the provided document to obtain text information; keyword extraction: extracting keywords from the text information obtained by pre-processing; acquiring influence factors: acquiring weight factors for the extracted keywords, on the one hand obtaining a base weight and on the other hand identifying the words in the abstract and in the first sentence of each paragraph of the article; weighted calculation: using the acquired influence factors as weight calculation factors to compute the final weight; and outputting the keyword weight table: outputting the final keyword weight table. The method accurately calculates the weight parameter of a word, which facilitates keyword analysis and helps readers understand and remember the content of an article.

Description

Keyword weight calculation method based on word position
Technical field
The present invention relates to the technical field of digital publishing, and in particular to a keyword weight calculation method based on word position.
Background
A keyword is a word or phrase that a user enters when searching and that summarizes, as far as possible, the information the user is looking for; it is a generalization and concentration of information. In the publishing industry, keywords usually refer to the core and main content of an article.
In published articles, the position at which a sentence appears reflects the importance of that sentence, and likewise the position at which a word appears reflects the importance of that word. In many cases, important words appear in the abstract and in the first sentence of a paragraph, so word position can serve as one factor in weight calculation.
Current methods for calculating keyword weights are mostly based on word frequency and do not take word position into account as an influence factor.
Summary of the invention
The technical problem to be solved by the present invention is to address the above deficiency by providing a keyword weight calculation method based on word position that accurately calculates the weight parameter of a word, facilitates keyword analysis, and helps readers understand and remember the content of an article.
The technical solution adopted by the present invention to solve this technical problem is:
A keyword weight calculation method based on word position, characterized in that it comprises the following steps:
Document pre-processing: in a computer system, the provided document is converted to PDF format with a conversion tool and pre-processed to obtain text information. A parsing tool parses the pages of the PDF document; after parsing, all page data of the PDF document is available, the table-of-contents pages and the paragraphs on each page are identified from table-of-contents and paragraph features, and the data is stored in a suitable form so that subsequent processing such as word segmentation can call it.
Keyword extraction: keywords are extracted from the text information obtained by document pre-processing. Against an existing keyword table, keywords are extracted paragraph by paragraph from every paragraph on every page of the document and stored in the computer system.
Acquiring influence factors: weight factors are acquired for the extracted keywords. On the one hand, a base weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified.
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight.
Outputting the keyword weight table: the final list of keyword weights is output.
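The keyword extraction step can be sketched as follows, assuming the document has already been parsed into pages and paragraphs. The `extract_keywords` helper, the keyword table and the sample text are hypothetical, and whitespace splitting stands in for the ansj segmenter that the patent applies to Chinese text.

```python
from collections import Counter


def extract_keywords(pages, keyword_table):
    """Return {(page_no, para_no): Counter of table keywords} for each paragraph.

    pages: list of pages, each a list of paragraph strings.
    keyword_table: set of known keywords (stand-in for the existing keyword table).
    """
    result = {}
    for page_no, paragraphs in enumerate(pages, start=1):
        for para_no, text in enumerate(paragraphs, start=1):
            # Whitespace splitting is an assumption; a real system would segment words here.
            tokens = text.lower().split()
            result[(page_no, para_no)] = Counter(
                tok for tok in tokens if tok in keyword_table
            )
    return result


pages = [
    ["keyword weight depends on position", "word frequency alone ignores position"],
    ["the abstract often holds the important word"],
]
table = {"keyword", "weight", "position", "frequency", "abstract"}
per_paragraph = extract_keywords(pages, table)
```

Keying the result by page and paragraph number keeps the paragraph as the unit of extraction, which later steps need in order to tell where a word occurred.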
In the above scheme, keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the keywords of each paragraph.
In the above scheme, the detailed steps for acquiring the weight are:
Obtaining the base weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword to obtain the base weight f.
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
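A minimal sketch of the position identification step, assuming the abstract and the paragraphs are already available as plain strings. The helper names are hypothetical, the regular-expression tokenizer stands in for a real segmenter, and summing first-sentence counts over all paragraphs is one possible reading of the per-paragraph count.

```python
import re
from collections import Counter

# Everything up to and including the first sentence-ending mark (Western or Chinese).
SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]?")


def first_sentence(paragraph):
    """Return the first sentence of a paragraph."""
    return SENTENCE.match(paragraph.strip()).group(0)


def tokenize(text):
    """Lower-case word tokens; an assumption in place of a real segmenter."""
    return re.findall(r"\w+", text.lower())


def position_counts(abstract, paragraphs, keywords):
    """Count keyword occurrences in the abstract and in paragraph first sentences."""
    in_abstract = Counter(t for t in tokenize(abstract) if t in keywords)
    in_first = Counter()
    for paragraph in paragraphs:
        in_first.update(
            t for t in tokenize(first_sentence(paragraph)) if t in keywords
        )
    return in_abstract, in_first


abstract = "Position affects keyword weight."
paragraphs = [
    "Keyword weight is usually based on frequency. Position is ignored.",
    "Position matters. The first sentence carries the topic.",
]
keywords = {"keyword", "weight", "position", "frequency"}
in_abstract, in_first = position_counts(abstract, paragraphs, keywords)
```

The two counters correspond to the abstract and first-sentence frequencies that the weighting step consumes.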
In the above scheme, the weighted calculation is specifically: on top of the base weight, words identified in paragraph first sentences and in the abstract are given additional weight, finally yielding a position-weighted weight.
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
$$f(a_k) = \frac{t_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \alpha$$
$$g(h_k) = \frac{h_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \sigma$$
$tf(t_k, d_j)$: the frequency of word k in document j.
$t_a$: the frequency of word k in the abstract.
$h_a$: the frequency of word k in the current paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from extensive statistical testing.
$t_k$: the number of times word k occurs in the article.
$d_j$: the total number of words in document j.
$N$: the total number of documents.
$n_k$: the number of documents containing word k.
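The weight formula can be sketched in code as below. This is one reading of the formula, not a definitive implementation: the sum of squares is taken as the denominator without a square root (the extracted text does not show whether the original has one, as cosine-normalised TF-IDF would), the logarithm is assumed to be natural, and the default values of `alpha` and `sigma` are placeholders, since the patent only says they come from statistical testing. All function names are hypothetical.

```python
import math


def norm_sum(doc_tfs, n_docs, n_docs_with_word):
    """Denominator: sum over documents j of (tf(t_k, d_j) * log(N / n_k))^2."""
    idf = math.log(n_docs / n_docs_with_word)
    return sum((tf * idf) ** 2 for tf in doc_tfs)


def position_term(pos_tf, tf, n_docs, n_docs_with_word, norm, factor):
    """Shared shape of f(a_k) and g(h_k): positional frequency times a TF-IDF-like ratio."""
    numerator = pos_tf * tf * math.log(n_docs / n_docs_with_word + 0.01)
    return numerator / norm * factor


def keyword_weight(base, t_a, h_a, tf, doc_tfs, n_docs, n_docs_with_word,
                   alpha=0.5, sigma=0.3):
    """W_k = f_k + f(a_k) + g(h_k); alpha and sigma defaults are placeholders."""
    norm = norm_sum(doc_tfs, n_docs, n_docs_with_word)
    if norm == 0:
        # The word occurs in every document (log 1 = 0): no positional bonus is defined.
        return base
    f_a = position_term(t_a, tf, n_docs, n_docs_with_word, norm, alpha)
    g_h = position_term(h_a, tf, n_docs, n_docs_with_word, norm, sigma)
    return base + f_a + g_h


# Word k: base weight 1.0, twice in the abstract, once in a first sentence,
# tf 3 in this document and 1 in another, found in 10 of 100 documents.
w = keyword_weight(base=1.0, t_a=2, h_a=1, tf=3, doc_tfs=[3, 1],
                   n_docs=100, n_docs_with_word=10)
```

A word that appears neither in the abstract nor in a first sentence keeps its base weight unchanged, so the positional terms act purely as a bonus.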
The principle of the present invention is as follows: a tool parses the PDF document; in the computer system, the ansj component extracts keywords paragraph by paragraph from the parsed information; the xsimilarity component compares the extracted keywords pairwise and merges synonyms; the ansj component interface calculates the keyword weights, which are stored in a database; finally, the weight annotation information of each paragraph can be viewed in the electronic document.
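The pairwise synonym merging can be illustrated with a small union-find sketch. The patent uses the xsimilarity component for semantic comparison; here `difflib.SequenceMatcher` from the Python standard library stands in as a purely surface-level string similarity, and the threshold of 0.8, the `merge_synonyms` helper and the sample counts are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_similarity(a, b):
    """Surface string similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a, b).ratio()


def merge_synonyms(counts, similarity=string_similarity, threshold=0.8):
    """Compare keywords pairwise and sum the counts of those judged synonymous."""
    parent = {word: word for word in counts}

    def find(word):
        # Follow parent links to the group representative, shortening the path.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in combinations(counts, 2):
        if similarity(a, b) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    merged = {}
    for word, count in counts.items():
        root = find(word)
        merged[root] = merged.get(root, 0) + count
    return merged


merged = merge_synonyms({"keyword": 3, "keywords": 2, "weight": 4})
```

Passing the similarity function as a parameter makes it straightforward to swap in a semantic measure without touching the grouping logic.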
The beneficial effects of the present invention are:
The method accurately calculates the weight parameter of a word, facilitates keyword analysis, and helps readers understand and remember the content of an article.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention.
Embodiment
The present invention is further described below with reference to an embodiment:
As shown in Fig. 1, the keyword weight calculation method based on word position comprises the following steps:
Document pre-processing: a provided document, for example a PDF, is pre-processed to obtain text information, using a text parsing tool suited to the document's format. A parsing tool parses the pages of the PDF document; after parsing, all page data of the PDF document is available, the table-of-contents pages and the paragraphs on each page are identified from table-of-contents and paragraph features, and the data is stored in a suitable form so that subsequent processing such as word segmentation can call it.
Keyword extraction: keywords are extracted from the text information obtained by document pre-processing. Against an existing keyword table, keywords are extracted paragraph by paragraph from every paragraph on every page of the document and stored in the computer system.
Acquiring influence factors: weight factors are acquired for the extracted keywords. On the one hand, a base weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified.
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight.
Outputting the keyword weight table: the final list of keyword weights is output.
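The final output step might look like the following sketch, which sorts keywords by descending weight and serialises them as CSV. The column names, the four-decimal formatting and the sample weights are assumptions; the patent does not specify an output format.

```python
import csv
import io


def weight_table(weights):
    """Sort keywords by descending weight, breaking ties alphabetically."""
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))


def to_csv(rows):
    """Serialise the keyword weight table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "weight"])
    for keyword, weight in rows:
        writer.writerow([keyword, f"{weight:.4f}"])
    return buffer.getvalue()


rows = weight_table({"position": 1.25, "keyword": 2.5, "frequency": 1.25})
table_text = to_csv(rows)
```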
In the present embodiment, keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the keywords of each paragraph.
In the above scheme, the detailed steps for acquiring the weight are:
Obtaining the base weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword to obtain the base weight f.
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
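For illustration, the base weight f can be approximated with a plain TF-IDF computation. This is a stand-in, not the ansj or Lucene interface the patent refers to; the `base_weights` helper, the natural logarithm and the sample documents are assumptions.

```python
import math
from collections import Counter


def base_weights(docs):
    """Compute a plain TF-IDF weight per word for each document.

    docs: list of token lists. Returns one {word: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document

    result = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        result.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return result


docs = [["keyword", "weight", "keyword"], ["weight", "position"]]
weights = base_weights(docs)
```

A word that occurs in every document receives a base weight of zero here, which is the usual TF-IDF behaviour and one reason a positional bonus can be informative.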
In the present embodiment, the weighted calculation is specifically: on top of the base weight, words identified in paragraph first sentences and in the abstract are given additional weight, finally yielding a position-weighted weight.
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
$$f(a_k) = \frac{t_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \alpha$$
$$g(h_k) = \frac{h_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \sigma$$
$tf(t_k, d_j)$: the frequency of word k in document j.
$t_a$: the frequency of word k in the abstract.
$h_a$: the frequency of word k in the current paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from extensive statistical testing.
$t_k$: the number of times word k occurs in the article.
$d_j$: the total number of words in document j.
$N$: the total number of documents.
$n_k$: the number of documents containing word k.
The scope of protection of the present invention is not limited to the above embodiment. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its scope and spirit. If such changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims (4)

1. A keyword weight calculation method based on word position, characterized in that it comprises the following steps:
Document pre-processing: in a computer system, the provided document is converted to PDF format with a conversion tool and pre-processed to obtain text information; specifically, a text parsing tool suited to the document's format may be used; a parsing tool parses the pages of the PDF document; after parsing, all page data of the PDF document is available, the table-of-contents pages and the paragraphs on each page are identified from table-of-contents and paragraph features, and the data is stored in a suitable form so that subsequent processing such as word segmentation can call it;
Keyword extraction: keywords are extracted from the text information obtained by document pre-processing; against an existing keyword table, keywords are extracted paragraph by paragraph from every paragraph on every page of the document and stored in the computer system;
Acquiring influence factors: weight factors are acquired for the extracted keywords; on the one hand, a base weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified;
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight;
Outputting the keyword weight table: the final list of keyword weights is output.
2. The keyword weight calculation method based on word position according to claim 1, characterized in that keyword extraction is specifically: the ansj word segmentation component segments the PDF content paragraph by paragraph and extracts the keywords of each paragraph.
3. The keyword weight calculation method based on word position according to claim 1, characterized in that the detailed steps for acquiring the weight are:
Obtaining the base weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword to obtain the base weight f;
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
4. The keyword weight calculation method based on word position according to claim 1, characterized in that the weighted calculation is specifically: on top of the base weight, words identified in paragraph first sentences and in the abstract are given additional weight, finally yielding a position-weighted weight;
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
$$f(a_k) = \frac{t_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \alpha$$
$$g(h_k) = \frac{h_a \times tf(t_k, d_j) \times \log\left(\frac{N}{n_k} + 0.01\right)}{\sum_j \left(tf(t_k, d_j) \times \log\frac{N}{n_k}\right)^2} \times \sigma$$
$tf(t_k, d_j)$: the frequency of word k in document j.
$t_a$: the frequency of word k in the abstract.
$h_a$: the frequency of word k in the current paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from extensive statistical testing.
$t_k$: the number of times word k occurs in the article.
$d_j$: the total number of words in document j.
$N$: the total number of documents.
$n_k$: the number of documents containing word k.
CN201410563853.6A 2014-10-22 2014-10-22 Keyword weight calculation method based on position of word Pending CN105426379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410563853.6A CN105426379A (en) 2014-10-22 2014-10-22 Keyword weight calculation method based on position of word

Publications (1)

Publication Number Publication Date
CN105426379A true CN105426379A (en) 2016-03-23

Family

ID=55504592

Country Status (1)

Country Link
CN (1) CN105426379A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092693A (en) * 2017-04-25 2017-08-25 厦门众智创库企业管理咨询有限公司 A kind of document keyword fast scanning method
CN109062895A (en) * 2018-07-23 2018-12-21 挖财网络技术有限公司 A kind of intelligent semantic processing method
CN109726400A (en) * 2018-12-29 2019-05-07 新华网股份有限公司 Entity word recognition result evaluation method, apparatus, equipment and entity word extraction system
CN109977198A (en) * 2019-04-01 2019-07-05 北京百度网讯科技有限公司 Establish method and apparatus, the hardware device, computer-readable medium of mapping relations
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
CN111611342A (en) * 2020-04-09 2020-09-01 中南大学 Method and device for obtaining lexical item and paragraph association weight

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251841A (en) * 2007-05-17 2008-08-27 华东师范大学 Method for establishing and searching feature matrix of Web document based on semantics
CN101710317A (en) * 2009-11-17 2010-05-19 上海第二工业大学 Word partial weight calculating method based on word distribution

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092693A (en) * 2017-04-25 2017-08-25 厦门众智创库企业管理咨询有限公司 A kind of document keyword fast scanning method
CN109062895A (en) * 2018-07-23 2018-12-21 挖财网络技术有限公司 A kind of intelligent semantic processing method
CN109726400A (en) * 2018-12-29 2019-05-07 新华网股份有限公司 Entity word recognition result evaluation method, apparatus, equipment and entity word extraction system
CN109977198A (en) * 2019-04-01 2019-07-05 北京百度网讯科技有限公司 Establish method and apparatus, the hardware device, computer-readable medium of mapping relations
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium
CN111611342A (en) * 2020-04-09 2020-09-01 中南大学 Method and device for obtaining lexical item and paragraph association weight
CN111611342B (en) * 2020-04-09 2023-04-18 中南大学 Method and device for obtaining lexical item and paragraph association weight

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160323