CN105426379A - Keyword weight calculation method based on position of word - Google Patents
- Publication number
- CN105426379A
- Authority
- CN
- China
- Prior art keywords
- word
- weight
- key word
- document
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The present invention provides a keyword weight calculation method based on word position. The method comprises the following steps: document pre-processing, in which the provided document is pre-processed to obtain text information; keyword extraction, in which keywords are extracted from the pre-processed text information; influence factor acquisition, in which weight factors are obtained for the extracted keywords, on the one hand by obtaining a basic weight and on the other hand by identifying the words that appear in the abstract and in the first sentence of each paragraph of the article; weighted calculation, in which the acquired influence factors are used as weight calculation factors to compute the final weight; and output of a keyword weight table, in which the final keyword weight table is output. The method can calculate the weight parameter of a word accurately, which facilitates keyword analysis and helps readers understand and remember the content of an article.
Description
Technical field
The present invention relates to the technical field of digital publishing, and in particular to a keyword weight calculation method based on word position.
Background art
A keyword is a word or phrase that a user enters when searching and that summarizes, as fully as possible, the information the user is looking for; it is a generalization and concentration of that information. In the publishing industry, keywords usually refer to the core and main content of an article.
In published articles, the position at which a sentence appears reflects the importance of that sentence. Likewise, the position at which a word appears reflects the importance of that word: in many cases, important words appear in the abstract or in the first sentence of a paragraph. The position of a word can therefore serve as one of the weight calculation factors.
Current methods of calculating keyword weights are mostly based on word frequency and do not include word position among the influence factors of the weight calculation.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the technical deficiency described above, to provide a keyword weight calculation method based on word position that can accurately calculate the weight parameter of a word, facilitates keyword analysis, and helps readers understand and remember the content of an article.
The technical solution adopted by the present invention to solve this technical problem is as follows:
A keyword weight calculation method based on word position, characterized by comprising the following steps:
Document pre-processing: in a computer system, the provided document is converted to pdf format with a tool and pre-processed to obtain text information. A parsing tool parses the pages of the pdf document; after parsing, all page data of the pdf document are available. Table-of-contents pages and page paragraphs are identified from table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them conveniently.
Keyword extraction: keywords are extracted from the pre-processed text information by comparison against an existing keyword table. Extraction is carried out paragraph by paragraph for every paragraph of every page of the document, and the results are stored in the computer system.
Influence factor acquisition: weight factors are obtained for the extracted keywords. On the one hand, the basic weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified.
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight.
Output of the keyword weight table: the final list of keyword weights is output.
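The keyword extraction step above can be illustrated with a minimal sketch. It assumes the text has already been split into paragraphs and that the existing keyword table is a plain list of words; the function name `extract_keywords` is hypothetical, and the regular-expression tokenizer is only a stand-in for a real segmenter such as ansj, suitable for space-delimited text.

```python
import re
from collections import Counter

def extract_keywords(paragraphs, keyword_table):
    """Return one Counter per paragraph holding counts of known keywords.

    Assumption: a regex word split replaces a real word segmenter,
    and the keyword table is matched case-insensitively.
    """
    known = {k.lower() for k in keyword_table}
    results = []
    for text in paragraphs:
        tokens = re.findall(r"\w+", text.lower())
        # Keep only tokens that appear in the existing keyword table.
        results.append(Counter(t for t in tokens if t in known))
    return results

paragraphs = [
    "Keyword weight depends on position. Position matters in an abstract.",
    "Word frequency alone ignores position.",
]
per_paragraph = extract_keywords(paragraphs, ["position", "weight", "frequency"])
```

Keeping one counter per paragraph preserves the paragraph-level granularity that the later position weighting relies on.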
In the above scheme, keyword extraction is specifically as follows: the ansj word segmentation component segments the pdf content paragraph by paragraph and extracts the keywords of each paragraph.
In the above scheme, the detailed steps for obtaining the weight factors are:
Obtaining the basic weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword, giving the basic weight f.
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
In the above scheme, the weighted calculation is specifically as follows: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight.
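One possible way to identify keyword occurrences in the abstract and in paragraph first sentences is sketched below. The sentence-boundary rule (the first `.`, `!`, `?` or their full-width forms) and the case-insensitive substring match are assumptions made for illustration; the text does not specify which features are used, and the names `first_sentence` and `position_counts` are hypothetical.

```python
import re

# Assumed sentence terminators, including full-width Chinese punctuation.
SENTENCE_END = re.compile(r"[.!?。！？]")

def first_sentence(paragraph):
    """Return the text up to and including the first sentence terminator."""
    match = SENTENCE_END.search(paragraph)
    return paragraph if match is None else paragraph[: match.end()]

def position_counts(abstract, paragraphs, keyword):
    """Count a keyword in the abstract and in paragraph first sentences."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    in_abstract = len(pattern.findall(abstract))
    in_first = sum(len(pattern.findall(first_sentence(p))) for p in paragraphs)
    return in_abstract, in_first

counts = position_counts(
    "Position weighting of keywords.",
    ["Position comes first. Later position.", "Nothing here. position again."],
    "position",
)
```

A simple terminator search will misfire on abbreviations such as "e.g.", so a production system would likely need a more careful sentence splitter.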
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
where:
$tf(t_k, d_j)$: the frequency of word $k$ in document $j$.
$T_a$: the frequency of word $k$ in the abstract.
$H_a$: the frequency of word $k$ in the paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from a large number of statistical tests.
$T_k$: the number of times word $k$ occurs in the article.
$D_j$: the total number of words in document $j$.
$N$: the total number of documents.
$N_k$: the number of documents containing the word.
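The formula can be sketched in code under explicit assumptions. The text names the symbols but does not reproduce the expressions for $f_k$, $f(\cdot)$ and $g(\cdot)$, so the sketch assumes a TF-IDF style basic weight built from $T_k$, $D_j$, $N$ and $N_k$, and linear position terms $f(a_k) = \alpha T_a$ and $g(h_k) = \sigma H_a$. The default values of `alpha` and `sigma` are arbitrary placeholders, not values from the source.

```python
import math

def base_weight(t_k, d_j, n, n_k):
    """Assumed TF-IDF form of the basic weight f_k.

    t_k: occurrences of word k in the article
    d_j: total number of words in document j
    n:   total number of documents
    n_k: number of documents containing word k
    """
    tf = t_k / d_j
    idf = math.log(n / n_k)
    return tf * idf

def positional_weight(f_k, t_a, h_a, alpha=0.1, sigma=0.05):
    """W_k = f_k + f(a_k) + g(h_k), with f and g assumed to be linear."""
    return f_k + alpha * t_a + sigma * h_a

f_k = base_weight(t_k=5, d_j=100, n=1000, n_k=10)
w_k = positional_weight(f_k, t_a=2, h_a=3)
```

In practice the basic weight would come from the ansj or Lucene interface mentioned above, and `positional_weight` would only add the two position terms to it.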
The principle of the present invention is as follows: in a computer system, a tool parses the pdf document; the ansj component extracts keywords paragraph by paragraph from the parsed information; the xsimilarity component compares the extracted keywords pairwise and merges synonyms; the ansj component interface calculates the keyword weights, which are stored in a database; finally, the weight annotation information of each paragraph can be viewed in the electronic document.
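The pairwise synonym merging can be sketched with a small union-find structure. The similarity function is passed in as a parameter because the xsimilarity API is not described in the text; `toy_similarity`, the threshold value, and the choice to sum the weights of merged keywords are all assumptions made for illustration.

```python
def merge_synonyms(weights, similarity, threshold=0.9):
    """Group keywords whose pairwise similarity meets the threshold.

    Assumption: the weights of merged keywords are summed under the
    first-seen representative of each group.
    """
    words = list(weights)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative (path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Compare every pair once and union the groups of similar words.
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(b)] = find(a)

    merged = {}
    for w in words:
        root = find(w)
        merged[root] = merged.get(root, 0.0) + weights[w]
    return merged

def toy_similarity(a, b):
    """Hypothetical stand-in for a real synonym similarity measure."""
    return 1.0 if {a, b} == {"keyword", "key word"} else 0.0

merged = merge_synonyms(
    {"keyword": 0.5, "key word": 0.25, "paragraph": 0.25}, toy_similarity
)
```

The pairwise loop is quadratic in the number of keywords, which is acceptable for paragraph-sized keyword sets but would need indexing for large vocabularies.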
The beneficial effects of the present invention are:
The method of the present invention can accurately calculate the weight parameter of a word, which facilitates keyword analysis and helps readers understand and remember the content of an article.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Embodiment
The present invention is further described below with reference to an embodiment:
As shown in Fig. 1, the keyword weight calculation method based on word position comprises the following steps:
Document pre-processing: a provided document in pdf or a similar format is pre-processed to obtain text information, specifically with the corresponding text parsing tool. A parsing tool parses the pages of the pdf document; after parsing, all page data of the pdf document are available. Table-of-contents pages and page paragraphs are identified from table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them conveniently.
Keyword extraction: keywords are extracted from the pre-processed text information by comparison against an existing keyword table. Extraction is carried out paragraph by paragraph for every paragraph of every page of the document, and the results are stored in the computer system.
Influence factor acquisition: weight factors are obtained for the extracted keywords. On the one hand, the basic weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified.
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight.
Output of the keyword weight table: the final list of keyword weights is output.
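The pre-processing step can be sketched as follows, assuming the page text has already been extracted from the pdf by a parsing tool (the pdf parsing itself is omitted). The table-of-contents heuristic (most lines end in dot leaders or wide gaps followed by a page number) and the blank-line paragraph rule are assumptions; the text only says that table-of-contents and paragraph features are used. All function names are hypothetical.

```python
import re

# Assumed shape of a table-of-contents line: title, dot leaders or a wide
# gap, then a trailing page number.
TOC_LINE = re.compile(r".+?(\.{2,}|\s{2,})\s*\d+\s*$")

def is_toc_page(page_text, min_ratio=0.5):
    """Treat a page as a table of contents if enough lines look like entries."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if TOC_LINE.match(ln))
    return hits / len(lines) >= min_ratio

def split_paragraphs(page_text):
    """Split on blank lines and collapse internal whitespace."""
    blocks = re.split(r"\n\s*\n", page_text)
    return [" ".join(b.split()) for b in blocks if b.strip()]

def preprocess(pages):
    """Return (page_index, paragraph_index, text) records for non-TOC pages."""
    records = []
    for page_no, text in enumerate(pages):
        if is_toc_page(text):
            continue
        for para_no, para in enumerate(split_paragraphs(text)):
            records.append((page_no, para_no, para))
    return records

pages = [
    "Contents\n1 Introduction ..... 1\n2 Method ........ 5\n",
    "First paragraph line one\nline two.\n\nSecond paragraph.",
]
records = preprocess(pages)
```

Storing page and paragraph indices alongside the text keeps each paragraph addressable for the later segmentation and weighting steps.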
In the present embodiment, keyword extraction is specifically as follows: the ansj word segmentation component segments the pdf content paragraph by paragraph and extracts the keywords of each paragraph.
In the present embodiment, the detailed steps for obtaining the weight factors are:
Obtaining the basic weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword, giving the basic weight f.
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
In the present embodiment, the weighted calculation is specifically as follows: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight.
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
where:
$tf(t_k, d_j)$: the frequency of word $k$ in document $j$.
$T_a$: the frequency of word $k$ in the abstract.
$H_a$: the frequency of word $k$ in the paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from a large number of statistical tests.
$T_k$: the number of times word $k$ occurs in the article.
$D_j$: the total number of words in document $j$.
$N$: the total number of documents.
$N_k$: the number of documents containing the word.
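As a worked illustration of the formula, assume (these forms are not given in the text) that the basic weight is $f_k = \frac{T_k}{D_j}\ln\frac{N}{N_k}$ and that the position terms are linear, $f(a_k) = \alpha T_a$ and $g(h_k) = \sigma H_a$. With arbitrary illustrative values $T_k = 5$, $D_j = 100$, $N = 1000$, $N_k = 10$, $T_a = 2$, $H_a = 3$, $\alpha = 0.1$ and $\sigma = 0.05$:

```latex
\begin{align*}
f_k &= \frac{T_k}{D_j}\,\ln\frac{N}{N_k}
     = \frac{5}{100}\,\ln\frac{1000}{10}
     \approx 0.05 \times 4.605 \approx 0.230,\\
f(a_k) &= \alpha\,T_a = 0.1 \times 2 = 0.2,\\
g(h_k) &= \sigma\,H_a = 0.05 \times 3 = 0.15,\\
W_k &= f_k + f(a_k) + g(h_k) \approx 0.230 + 0.2 + 0.15 = 0.580.
\end{align*}
```

Under these assumed values, the two position terms together contribute more than the frequency-based term, which shows how a word that is rare overall can still rank highly when it appears in the abstract and in paragraph first sentences.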
The scope of protection of the present invention is not limited to the above embodiment. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its scope and spirit. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.
Claims (4)
1. A keyword weight calculation method based on word position, characterized by comprising the following steps:
Document pre-processing: in a computer system, the provided document is converted to pdf format with a tool and pre-processed to obtain text information, specifically with the corresponding text parsing tool; a parsing tool parses the pages of the pdf document; after parsing, all page data of the pdf document are available; table-of-contents pages and page paragraphs are identified from table-of-contents and paragraph features, and the data are stored in a suitable form so that subsequent processing, such as word segmentation, can call them conveniently;
Keyword extraction: keywords are extracted from the pre-processed text information by comparison against an existing keyword table; extraction is carried out paragraph by paragraph for every paragraph of every page of the document, and the results are stored in the computer system;
Influence factor acquisition: weight factors are obtained for the extracted keywords; on the one hand, the basic weight is obtained; on the other hand, the words in the abstract and in the first sentence of each paragraph are identified;
Weighted calculation: the acquired influence factors are used as weight calculation factors to compute the final weight;
Output of the keyword weight table: the final list of keyword weights is output.
2. The keyword weight calculation method based on word position according to claim 1, characterized in that keyword extraction is specifically as follows: the ansj word segmentation component segments the pdf content paragraph by paragraph and extracts the keywords of each paragraph.
3. The keyword weight calculation method based on word position according to claim 1, characterized in that the detailed steps for obtaining the weight factors are:
Obtaining the basic weight: a weight calculation interface, such as the one provided by ansj or Lucene, calculates a weight for each extracted keyword, giving the basic weight f;
Identifying the words in the abstract and in the first sentence of each paragraph: the abstract and the paragraphs are identified from the features of the abstract section and of article paragraphs, and the corresponding abstract keywords and paragraph-first-sentence keywords are identified.
4. The keyword weight calculation method based on word position according to claim 1, characterized in that the weighted calculation is specifically as follows: on the basis of the basic weight, the words identified in paragraph first sentences and in the abstract are weighted, finally giving a position-weighted weight;
Weight calculation formula:
$$W_k(f_k, a_k, h_k) = f_k + f(a_k) + g(h_k)$$
where:
$tf(t_k, d_j)$: the frequency of word $k$ in document $j$.
$T_a$: the frequency of word $k$ in the abstract.
$H_a$: the frequency of word $k$ in the paragraph.
$\alpha$, $\sigma$: adjustment factors, whose values are obtained from a large number of statistical tests.
$T_k$: the number of times word $k$ occurs in the article.
$D_j$: the total number of words in document $j$.
$N$: the total number of documents.
$N_k$: the number of documents containing the word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410563853.6A CN105426379A (en) | 2014-10-22 | 2014-10-22 | Keyword weight calculation method based on position of word |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410563853.6A CN105426379A (en) | 2014-10-22 | 2014-10-22 | Keyword weight calculation method based on position of word |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105426379A (en) | 2016-03-23 |
Family
ID=55504592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410563853.6A Pending CN105426379A (en) | 2014-10-22 | 2014-10-22 | Keyword weight calculation method based on position of word |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105426379A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107092693A (en) * | 2017-04-25 | 2017-08-25 | 厦门众智创库企业管理咨询有限公司 | A kind of document keyword fast scanning method |
CN109062895A (en) * | 2018-07-23 | 2018-12-21 | 挖财网络技术有限公司 | A kind of intelligent semantic processing method |
CN109726400A (en) * | 2018-12-29 | 2019-05-07 | 新华网股份有限公司 | Entity word recognition result evaluation method, apparatus, equipment and entity word extraction system |
CN109977198A (en) * | 2019-04-01 | 2019-07-05 | 北京百度网讯科技有限公司 | Establish method and apparatus, the hardware device, computer-readable medium of mapping relations |
CN110413737A (en) * | 2019-07-29 | 2019-11-05 | 腾讯科技(深圳)有限公司 | A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym |
CN111611342A (en) * | 2020-04-09 | 2020-09-01 | 中南大学 | Method and device for obtaining lexical item and paragraph association weight |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101251841A (en) * | 2007-05-17 | 2008-08-27 | 华东师范大学 | Method for establishing and searching feature matrix of Web document based on semantics |
CN101710317A (en) * | 2009-11-17 | 2010-05-19 | 上海第二工业大学 | Word partial weight calculating method based on word distribution |
Non-Patent Citations (5)
Title |
---|
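RADA MIHALCEA et al.: "TextRank: Bringing Order into Texts", 《EMNLP》 * |
夏天: "Research on keyword extraction using TextRank with word position weighting", 《New Technology of Library and Information Service》 * |
杨林: "Research and implementation of text-based keyword extraction methods", 《China Master's Theses Full-text Database》 * |
石晶 et al.: "Topic word extraction method based on the LDA model", 《Computer Engineering》 * |
薛征 et al.: "Keyword extraction based on position weight and entity recognition", 《Proceedings of the Information Theory Branch of the Chinese Institute of Electronics》 * |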
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107092693A (en) * | 2017-04-25 | 2017-08-25 | 厦门众智创库企业管理咨询有限公司 | A kind of document keyword fast scanning method |
CN109062895A (en) * | 2018-07-23 | 2018-12-21 | 挖财网络技术有限公司 | A kind of intelligent semantic processing method |
CN109726400A (en) * | 2018-12-29 | 2019-05-07 | 新华网股份有限公司 | Entity word recognition result evaluation method, apparatus, equipment and entity word extraction system |
CN109977198A (en) * | 2019-04-01 | 2019-07-05 | 北京百度网讯科技有限公司 | Establish method and apparatus, the hardware device, computer-readable medium of mapping relations |
CN110413737A (en) * | 2019-07-29 | 2019-11-05 | 腾讯科技(深圳)有限公司 | A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym |
CN110413737B (en) * | 2019-07-29 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Synonym determination method, synonym determination device, server and readable storage medium |
CN111611342A (en) * | 2020-04-09 | 2020-09-01 | 中南大学 | Method and device for obtaining lexical item and paragraph association weight |
CN111611342B (en) * | 2020-04-09 | 2023-04-18 | 中南大学 | Method and device for obtaining lexical item and paragraph association weight |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
CN111783394B (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN107291783B (en) | Semantic matching method and intelligent equipment | |
CN105426379A (en) | Keyword weight calculation method based on position of word | |
CN109190110A (en) | A kind of training method of Named Entity Extraction Model, system and electronic equipment | |
CN106294396A (en) | Keyword expansion method and keyword expansion system | |
US10824816B2 (en) | Semantic parsing method and apparatus | |
CN103853834B (en) | Text structure analysis-based Web document abstract generation method | |
CN109446521B (en) | Named entity recognition method, named entity recognition device, electronic equipment and machine-readable storage medium | |
CN103377260B (en) | The analysis method and device of a kind of network log URL | |
CN106970912A (en) | Chinese sentence similarity calculating method, computing device and computer-readable storage medium | |
CN103823896A (en) | Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm | |
CN104598535A (en) | Event extraction method based on maximum entropy | |
CN104142912A (en) | Accurate corpus category marking method and device | |
CN103970666A (en) | Method for detecting repeated software defect reports | |
CN110019820B (en) | Method for detecting time consistency of complaints and symptoms of current medical history in medical records | |
CN110929520B (en) | Unnamed entity object extraction method and device, electronic equipment and storage medium | |
CN110741376A (en) | Automatic document analysis for different natural languages | |
CN107861944A (en) | A kind of text label extracting method and device based on Word2Vec | |
CN110728117A (en) | Paragraph automatic identification method and system based on machine learning and natural language processing | |
CN104281694A (en) | Analysis system of emotional tendency of text | |
CN110826330B (en) | Name recognition method and device, computer equipment and readable storage medium | |
CN109344390A (en) | A method of the card language Entity recognition based on multiple features neural network | |
CN109241521A (en) | A kind of high attention rate sentence extracting method of scientific and technical literature based on adduction relationship | |
CN105608136B (en) | A kind of semantic relevancy calculation method based on Chinese complex sentence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160323 |