CN109255014A

CN109255014A - Recognition method for improving file keyword accuracy based on multiple algorithms

Info

Publication number: CN109255014A
Application number: CN201811210049.4A
Authority: CN
Inventors: 张永静; 张彤; 郝佳; 高晓琼; 李世成; 郑春; 郑春一; 李景田; 司敬; 徐海; 左晓辉
Original assignee: Beijing Jinghang Computing Communication Research Institute
Current assignee: Beijing Jinghang Computing Communication Research Institute
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2019-01-22

Abstract

The invention belongs to the technical field of keyword retrieval, and in particular relates to a recognition method that improves the accuracy of file keyword recognition based on multiple algorithms. The keyword hit counts of the algorithms are compared; the weight ratio assigned to each algorithm can be configured by the user or left at the default configuration; the hit counts are then combined according to the weight ratio of each algorithm, and the outcome is taken as the final result. The algorithms used to accurately identify and extract keywords from files are: a Chinese keyword extraction algorithm using a disjunctive model, a Chinese keyword extraction algorithm based on high-dimensional clustering, a semantic-based Chinese text keyword extraction algorithm, and a Chinese keyword extraction algorithm based on a naive Bayes model. In this way, the accuracy of file keyword recognition in the field of keyword retrieval is improved by combining multiple algorithms.

Description

Recognition method for improving file keyword accuracy based on multiple algorithms

Technical field

The invention belongs to the technical field of keyword retrieval, and in particular relates to a recognition method that improves the accuracy of file keyword recognition based on multiple algorithms.

Background technique

In natural language processing, the key to handling massive numbers of text files is to extract the content that users care about most. Whether a text is long or short, its main theme can often be glimpsed from a few keywords. At the same time, both text-based recommendation and text-based search depend heavily on text keywords, and the accuracy of keyword extraction directly affects the final performance of a recommendation or search system. Keyword extraction is therefore an important part of text mining.

Keyword recognition and retrieval is a related technology that, based on a unified policy, uses deep content analysis to identify, monitor, and protect static data, dynamic data, and data in use in real time.

Most current schemes mainly use the disjunctive model algorithm to extract key words and key word strings. Existing technical solutions rely on a single algorithm, yet each algorithm has its own strengths and characteristics, so computing keywords with a single algorithm cannot avoid the drawbacks of that algorithm. The accuracy of the keyword recognition techniques currently on the market therefore needs to be improved.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is how to overcome the current inability, caused by reliance on a single algorithm, to combine multiple scanning results into an accurate, comprehensive analysis.

(2) Technical solution

To solve the above technical problem, the present invention provides a recognition method for improving file keyword accuracy based on multiple algorithms. The recognition method is implemented on a recognition system, which comprises: an original text input module, a text preprocessing module, a Chinese keyword extraction module based on the disjunctive model, a Chinese keyword extraction module based on high-dimensional clustering, a semantic-based Chinese keyword extraction module, a Chinese keyword extraction module based on the naive Bayes model, an algorithm weight ratio allocation module, and a keyword recognition result generation module. Specifically,

The recognition method includes the following steps:

Step 1: The original text on which keyword recognition is to be performed is input through the original text input module;

Step 2: The text preprocessing module performs text format conversion preprocessing on the original text input by the original text input module, forming candidate words for the subsequent recognition algorithms to process;

Step 3: The Chinese keyword extraction module based on the disjunctive model performs, based on the disjunctive model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on the disjunctive model, and obtains keyword extraction count information;

Step 4: The Chinese keyword extraction module based on high-dimensional clustering performs, based on high-dimensional clustering, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on high-dimensional clustering, and obtains keyword extraction count information;

Step 5: The semantic-based Chinese keyword extraction module applies the semantic-based Chinese text keyword extraction algorithm to the candidate words from the text preprocessing module, performs key word extraction and key word string extraction, generates a semantic-based calculation result, and obtains keyword extraction count information;

Step 6: The Chinese keyword extraction module based on the naive Bayes model performs, based on the naive Bayes model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on the naive Bayes model, and obtains keyword extraction count information;

Step 7: The algorithm weight ratio allocation module configures the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result;

Step 8: The keyword recognition result generation module compares the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model, and combines them according to the preconfigured weight ratio to obtain the final keyword recognition result.
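Steps 1 to 8 can be sketched as a small pipeline. This is only an illustrative sketch: the source gives no implementation, so the function names (`preprocess`, `frequency_extractor`, `recognize`), the dictionary keys, and the frequency-threshold stand-ins for the four extraction algorithms are all assumptions. Only the module roles and the 2:3:4:3 default ratio come from the text.

```python
from collections import Counter
from typing import Callable, Dict, List, Mapping, Tuple

# An extractor maps candidate words to per-keyword hit counts (Steps 3-6).
Extractor = Callable[[List[str]], Dict[str, int]]

# Default 2:3:4:3 ratio from the source; the key names are hypothetical.
DEFAULT_WEIGHTS = {"disjunctive": 2, "clustering": 3, "semantic": 4, "naive_bayes": 3}


def preprocess(raw_text: str) -> List[str]:
    """Step 2 stand-in: split already-converted text into candidate words.

    Real Word/PDF conversion and Chinese word segmentation are out of scope.
    """
    return raw_text.split()


def frequency_extractor(min_count: int) -> Extractor:
    """Placeholder for a real algorithm: keep words seen at least min_count times."""
    def extract(candidates: List[str]) -> Dict[str, int]:
        counts = Counter(candidates)
        return {word: n for word, n in counts.items() if n >= min_count}
    return extract


def recognize(
    raw_text: str,
    extractors: Mapping[str, Extractor],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Steps 2-8: run every extractor and combine hit counts by weight."""
    candidates = preprocess(raw_text)
    scores: Counter = Counter()
    for name, extract in extractors.items():
        for word, hits in extract(candidates).items():
            scores[word] += weights[name] * hits
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


extractors = {
    "disjunctive": frequency_extractor(1),
    "clustering": frequency_extractor(2),
    "semantic": frequency_extractor(2),
    "naive_bayes": frequency_extractor(3),
}
result = recognize("radar radar radar signal signal antenna", extractors)
print(result)  # [('radar', 36), ('signal', 18), ('antenna', 2)]
```

Keeping each extractor behind the same callable signature means an algorithm can be swapped or reweighted without touching the combination step, which matches the modular layout the method describes.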

Wherein the Chinese keyword extraction module based on the disjunctive model uses the Chinese keyword extraction algorithm based on the disjunctive model, treats keyword recognition and extraction as a classification problem, and classifies each candidate keyword in the text as a keyword or a non-keyword.

Wherein the disjunctive model builds separate models for key words and key word strings, and each of the separately built models selects different features in keyword feature selection.

Wherein the Chinese keyword extraction module based on high-dimensional clustering extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.

Wherein the semantic-based Chinese keyword extraction module incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words.

Wherein the Chinese keyword extraction module based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the testing process.

Wherein the algorithm weight ratio allocation module uses the ratio 2:3:4:3 to determine the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result.

Wherein the 2:3:4:3 weight ratio is the default configuration.

Wherein the weight ratio is configured by the user according to the specific application scenario.

Wherein the formats of the original text include Word format and PDF format.

(3) Beneficial effects

Compared with the prior art, the present invention combines the Chinese keyword extraction algorithm using the disjunctive model, the Chinese keyword extraction algorithm based on high-dimensional clustering, the semantic-based Chinese text keyword extraction algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and makes a comprehensive matching judgment across them to improve the accuracy of keyword extraction and recognition.

The keyword hit counts of the algorithms are compared. By default, the recognition result is calculated with a weight ratio of 2:3:4:3 across the algorithms; the weights can also be configured according to the specific application scenario. The hit counts are combined according to the weight ratio of each algorithm, and the outcome is taken as the final result.
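The source gives no explicit formula for the weighted calculation. One plausible reading, assuming the ratio terms act as raw multipliers on the hit counts (the symbols S, w_i, and h_i are introduced here for illustration only), is:

```latex
S(k) = \sum_{i=1}^{4} w_i \, h_i(k), \qquad (w_1, w_2, w_3, w_4) = (2, 3, 4, 3)
```

Here h_i(k) is the number of hits that algorithm i records for keyword k, with the algorithms taken in the order disjunctive model, high-dimensional clustering, semantic-based, naive Bayes. For example, a keyword with hit counts (3, 2, 4, 1) would score 2·3 + 3·2 + 4·4 + 3·1 = 31 under the default ratio. Dividing every weight by the same constant would not change the ranking of keywords.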

In this way, the accuracy of file keyword recognition in the field of keyword retrieval is improved by combining multiple algorithms.

Description of the drawings

Fig. 1 is a schematic diagram of the technical solution of the present invention.

Detailed description of the embodiments

To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawing and examples.

The recognition method comprises Steps 1 to 8 as set out in the technical solution above.

Wherein the formats of the original text include Word format and PDF format.

In addition, the present invention also provides a recognition system for improving file keyword accuracy based on multiple algorithms. As shown in Fig. 1, the recognition system comprises:

An original text input module, used to input the original text on which keyword recognition is to be performed;

A text preprocessing module, used to perform text format conversion preprocessing on the original text input by the original text input module, forming candidate words for the subsequent recognition algorithms to process;

A Chinese keyword extraction module based on the disjunctive model, used to perform, based on the disjunctive model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generate a calculation result based on the disjunctive model, and obtain keyword extraction count information;

A Chinese keyword extraction module based on high-dimensional clustering, used to perform, based on high-dimensional clustering, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generate a calculation result based on high-dimensional clustering, and obtain keyword extraction count information;

A semantic-based Chinese keyword extraction module, used to apply the semantic-based Chinese text keyword extraction (SKE) algorithm to the candidate words from the text preprocessing module, perform key word extraction and key word string extraction, generate a semantic-based calculation result, and obtain keyword extraction count information;

A Chinese keyword extraction module based on the naive Bayes model, used to perform, based on the naive Bayes model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generate a calculation result based on the naive Bayes model, and obtain keyword extraction count information;

An algorithm weight ratio allocation module, used to configure, according to the specific application scenario, the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result;

A keyword recognition result generation module, used to compare the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model, and to combine them according to the preconfigured weight ratio to obtain the final keyword recognition result.

Wherein the Chinese keyword extraction module based on the disjunctive model uses the Chinese keyword extraction algorithm based on the disjunctive model, treats keyword recognition and extraction as a classification problem, and classifies each candidate keyword in the text as a keyword or a non-keyword;

Wherein the disjunctive model builds separate models for key words and key word strings, and each of the separately built models selects different features in keyword feature selection.

Key word extraction and key word string extraction use different features to improve extraction accuracy. This is the most commonly used algorithm for keyword recognition, and its calculation result accounts for 2/10 of the weight in the result computation.

Wherein the Chinese keyword extraction module based on high-dimensional clustering uses the Chinese keyword extraction algorithm based on high-dimensional clustering, which was proposed to address the low accuracy of keyword extraction methods based on statistical information. It extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.

Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering offers better stability, higher efficiency, and more accurate results. The algorithm is fast and its recognition accuracy is high; its calculation result accounts for 3/10 of the weight in the result computation.

Wherein the semantic-based Chinese keyword extraction module uses the semantic-based Chinese text keyword extraction (SKE) algorithm, which incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words.

Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance. Its keyword recognition accuracy is high, and its calculation result accounts for 4/10 of the weight in the result computation.
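The idea of scoring words by their position in a semantic similarity network can be illustrated as follows. This is a rough sketch under stated assumptions, not the SKE algorithm itself: the similarity values and the threshold are made up, the function names are hypothetical, and the sketch stops at plain betweenness centrality (computed with Brandes' algorithm) because the source does not specify how the density step of its measure is derived.

```python
from collections import deque
from typing import Dict, Set, Tuple


def build_network(similarity: Dict[Tuple[str, str], float],
                  threshold: float) -> Dict[str, Set[str]]:
    """Link two words when their similarity reaches the threshold."""
    graph: Dict[str, Set[str]] = {}
    for (a, b), sim in similarity.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if sim >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}   # -1 marks "not yet visited"
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical similarities; with threshold 0.5 the network is the chain a-b-c-d.
similarity = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.7, ("a", "d"): 0.1}
scores = betweenness(build_network(similarity, threshold=0.5))
```

In this toy chain the two middle words receive the highest scores, since every shortest path between the end words passes through them.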

Wherein the Chinese keyword extraction module based on the naive Bayes model uses the Chinese keyword extraction algorithm based on the naive Bayes model. It first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the testing process. Experiments show that, compared with traditional methods, the algorithm can extract more accurate keywords from small-scale document sets and can flexibly add feature items that characterize word importance, giving it better scalability. Its keyword recognition accuracy is very high on small documents, and its calculation result accounts for 3/10 of the weight in the result computation.
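The train-then-classify flow described above can be sketched with a small categorical naive Bayes classifier. The feature names (`in_title`, `freq`), the training samples, and the use of Laplace smoothing are assumptions made for illustration; the source does not list the features or parameters it uses.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], bool]  # (feature values, is the word a keyword?)


def train(samples: List[Sample]):
    """Training phase: count classes and feature values per class."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (label, feature, value) -> count
    values = defaultdict(set)          # feature -> set of observed values
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name, value)] += 1
            values[name].add(value)
    return dict(class_counts), dict(feature_counts), dict(values)


def log_posterior(model, features: Dict[str, str], label: bool) -> float:
    """Unnormalized log P(label | features) with Laplace smoothing."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    logp = math.log(class_counts[label] / total)
    for name, value in features.items():
        numerator = feature_counts.get((label, name, value), 0) + 1
        denominator = class_counts[label] + len(values.get(name, ()))
        logp += math.log(numerator / denominator)
    return logp


def is_keyword(model, features: Dict[str, str]) -> bool:
    """Testing phase: pick the class with the larger posterior."""
    return log_posterior(model, features, True) > log_posterior(model, features, False)


# Hypothetical labeled candidates; both classes must appear in the training data.
samples = [
    ({"in_title": "yes", "freq": "high"}, True),
    ({"in_title": "yes", "freq": "low"}, True),
    ({"in_title": "no", "freq": "high"}, True),
    ({"in_title": "no", "freq": "low"}, False),
    ({"in_title": "no", "freq": "low"}, False),
    ({"in_title": "yes", "freq": "low"}, False),
]
model = train(samples)
```

Because each feature contributes one independent term to the posterior, adding a new importance feature only requires adding another key to the feature dictionaries, which is the kind of extensibility the paragraph above attributes to the algorithm.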

Wherein the algorithm weight ratio allocation module uses the ratio 2:3:4:3 to determine the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result.

Wherein the 2:3:4:3 weight ratio is the default configuration.

Wherein the weight ratio is configured by the user according to the specific application scenario.
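A default ratio with an optional scenario-specific override could be handled as in the sketch below. The ratio-string format, the algorithm key names, and the validation rules are assumptions for illustration; the source only states that 2:3:4:3 is the default and that the ratio may be configured by the user.

```python
from typing import Dict, Optional

# Order follows the source: disjunctive model, high-dimensional clustering,
# semantic-based, naive Bayes. The key names themselves are hypothetical.
ALGORITHMS = ("disjunctive", "clustering", "semantic", "naive_bayes")
DEFAULT_RATIO = "2:3:4:3"


def parse_weights(ratio: Optional[str] = None) -> Dict[str, float]:
    """Turn a ratio string into per-algorithm weights, using the default if none is given."""
    parts = (ratio or DEFAULT_RATIO).split(":")
    if len(parts) != len(ALGORITHMS):
        raise ValueError(f"expected {len(ALGORITHMS)} ratio terms, got {len(parts)}")
    weights = [float(p) for p in parts]
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("ratio terms must be non-negative and not all zero")
    return dict(zip(ALGORITHMS, weights))


default_weights = parse_weights()          # default configuration
custom_weights = parse_weights("1:1:5:3")  # hypothetical scenario-specific override
```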

Wherein the formats of the original text include Word format and PDF format.

Embodiment 1

This embodiment provides a method for improving the accuracy of file keyword recognition based on multiple algorithms. A file is processed and parsed for keywords with the Chinese keyword extraction algorithm using the disjunctive model, the Chinese keyword extraction algorithm based on high-dimensional clustering, the semantic-based Chinese text keyword extraction (SKE) algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and accuracy is improved by a weighted judgment.

Wherein the Chinese keyword extraction algorithm based on the disjunctive model performs key word extraction and key word string extraction; for these two problems, the algorithm designs different features to improve extraction accuracy.

Wherein the Chinese keyword extraction algorithm based on high-dimensional clustering was proposed to address the low accuracy of keyword extraction methods based on statistical information. The algorithm extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection. Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering offers better stability, higher efficiency, and more accurate results.

Wherein the semantic-based Chinese text keyword extraction (SKE) algorithm incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words. Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance.

Wherein the Chinese keyword extraction algorithm based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the testing process. Experiments show that, compared with the traditional tf*idf method, the algorithm can extract more accurate keywords from small-scale document sets and can flexibly add feature items that characterize word importance, giving it better scalability.

Each algorithm extracts keywords so that keyword count extraction information for a file or folder can be obtained accurately. The keyword hit counts of the algorithms are compared. By default, the recognition result is calculated with a weight ratio of 2:3:4:3 across the algorithms; the weights can also be configured according to the specific application scenario. The hit counts are combined according to the weight ratio of each algorithm, and the outcome is taken as the final result.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and variations without departing from the technical principles of the invention, and these improvements and variations shall also be regarded as falling within the protection scope of the invention.

Claims

1. A recognition method for improving file keyword accuracy based on multiple algorithms, characterized in that the recognition method is implemented on a recognition system, the recognition system comprising: an original text input module, a text preprocessing module, a Chinese keyword extraction module based on the disjunctive model, a Chinese keyword extraction module based on high-dimensional clustering, a semantic-based Chinese keyword extraction module, a Chinese keyword extraction module based on the naive Bayes model, an algorithm weight ratio allocation module, and a keyword recognition result generation module; specifically,

The recognition method includes the following steps:

Step 2: The text preprocessing module performs text format conversion preprocessing on the original text input by the original text input module, forming candidate words for the subsequent recognition algorithms to process;

Step 3: The Chinese keyword extraction module based on the disjunctive model performs, based on the disjunctive model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on the disjunctive model, and obtains keyword extraction count information;

Step 4: The Chinese keyword extraction module based on high-dimensional clustering performs, based on high-dimensional clustering, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on high-dimensional clustering, and obtains keyword extraction count information;

Step 5: The semantic-based Chinese keyword extraction module applies the semantic-based Chinese text keyword extraction algorithm to the candidate words from the text preprocessing module, performs key word extraction and key word string extraction, generates a semantic-based calculation result, and obtains keyword extraction count information;

Step 6: The Chinese keyword extraction module based on the naive Bayes model performs, based on the naive Bayes model, key word extraction and key word string extraction on the candidate words from the text preprocessing module, generates a calculation result based on the naive Bayes model, and obtains keyword extraction count information;

Step 7: The algorithm weight ratio allocation module configures the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result;

Step 8: The keyword recognition result generation module compares the keyword hit counts in the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model, and combines them according to the preconfigured weight ratio to obtain the final keyword recognition result.

2. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the Chinese keyword extraction module based on the disjunctive model uses the Chinese keyword extraction algorithm based on the disjunctive model, treats keyword recognition and extraction as a classification problem, and classifies each candidate keyword in the text as a keyword or a non-keyword.

3. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 2, characterized in that the disjunctive model builds separate models for key words and key word strings, and each of the separately built models selects different features in keyword feature selection.

4. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the Chinese keyword extraction module based on high-dimensional clustering extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.

5. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the semantic-based Chinese keyword extraction module incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words.

6. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the Chinese keyword extraction module based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the testing process.

7. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the algorithm weight ratio allocation module uses the ratio 2:3:4:3 to determine the weight ratio that each of the calculation result based on the disjunctive model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result of the naive Bayes model carries in the computation that generates the final keyword result.

8. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 7, characterized in that the 2:3:4:3 weight ratio is the default configuration.

9. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the weight ratio is configured by the user according to the specific application scenario.

10. The recognition method for improving file keyword accuracy based on multiple algorithms according to claim 1, characterized in that the formats of the original text include Word format and PDF format.