Summary of the Invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is that existing keyword recognition relies on a single algorithm and therefore cannot combine multiple scan results into an accurate, comprehensive analysis.
(2) Technical solution
To solve the above technical problem, the present invention provides a recognition method that improves the accuracy of file keyword recognition based on multiple algorithms. The recognition method is implemented on a recognition system comprising: an original text input module, a text preprocessing module, a Chinese keyword extraction module based on a separation model, a Chinese keyword extraction module based on high-dimensional clustering, a semantic-based Chinese keyword extraction module, a Chinese keyword extraction module based on a naive Bayes model, an algorithm weight ratio allocation module, and a keyword recognition result generation module. Specifically,
The recognition method comprises the following steps:
Step 1: the original text on which keyword recognition is to be performed is input through the original text input module;
Step 2: the text preprocessing module performs text format conversion preprocessing on the original text input through the original text input module, forming candidate words for the subsequent recognition algorithms to process;
Step 3: the Chinese keyword extraction module based on the separation model performs, based on the separation model, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the separation model, and obtains keyword count extraction information;
Step 4: the Chinese keyword extraction module based on high-dimensional clustering performs, based on high-dimensional clustering, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on high-dimensional clustering, and obtains keyword count extraction information;
Step 5: the semantic-based Chinese keyword extraction module applies the semantic-based Chinese text keyword extraction algorithm to the candidate words from the text preprocessing module, performs key word extraction and key word-string extraction, generates the semantic-based calculation result, and obtains keyword count extraction information;
Step 6: the Chinese keyword extraction module based on the naive Bayes model performs, based on the naive Bayes model, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generates the calculation result based on the naive Bayes model, and obtains keyword count extraction information;
Step 7: the algorithm weight ratio allocation module configures the weight ratio that each of the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model carries in the computation that generates the final keyword result;
Step 8: the keyword recognition result generation module compares the number of keyword hits in each of the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model, and computes the final keyword recognition result in accordance with the preconfigured weight ratio.
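Steps 7 and 8 can be illustrated with a minimal Python sketch. The algorithm names, the sample hit counts, and the function `combine_hits` are hypothetical. The text does not say how the ratio is turned into a score, so dividing by the sum of the ratio parts is an assumption of this sketch, not the claimed computation.

```python
from collections import defaultdict

# Default weight ratio from the text, in the order: separation model,
# high-dimensional clustering, semantic, naive Bayes.
# The dictionary keys are hypothetical labels chosen for this sketch.
DEFAULT_WEIGHTS = {"separation": 2, "clustering": 3, "semantic": 4, "bayes": 3}


def combine_hits(hits_by_algorithm, weights=DEFAULT_WEIGHTS, top_n=None):
    """Combine per-algorithm keyword hit counts into one ranked list.

    hits_by_algorithm maps an algorithm label to {keyword: hit_count}.
    Normalizing by the sum of the ratio parts is an assumption.
    """
    total = sum(weights[name] for name in hits_by_algorithm)
    scores = defaultdict(float)
    for name, hits in hits_by_algorithm.items():
        for keyword, count in hits.items():
            scores[keyword] += weights[name] * count / total
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n] if top_n is not None else ranked


# Made-up hit counts, for illustration only.
hits = {
    "separation": {"keyword": 3, "cluster": 1},
    "clustering": {"keyword": 2, "semantic": 2},
    "semantic": {"keyword": 4, "semantic": 3},
    "bayes": {"cluster": 2, "keyword": 1},
}
result = combine_hits(hits, top_n=2)
```

With these made-up counts, a word that several algorithms hit outranks a word hit often by a single low-weight algorithm, which is the effect the weighting is meant to achieve.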
Wherein, the Chinese keyword extraction module based on the separation model uses a Chinese keyword extraction algorithm based on the separation model, which treats keyword recognition and extraction as a classification problem: each candidate keyword in the text is classified as a keyword or a non-keyword.
Wherein, the separation model builds separate models for key words and key word-strings, and each of these models selects different features in keyword feature selection.
Wherein, the Chinese keyword extraction module based on high-dimensional clustering extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.
Wherein, the semantic-based Chinese keyword extraction module incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words.
Wherein, the Chinese keyword extraction module based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the test process.
Wherein, the algorithm weight ratio allocation module sets, in the ratio 2:3:4:3, the weights that the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the computation that generates the final keyword result.
Wherein, the weight ratio 2:3:4:3 is the default configuration.
Wherein, the weight ratio can be configured by the user according to the specific application scenario.
Wherein, the formats of the original text include WORD format and PDF format.
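The default ratio and its scenario-specific override could be handled as in the following sketch. The ratio-string format, the algorithm labels, and the function `parse_weight_ratio` are assumptions made for illustration; the text only states that 2:3:4:3 is the default and that the ratio is configurable.

```python
# Hypothetical labels, in the order the text lists the four algorithms.
ALGORITHMS = ("separation", "clustering", "semantic", "bayes")
DEFAULT_RATIO = "2:3:4:3"


def parse_weight_ratio(ratio=DEFAULT_RATIO):
    """Turn a ratio string such as "2:3:4:3" into {algorithm: weight}."""
    parts = ratio.split(":")
    if len(parts) != len(ALGORITHMS):
        raise ValueError(f"expected {len(ALGORITHMS)} parts, got {len(parts)}")
    weights = [float(p) for p in parts]
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return dict(zip(ALGORITHMS, weights))


default_weights = parse_weight_ratio()           # default configuration
custom_weights = parse_weight_ratio("1:1:5:3")   # scenario-specific override
```

Validating the number of parts up front keeps a mistyped ratio from silently dropping one of the four algorithms.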
(3) Beneficial effects
Compared with the prior art, the present invention combines the Chinese keyword extraction algorithm based on the separation model, the Chinese keyword extraction algorithm based on high-dimensional clustering, the semantic-based Chinese text keyword extraction algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and makes a comprehensive matching judgment, thereby improving the accuracy of keyword extraction and recognition.
The keyword hit counts of the algorithms are compared. By default, the recognition result is computed with the weight ratio 2:3:4:3 configured for the algorithms; the weights can also be configured according to the specific application scenario. The hit counts are weighted according to each algorithm's weight ratio, and the outcome is taken as the final result.
In this way, in the technical field of keyword retrieval, the accuracy of file keyword recognition is improved by means of multiple algorithms.
Detailed Description
To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing and examples.
In addition, the present invention also provides a recognition system that improves the accuracy of file keyword recognition based on multiple algorithms. As shown in Fig. 1, the recognition system comprises:
an original text input module, configured to input the original text on which keyword recognition is to be performed;
a text preprocessing module, configured to perform text format conversion preprocessing on the original text input through the original text input module, forming candidate words for the subsequent recognition algorithms to process;
a Chinese keyword extraction module based on a separation model, configured to perform, based on the separation model, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on the separation model, and obtain keyword count extraction information;
a Chinese keyword extraction module based on high-dimensional clustering, configured to perform, based on high-dimensional clustering, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on high-dimensional clustering, and obtain keyword count extraction information;
a semantic-based Chinese keyword extraction module, configured to apply the semantic-based Chinese text keyword extraction (SKE) algorithm to the candidate words from the text preprocessing module, perform key word extraction and key word-string extraction, generate the semantic-based calculation result, and obtain keyword count extraction information;
a Chinese keyword extraction module based on a naive Bayes model, configured to perform, based on the naive Bayes model, key word extraction and key word-string extraction on the candidate words from the text preprocessing module, generate the calculation result based on the naive Bayes model, and obtain keyword count extraction information;
an algorithm weight ratio allocation module, configured to set, according to the specific application scenario, the weight ratio that each of the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model carries in the computation that generates the final keyword result;
a keyword recognition result generation module, configured to compare the number of keyword hits in each of the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model, and to compute the final keyword recognition result in accordance with the preconfigured weight ratio.
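The module structure listed above might be wired together as in the following sketch. The class `RecognitionSystem`, its field names, and the two toy extractors are hypothetical stand-ins: the real extraction modules implement the four algorithms described in the text, and the unnormalized weighted sum used here is only one possible reading of the result generation module.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An extractor takes candidate words and returns {keyword: hit_count}.
Extractor = Callable[[List[str]], Dict[str, int]]


@dataclass
class RecognitionSystem:
    preprocess: Callable[[str], List[str]]   # text preprocessing module
    extractors: Dict[str, Extractor]         # keyword extraction modules
    weights: Dict[str, float] = field(default_factory=dict)  # weight ratio allocation

    def run(self, original_text: str) -> Dict[str, float]:
        """Result generation: weighted sum of hit counts per candidate word."""
        candidates = self.preprocess(original_text)
        scores: Dict[str, float] = {}
        for name, extract in self.extractors.items():
            weight = self.weights.get(name, 1.0)
            for word, hits in extract(candidates).items():
                scores[word] = scores.get(word, 0.0) + weight * hits
        return scores


# Toy extractors standing in for the real algorithms.
def count_all(words):
    return dict(Counter(words))


def count_long(words):
    return dict(Counter(w for w in words if len(w) > 5))


system = RecognitionSystem(
    preprocess=lambda text: text.lower().split(),
    extractors={"a": count_all, "b": count_long},
    weights={"a": 2, "b": 3},
)
scores = system.run("Keyword extraction finds keyword candidates")
```

Keeping each module behind a plain callable means an extraction algorithm can be replaced, or a fifth one added, without touching the result generation logic.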
Wherein, the Chinese keyword extraction module based on the separation model uses a Chinese keyword extraction algorithm based on the separation model, which treats keyword recognition and extraction as a classification problem: each candidate keyword in the text is classified as a keyword or a non-keyword.
Wherein, the separation model builds separate models for key words and key word-strings, and each of these models selects different features in keyword feature selection.
Key word extraction and key word-string extraction use different features to improve extraction accuracy. This algorithm is the most commonly used keyword recognition algorithm, and its calculation result accounts for 2/10 of the result computation.
Wherein, the Chinese keyword extraction module based on high-dimensional clustering addresses the low accuracy of keyword extraction methods based on statistical information by using a Chinese keyword extraction algorithm based on high-dimensional clustering. Keywords are extracted in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection.
Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering has better stability, higher efficiency, and more accurate results. The algorithm is fast and has high recognition accuracy, and its calculation result accounts for 3/10 of the result computation.
Wherein, the semantic-based Chinese keyword extraction module uses the semantic-based Chinese text keyword extraction (SKE) algorithm, which incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words.
Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance. The algorithm recognizes keywords with high accuracy, and its calculation result accounts for 4/10 of the result computation.
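The idea of ranking words by their position in a semantic similarity network can be sketched as follows. The text does not define its betweenness density measure, so this sketch substitutes plain betweenness centrality (computed with Brandes' algorithm on an unweighted graph) as a simpler stand-in. The similarity scores, the threshold, and both function names are hypothetical.

```python
from collections import deque


def build_similarity_network(similarity, threshold=0.5):
    """Link two words when their (hypothetical) similarity reaches the threshold."""
    adj = {}
    for (u, v), sim in similarity.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if sim >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph (Brandes)."""
    centrality = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Made-up similarity scores, for illustration only.
similarity = {
    ("network", "graph"): 0.9,
    ("network", "node"): 0.8,
    ("network", "edge"): 0.7,
    ("graph", "poem"): 0.1,
}
scores = betweenness(build_similarity_network(similarity))
most_central = max(scores, key=scores.get)
```

In this toy network the word that connects the other related words lies on every shortest path between them and therefore receives the highest score, while the unrelated word scores zero.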
Wherein, the Chinese keyword extraction module based on the naive Bayes model uses a Chinese keyword extraction algorithm based on the naive Bayes model, which first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the test process. Experiments show that, relative to traditional methods, the algorithm can extract more accurate keywords from small document sets and can flexibly add feature items that characterize word importance, giving it better extensibility. The algorithm recognizes keywords with high accuracy in small documents, and its calculation result accounts for 3/10 of the result computation.
Wherein, the algorithm weight ratio allocation module sets, in the ratio 2:3:4:3, the weights that the calculation result based on the separation model, the calculation result based on high-dimensional clustering, the semantic-based calculation result, and the calculation result based on the naive Bayes model each carry in the computation that generates the final keyword result.
Wherein, the weight ratio 2:3:4:3 is the default configuration.
Wherein, the weight ratio can be configured by the user according to the specific application scenario.
Wherein, the formats of the original text include WORD format and PDF format.
Embodiment 1
This embodiment provides a method for improving the recognition accuracy of file keywords based on multiple algorithms. A file is parsed for keywords using the Chinese keyword extraction algorithm based on the separation model, the Chinese keyword extraction algorithm based on high-dimensional clustering, the semantic-based Chinese text keyword extraction (SKE) algorithm, and the Chinese keyword extraction algorithm based on the naive Bayes model, and the results are judged by weight to improve accuracy.
Wherein, the Chinese keyword extraction algorithm based on the separation model performs key word extraction and key word-string extraction, and designs different features for these two problems to improve extraction accuracy.
Wherein, the Chinese keyword extraction algorithm based on high-dimensional clustering addresses the low accuracy of keyword extraction methods based on statistical information. The algorithm extracts keywords in four steps: fast word segmentation based on a small dictionary, secondary word segmentation, high-dimensional clustering, and keyword selection. Theoretical analysis and experiments show that the Chinese keyword extraction method based on high-dimensional clustering has better stability, higher efficiency, and more accurate results.
Wherein, the semantic-based Chinese text keyword extraction (SKE) algorithm incorporates the semantic features of words into the keyword extraction process, constructs a word semantic similarity network, and uses betweenness density to measure the semantic importance of words. Compared with keyword extraction algorithms based on statistical features, the SKE algorithm extracts keywords with better performance.
Wherein, the Chinese keyword extraction algorithm based on the naive Bayes model first obtains the parameters of the naive Bayes model through a training process and then, on that basis, completes keyword extraction in the test process. Experiments show that, relative to the traditional tf*idf method, the algorithm can extract more accurate keywords from small document sets and can flexibly add feature items that characterize word importance, giving it better extensibility.
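For reference, the tf*idf baseline against which the naive Bayes algorithm is compared can be sketched as follows. This uses one common variant (term frequency normalized by document length, natural-log inverse document frequency); the function name `tf_idf` and the sample documents are hypothetical, and the text does not state which variant was used in the experiments.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Two tiny made-up documents, for illustration only.
docs = [["keyword", "extraction", "keyword"], ["text", "extraction"]]
baseline = tf_idf(docs)
```

A term that occurs in every document gets a score of zero under this variant, which is one reason purely statistical scoring can struggle on small document sets.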
Each algorithm extracts keywords so that keyword count extraction information for the file or folder is obtained accurately. The keyword hit counts of the algorithms are compared. By default, the recognition result is computed with the weight ratio 2:3:4:3 configured for the algorithms; the weights can also be configured according to the specific application scenario. The hit counts are weighted according to each algorithm's weight ratio, and the outcome is taken as the final result.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the technical principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.