CN104899262B - An information categorization method supporting user-defined categorization rules

Info

Publication number: CN104899262B
Application number: CN201510262625.XA
Authority: CN
Inventors: 叶俊民; 祝黄建; 叶竹君; 陈曙
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2015-05-22
Filing date: 2015-05-22
Publication date: 2017-12-22
Anticipated expiration: 2035-05-22
Also published as: CN104899262A

Abstract

The invention belongs to the field of database applications and more particularly relates to a method for categorizing information in a database that supports user-defined categorization rules. Based on categorization rules customized by the user, the method supports sentence retrieval against the database and obtains information content that is close or similar to the query sentence, or that has a potential relationship with it. The method helps users obtain more comprehensive information.

Description

An information categorization method supporting user-defined categorization rules

Technical field

The invention belongs to the field of database applications, and in particular relates to a method for categorizing information in a database that supports user-defined categorization rules.

Background art

Information categorization means organizing database information into categories under a given structural system, for a given purpose and guided by given classification principles and methods, according to the content and nature of the information and the relevant requirements.

Information categorization works as follows. First, the information is stored in a database and its key content is extracted to serve as the basis for categorization. Second, categorization rules are defined according to the relevant requirements. Third, information in the database with identical or similar content is grouped together according to the categorization rules.

The technology related to information categorization is information retrieval. For a database, retrieval usually means performing an exact or fuzzy search on the search keywords entered by the user, obtaining the information that matches the search content, and returning that information to the user.

At present, whether an exact search or a fuzzy search is performed on a database, keyword-based retrieval is used. Such retrieval cannot obtain information content that is close or similar to the search content, nor content that has a potential relationship with it.

Summary of the invention

The purpose of the present invention is to overcome the above shortcomings of the prior art by providing an information categorization method that supports user-defined categorization rules and sentence-oriented database retrieval, thereby categorizing information that is related, close, or potentially related.

The present invention is an information categorization method supporting user-defined categorization rules. Based on the categorization rules customized by the user, it supports sentence retrieval against a database and obtains information content close or similar to the query sentence. It comprises the following steps:

(1) Information categorization rule modeling. The rules used for information categorization are described by a graph. Each node in the graph represents one keyword and carries the keyword content and the keyword weight. Each edge represents the relation between two keywords and carries the relation content and the relation weight. In practice, an edge is represented by a triple (subject, predicate, object): the relation between the subject node and the object node is the predicate. The user customizes the rules used for information categorization by customizing this rule relation graph;

(2) Rule-based word segmentation of the query sentence. All keywords in the rule are obtained by traversing the user-customized rule relation graph and form a keyword set. After the user enters a query sentence, the matching keywords are looked up in the keyword set, yielding the word segmentation result;

(3) Rule-based search keyword expansion. Each keyword in the word segmentation result of step (2) is processed in turn as a core keyword. Under the control of the user-customized number of search layers, keywords close or related to it are obtained together with their associated weights, finally yielding the expanded keyword set. Because the associations between keywords in a rule form a graph topology, the number of layers to which a keyword is expanded must be limited to keep inference efficient; this limit is the user-customized number of search layers;

(4) Using the expanded keyword set, an exact or fuzzy search is performed in the database to obtain the corresponding content. The rule relation graph yields keywords that are related or similar to the core keyword being processed, so retrieving with these keywords returns information content related or similar to the query sentence. Likewise, the rule relation graph yields keywords that have a potential semantic relation with the core keyword, so retrieving with these keywords returns information content that has a potential semantic relation with the query sentence.

The present invention suits any user with information categorization needs. It lets users customize the relevant categorization rules on demand, so they can modify existing rules or create new ones at any time. The key steps of the invention are based on user-customized categorization rules. On the one hand, different customized rules lead the word segmentation and keyword expansion operations to produce different results, so the categorization outcome changes with the rules. On the other hand, the user can keep refining the rules according to the categorization outcome. In addition to results directly associated with the initial query sentence, categorization with the present invention also yields results that are related or similar to it, or potentially related to it, which helps the user obtain more comprehensive information.

Brief description of the drawings

Fig. 1 is a flow chart of the rule-based query sentence word segmentation algorithm of the present invention.

Fig. 2 is a flow chart of the rule-based keyword expansion algorithm of the present invention.

Detailed description of the embodiment

To implement the method, the rule relation graph is constructed in Step 1 and stored in the database. The technical solution is described in detail below, taking as an example an application that implements the method in Java, developed on a development machine in the Eclipse environment.

Step 1: Information categorization rule modeling.

Select an appropriate rule modeling tool and build the rules, described as a graph, according to the user's requirements. The rules used for information categorization are described by a graph. Each node in the graph represents one keyword and carries the keyword content and the keyword weight. Each edge represents the relation between two keywords and carries the relation content and the relation weight. In practice, an edge is represented by a triple (subject, predicate, object): the relation between the subject node and the object node is the predicate. The user customizes the rules used for information categorization by customizing this rule relation graph.
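The rule relation graph described above can be sketched as a small data model. This is a minimal illustration in Python, not the embodiment's Java code: the class names `Keyword` and `Triple`, the helper `build_graph`, and the sample keywords and weights are all hypothetical. It also assumes that a triple is "associated" with a keyword whenever the keyword is its subject or its object, which the text does not state explicitly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    """A node of the rule graph: keyword content plus keyword weight."""
    content: str
    weight: float


@dataclass(frozen=True)
class Triple:
    """An edge of the rule graph: (subject, predicate, object) plus relation weight."""
    subject: Keyword
    predicate: str
    obj: Keyword
    relation_weight: float


def build_graph(triples):
    """Index each triple under both endpoints (assumption: edges are usable in both directions)."""
    graph = defaultdict(list)
    for triple in triples:
        graph[triple.subject.content].append(triple)
        graph[triple.obj.content].append(triple)
    return graph


# Illustrative rule with made-up keywords and weights.
database = Keyword("database", 1.0)
index = Keyword("index", 0.8)
btree = Keyword("B-tree", 0.6)
rule = [
    Triple(database, "uses", index, 0.9),
    Triple(index, "implemented by", btree, 0.7),
]
graph = build_graph(rule)
```

Indexing each triple under both endpoints keeps the later "find the triples associated with this keyword" lookup to a single dictionary access.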

This embodiment defines a web interface through which the user uploads a rule file. The rule file is parsed and the resulting triples are stored in the database for use by the subsequent steps. While the parsed triples are being stored, traversing them also yields the keyword set used by the subsequent steps.

Step 2: Rule-based word segmentation of the query sentence.
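A sketch of the parsing step, in Python, might look as follows. The source does not specify the rule file format, so the tab-separated layout used here is purely an assumption, as are the function name `parse_rules` and the sample data. Database storage is omitted; the sketch only shows how triples and the keyword set fall out of one pass over the file.

```python
def parse_rules(text):
    """Parse rule lines into triples and collect the keyword set.

    Assumed line format (hypothetical, not given in the source):
    subject<TAB>subject_weight<TAB>predicate<TAB>relation_weight<TAB>object<TAB>object_weight
    """
    triples = []
    keywords = {}  # keyword content -> keyword weight
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subj, subj_w, pred, rel_w, obj, obj_w = line.split("\t")
        triples.append((subj, pred, obj, float(rel_w)))
        keywords[subj] = float(subj_w)
        keywords[obj] = float(obj_w)
    return triples, keywords


rule_text = (
    "database\t1.0\tuses\t0.9\tindex\t0.8\n"
    "index\t0.8\timplemented by\t0.7\tB-tree\t0.6\n"
)
triples, keywords = parse_rules(rule_text)
# Length of the longest keyword, needed later as MaxLen by the segmentation step.
max_len = max(len(k) for k in keywords)
```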

Unlike a conventional word segmentation program, the segmentation here is based on the user-customized rules, so the same query sentence may be segmented differently under different rules.

As shown in Fig. 1, the rule-based query sentence word segmentation algorithm is as follows:

Step one, set the subscript i, from which the currently considered character string starts, to i = 0;

Step two, starting from i, if the length of the remaining string is greater than or equal to MaxLen, take a substring of length MaxLen as CutWord; otherwise take the whole remaining substring as CutWord, where MaxLen is the length of the longest keyword in the rule keyword set;

Step three, determine whether CutWord is a word in the rule keyword set; if so, add CutWord to the word segmentation result set and go to step five; otherwise go to step four;

Step four, if the length of CutWord is 0, go to step five; otherwise delete the last character of CutWord and go to step three;

Step five, delete the matched part and add 1 to i; if i is greater than or equal to the length of the query string, the program stops and returns the word segmentation result set; otherwise go to step two.

The variables used in the above rule-based query sentence word segmentation algorithm are listed in Table 1.

Table 1. Variables in the rule-based query sentence word segmentation algorithm.

Variable name	Variable type	Meaning
CutWord	String	The keyword candidate cut from the query sentence on each pass
i	int	The start position of each cut
MaxLen	int	Maximum keyword length; no keyword in the rule keyword set is longer than this value
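The five steps above amount to forward maximum matching against the rule keyword set. The following Python sketch is one reading of them, not the embodiment's Java code. The function name `segment` and the sample keyword sets are hypothetical, and step five is interpreted as "advance past the matched keyword, or by one character when nothing matched", since the text leaves the exact increment ambiguous.

```python
def segment(sentence, keyword_set):
    """Forward maximum matching of a query sentence against the rule keyword set."""
    if not keyword_set:
        return []
    max_len = max(len(k) for k in keyword_set)  # MaxLen
    result = []
    i = 0
    while i < len(sentence):
        # Step two: slicing also covers the "fewer than MaxLen characters left" case.
        cut_word = sentence[i:i + max_len]
        # Steps three and four: shrink from the right until a keyword matches or nothing is left.
        while cut_word and cut_word not in keyword_set:
            cut_word = cut_word[:-1]
        if cut_word:
            result.append(cut_word)
            i += len(cut_word)  # step five (interpretation): skip the matched part
        else:
            i += 1  # no keyword starts at this position
    return result


query = "how does a database index use a B-tree"
rule_a = {"database", "index", "database index", "B-tree"}
rule_b = {"database", "index", "B-tree"}
words_a = segment(query, rule_a)
words_b = segment(query, rule_b)
```

With these two made-up rule sets the same query segments differently: `rule_a` contains the longer keyword "database index", so it wins over the separate keywords "database" and "index" that `rule_b` produces.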

Step 3: Rule-based search keyword expansion.

This step reads the triples from the database to build the rule relation graph. Then, centered on each keyword, it searches for other associated or similar keywords and, by parsing, obtains the relation weight between them and the weight of the related keyword. Finally, all keywords obtained are sorted by their combined weight.

As shown in Fig. 2, the rule-based search keyword expansion algorithm is as follows:

Step one, if the word segmentation result set is empty, go to step nine; otherwise take one keyword out of it as the current core keyword, delete it from the set, and go to step two;

Step two, clear the to-be-expanded keyword set, add the core keyword's information to the to-be-expanded keyword set and to the expansion result set, set the current search layer j = 2, and go to step three;

Step three, if j exceeds the customized number of search layers, go to step one; otherwise add 1 to j and go to step four;

Step four, if the to-be-expanded keyword set is empty, go to step seven; otherwise select one keyword from it as the current keyword, delete it from the set, and go to step five;

Step five, centered on the current keyword, search the rule for the set of triples associated with it, and go to step six;

Step six, if the associated triple set is empty, go to step four; otherwise select one triple from it and delete it. Parse the triple to obtain a keyword related to the current keyword, combine the parsed relation weight with the keyword weight into a combined weight, store the related keyword's information, including the combined weight, in an intermediate expansion set, and go to step six;

Step seven, remove duplicate elements from the intermediate expansion set; if it is empty, go to step three; otherwise select one keyword from it and go to step eight;

Step eight, add the selected keyword to the expansion result set and determine whether it has already been expanded; if not, add it to the to-be-expanded keyword set; go to step seven;

Step nine, remove duplicate elements from the expansion result set, sort it by weight in descending order, return the result, and the program stops.

The variables used in the above rule-based keyword expansion algorithm are defined in Table 2.

Table 2. Variables in the rule-based keyword expansion algorithm.

Note: AtomWord in Table 2 denotes keyword information, comprising the content and the weight of a keyword.

Tripe in Table 2 denotes triple information, i.e. (subject, predicate, object).
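The expansion algorithm is, in effect, a layer-by-layer (breadth-first) walk over the rule relation graph, bounded by the customized number of search layers. The Python sketch below restructures the nine steps into nested loops rather than reproducing the jumps literally. Several points are assumptions rather than facts from the source: the function name `expand`, treating edges as usable in both directions, multiplying the current keyword's weight by the relation weight to get the combined weight, keeping the highest weight when duplicates are removed, and counting layers from zero.

```python
from collections import defaultdict


def expand(core_keywords, triples, keyword_weight, max_layers):
    """Expand each core keyword over the rule graph, layer by layer.

    triples: iterable of (subject, predicate, object, relation_weight).
    Returns (keyword, weight) pairs sorted by descending weight.
    """
    neighbours = defaultdict(list)
    for subj, _pred, obj, rel_w in triples:
        neighbours[subj].append((obj, rel_w))
        neighbours[obj].append((subj, rel_w))

    best = {}  # expansion result set; duplicates collapse to the highest weight
    for core in core_keywords:
        frontier = {core: keyword_weight.get(core, 1.0)}  # to-be-expanded keyword set
        expanded = set()
        best[core] = max(best.get(core, 0.0), frontier[core])
        for _layer in range(max_layers):
            intermediate = {}  # intermediate expansion set for this layer
            for word, weight in frontier.items():
                expanded.add(word)
                for other, rel_w in neighbours[word]:
                    combined = weight * rel_w  # assumed combination rule
                    intermediate[other] = max(intermediate.get(other, 0.0), combined)
            frontier = {}
            for word, weight in intermediate.items():
                best[word] = max(best.get(word, 0.0), weight)
                if word not in expanded:
                    frontier[word] = weight  # expand it in the next layer
            if not frontier:
                break
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


triples = [
    ("database", "uses", "index", 0.9),
    ("index", "implemented by", "B-tree", 0.7),
]
weights = {"database": 1.0, "index": 0.8, "B-tree": 0.6}
one_layer = expand(["database"], triples, weights, max_layers=1)
two_layers = expand(["database"], triples, weights, max_layers=2)
```

With one layer only "index" is reached from "database"; with two layers "B-tree" is reached as well, with a lower combined weight, which illustrates how the layer limit controls how far the expansion spreads.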

Once the keyword expansion result is obtained, these keywords are used for an exact or fuzzy search in the database to obtain the retrieval results, which are finally sorted by the associated weights of the keywords. In implementing the present invention, the user can customize the relevant information categorization rules on demand, including creating new rules and changing existing ones. When retrieving, the user can search directly with a sentence and is not limited to a single keyword. Based on the user-customized categorization rules, the invention segments the query sentence and extracts the search keywords relevant to those rules. For each keyword obtained by segmentation, keyword expansion within the user-customized rules yields other related or similar keywords, and retrieving from the database with these keywords returns content related or similar to the user's initial query. Keywords that have a potential semantic association with a search keyword in the rules are obtained in the same way, so content potentially related to the user's initial query is obtained as well.
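The final retrieval step can be sketched in Python with an in-memory SQLite database. The table `documents(id, content)`, the function name `retrieve`, and the sample rows are hypothetical; the source does not name a database or schema. Fuzzy search is approximated with SQL `LIKE`, exact search with equality, and each row is ranked by the highest weight among the keywords that matched it, which is one possible reading of "sorted by the associated weights".

```python
import sqlite3


def retrieve(conn, expanded_keywords, fuzzy=True):
    """Search the hypothetical documents table with every expanded keyword.

    expanded_keywords: iterable of (keyword, weight) pairs.
    Returns (content, weight) pairs sorted by descending weight.
    """
    scores = {}
    for keyword, weight in expanded_keywords:
        if fuzzy:
            # Note: % and _ inside a keyword are not escaped in this sketch.
            sql = "SELECT id, content FROM documents WHERE content LIKE ?"
            param = f"%{keyword}%"
        else:
            sql = "SELECT id, content FROM documents WHERE content = ?"
            param = keyword
        for row_id, content in conn.execute(sql, (param,)):
            previous = scores.get(row_id, (content, 0.0))
            scores[row_id] = (content, max(previous[1], weight))
    return sorted(scores.values(), key=lambda item: item[1], reverse=True)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO documents (content) VALUES (?)",
    [
        ("Tuning a database for fast reads",),
        ("How a B-tree keeps keys sorted",),
        ("Cooking pasta at home",),
    ],
)
hits = retrieve(conn, [("database", 1.0), ("index", 0.9), ("B-tree", 0.63)])
```

Here the row about B-trees is returned even though the original keyword was "database", because the expanded keyword "B-tree" matched it; the unrelated row is not returned.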

Claims

1. An information categorization method supporting user-defined categorization rules, characterized in that the method comprises the following steps:

(1) Information categorization rule modeling: the rules used for information categorization are described by a graph; each node in the graph represents one keyword and carries the keyword content and the keyword weight; each edge in the graph represents the relation between two keywords and carries the relation content and the relation weight; in practice, an edge is represented by a triple (subject, predicate, object), i.e. the relation between the subject node and the object node is the predicate; the user customizes the rules used for information categorization by customizing this rule relation graph;

(2) Rule-based word segmentation of the query sentence: all keywords in the rule are obtained by traversing the user-customized rule relation graph and form a keyword set; after the user enters a query sentence, the matching keywords are looked up in the keyword set, yielding the word segmentation result;

(3) Rule-based search keyword expansion: each keyword in the word segmentation result of step (2) is processed in turn as a core keyword; under the control of the user-customized number of search layers, keywords close or related to it are obtained together with their associated weights, finally yielding the expanded keyword set;

(4) Using the expanded keyword set, an exact or fuzzy search is performed in the database to obtain the corresponding content.
2. The information categorization method supporting user-defined categorization rules according to claim 1, characterized in that the information categorization rule modeling of step (1) includes creating or modifying information categorization rules, i.e. the user can create a new graph or modify an existing graph.
3. The information categorization method supporting user-defined categorization rules according to claim 1, characterized in that the rule-based word segmentation of the query sentence in step (2) proceeds as follows:

First step, set the subscript i, from which the currently considered character string starts, to i = 0;

Second step, starting from i, if the length of the remaining string is greater than or equal to MaxLen, take a substring of length MaxLen as CutWord; otherwise take the whole remaining substring as CutWord, where MaxLen is the length of the longest keyword in the rule keyword set;

Third step, determine whether CutWord is a word in the rule keyword set; if so, add CutWord to the word segmentation result set and go to the fifth step; otherwise go to the fourth step;

Fourth step, if the length of CutWord is 0, go to the fifth step; otherwise delete the last character of CutWord and go to the third step;

Fifth step, delete the matched part and add 1 to i; if i is greater than or equal to the length of the query string, the program stops and returns the word segmentation result set; otherwise go to the second step.
4. The information categorization method supporting user-defined categorization rules according to claim 1, characterized in that the rule-based search keyword expansion in step (3) proceeds as follows:

First step, if the word segmentation result set is empty, go to the ninth step; otherwise take one keyword out of it as the current core keyword, delete it from the set, and go to the second step;

Second step, clear the to-be-expanded keyword set, add the core keyword's information to the to-be-expanded keyword set and to the expansion result set, set the current search layer j = 2, and go to the third step;

Third step, if j exceeds the customized number of search layers, go to the first step; otherwise add 1 to j and go to the fourth step;

Fourth step, if the to-be-expanded keyword set is empty, go to the seventh step; otherwise select one keyword from it as the current keyword, delete it from the set, and go to the fifth step;

Fifth step, centered on the current keyword, search the rule for the set of triples associated with it, and go to the sixth step;

Sixth step, if the associated triple set is empty, go to the fourth step; otherwise select one triple from it and delete it, parse the triple to obtain a keyword related to the current keyword, combine the parsed relation weight with the keyword weight into a combined weight, store the related keyword's information, including the combined weight, in an intermediate expansion set, and go to the seventh step;

Seventh step, remove duplicate elements from the intermediate expansion set; if it is empty, go to the third step; otherwise select one keyword from it and go to the eighth step;

Eighth step, add the selected keyword to the expansion result set and determine whether it has already been expanded; if not, add it to the to-be-expanded keyword set; go to the seventh step;

Ninth step, remove duplicate elements from the expansion result set, sort it by weight in descending order, return the result, and the program stops.