CN103034726B

CN103034726B - Text filtering system and method

Info

Publication number: CN103034726B
Application number: CN201210553556.4A
Authority: CN
Inventors: 闫俊英
Original assignee: Shanghai Dianji University
Current assignee: Cloud open source data technology (Shanghai) Co., Ltd.
Priority date: 2012-12-18
Filing date: 2012-12-18
Publication date: 2016-05-25
Anticipated expiration: 2032-12-18
Also published as: CN103034726A

Abstract

The invention discloses a text filtering system and method. The method comprises the following steps: establishing a filtering model according to the user's filtering requirements; training on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and extracting the feature words of the text to be filtered, identifying the entities among the feature words, extracting entity relations to form an entity-relation-pair vector for the text to be filtered, calculating the similarity between the filtering model and the text to be filtered, and filtering out texts whose similarity exceeds a similarity threshold. Based on the established user filtering model, the invention uses entity relation extraction to express the features of the filtered text accurately, and can thereby improve filtering accuracy.

Description

Text filtering system and method

Technical field

The present invention relates to a text filtering system and method, and in particular to a text filtering system and method based on entity relation extraction.

Background technology

Text filtering has received considerable attention for many years and has good application prospects in fields such as information retrieval and filtering. Among current text filtering methods, some adopt fuzzy clustering based on a genetic algorithm: each individual in the population is clustered directly using a fuzzy similarity matrix, and the fitness of the population is then assessed from the clustering result using a proposed fitness function. The filtering precision of this approach depends on the quality of the clustering, and it cannot express the user's filtering requirements well. Some adopt an improved classification algorithm to filter undesirable text, improving the traditional KNN algorithm at the data level; these likewise do not express the user's requirements accurately enough. Some filtering methods use an ontology to express the user's filtering requirements, but the method used to build the ontology library is not accurate enough, which greatly affects filtering accuracy. Some filtering algorithms adopt adaptive learning: although they can adaptively learn the user's filtering profile and adjust the filtering model, they still rely on feature vectors, which cannot accurately express the user's filtering requirements.

Summary of the invention

To overcome the above deficiencies of the prior art, the object of the present invention is to provide a text filtering system and method which, based on an established user filtering model, uses entity relation extraction to express the features of the filtered text accurately and thereby improves filtering accuracy.

To achieve the above and other objects, the present invention proposes a text filtering system, comprising at least:

a filtering model establishing module, for establishing a filtering model according to the user's filtering requirements;

an adaptive learning module, which trains on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and

a text filtering module, which extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold.

Further, the filtering model establishing module first determines, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework.

Further, the ontology is represented by the triple Topic(C, P, S), where C denotes the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure, and is represented with a vector space model; P describes the attributes of concepts and relations; and S denotes the structural relations between classes.

Further, the adaptive learning module trains on the set of filtering samples using an incremental iterative method.

Further, the text filtering module also comprises:

a preprocessing module, which performs preprocessing operations such as stop-word removal on the text to be filtered;

a feature word extraction module, which extracts from the preprocessed text a feature vector expressing its content;

an entity relation extraction module, which first identifies entities according to the feature vector of the extracted page and obtains the contextual features of the entities based on heuristic rules, then builds feature vectors of the contextual feature words, quantifies the feature items using a feature frequency function, clusters the entity pairs using a k-means-based joint clustering algorithm, and finally labels the relations of the entity pairs; and

a similarity calculation module, which calculates the similarity between the text to be filtered and the filtering model and filters out texts whose similarity exceeds the similarity threshold.

Further, the similarity calculation module, following the vector space model, uses the cosine of the angle between two feature vectors to represent their similarity, calculates the similarity between the text to be filtered and the filtering model, and filters out texts that exceed the set threshold.

To achieve the above and other objects, the present invention also provides a text filtering method, comprising the following steps:

Step 1: establish a filtering model according to the user's filtering requirements;

Step 2: train on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and

Step 3: extract the feature words of the text to be filtered, identify the entities among the feature words, extract entity relations to form an entity-relation-pair vector for the text to be filtered, calculate the similarity between the filtering model and the text to be filtered, and filter out texts whose similarity exceeds the similarity threshold.

Further, step 3 comprises the following steps:

performing preprocessing operations such as stop-word removal on the text to be filtered;

extracting from the preprocessed text a feature vector expressing its content;

extracting entity relations to form an entity-relation-pair vector for the text to be filtered; and

calculating the similarity between the text to be filtered and the filtering model, and filtering out texts whose similarity exceeds the similarity threshold.

Further, the entity relation extraction step also comprises the following steps:

identifying entities according to the feature vector of the extracted page;

obtaining the contextual features of the entities based on heuristic rules;

building feature vectors of the contextual feature words, and quantifying the feature items using a feature frequency function;

clustering the entity pairs using a k-means-based joint clustering algorithm; and

labeling the relations of the entity pairs. The text to be filtered is thus represented by a vector of the labeled entity pairs and their relations.

Further, step 1 also comprises the following steps:

determining, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover;

collecting and analyzing information within the domain of the ontology, identifying the key concepts and the relations between them, and expressing them in precise terms; and

establishing the ontology framework.

Compared with the prior art, the text filtering system and method of the present invention establishes the filtering model using an ontology and, at the filtering stage, uses entity relation extraction to label the entity relations among the feature words of the text to be filtered, so that the text to be filtered can be expressed more accurately. The similarity between the text to be filtered and the filtering model is then calculated, and texts above the threshold are filtered out. Because the invention can precisely express the user's filtering requirements, it achieves higher filtering accuracy.

Brief description of the drawings

Fig. 1 is a system architecture diagram of the text filtering system of the present invention;

Fig. 2 is a flow chart of the steps of the text filtering method of the present invention;

Fig. 3 is a detailed flow chart of step 203 of Fig. 2 in a preferred embodiment of the present invention.

Detailed description of the invention

The embodiments of the present invention are described below through specific examples with reference to the accompanying drawings. Those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other specific examples, and the details in this specification may be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.

Fig. 1 is a system architecture diagram of the text filtering system of the present invention. As shown in Fig. 1, the text filtering system of the present invention comprises at least: a filtering model establishing module 10, an adaptive learning module 11, and a text filtering module 12.

The filtering model establishing module 10 establishes a filtering model according to the user's filtering requirements. It first determines, according to those requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework. In a preferred embodiment of the present invention, the ontology is represented by the triple Topic(C, P, S), where: C denotes the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes of concepts and relations; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with a vector space model (VSM) using two-tuples C_i(Key_i, Weight_i), where Key_i denotes a keyword and Weight_i denotes the weight of that keyword.
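The triple Topic(C, P, S) described above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class names `Concept` and `Topic`, the field names, and the `vector` helper (which keeps the maximum weight when a keyword appears in several concepts) are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Concept:
    """One concept class C_i, stored as Key_i -> Weight_i pairs (VSM)."""
    name: str
    keywords: Dict[str, float] = field(default_factory=dict)


@dataclass
class Topic:
    """Hypothetical container for the ontology triple Topic(C, P, S)."""
    # C: concept classes, keyed by concept name
    concepts: Dict[str, Concept] = field(default_factory=dict)
    # P: attribute names attached to each concept or relation
    properties: Dict[str, List[str]] = field(default_factory=dict)
    # S: structural relations between classes as (parent, child) pairs
    structure: List[Tuple[str, str]] = field(default_factory=list)

    def vector(self) -> Dict[str, float]:
        """Flatten all concepts into one keyword-weight vector.

        Assumption: when a keyword occurs in several concepts,
        the largest weight is kept.
        """
        merged: Dict[str, float] = {}
        for concept in self.concepts.values():
            for key, weight in concept.keywords.items():
                merged[key] = max(weight, merged.get(key, 0.0))
        return merged
```

Flattening the concepts into a single keyword-weight vector is one possible way to obtain the feature vector of the filtering model that the similarity calculation later compares against.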

The adaptive learning module 11 trains on a set of filtering samples to form an ontology library that approximates the user's filtering requirements. In a preferred embodiment of the present invention, the adaptive learning module 11 trains on the set of filtering samples using an incremental iterative method: a fixed value m is set as the window size for observing the number of newly appearing documents that need to be filtered, m is set flexibly according to the evaluation metric parameter n, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added each time must be determined so as to avoid introducing too much noise. From the effective feature values obtained, a certain number are selected and added to the existing ontology library, enriching the user's filtering requirement model. Thus, with continued learning, the ontology library comes ever closer to the user's filtering requirements, and the features it still needs gradually decrease.
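The incremental iterative training described above might be sketched as follows. The patent does not specify how candidate features are scored or how many are added per step, so this sketch assumes relative term frequency within each window as the score and a fixed cap per step; the function name `incremental_train` and the parameter `features_per_step` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def incremental_train(
    model: Dict[str, float],
    samples: List[List[str]],
    window_size: int = 2,        # the fixed value m in the text
    iterations: int = 5,         # set to 5 in the preferred embodiment
    features_per_step: int = 2,  # hypothetical cap that limits noise
) -> Dict[str, float]:
    """Grow a keyword-weight model from tokenized filtering samples."""
    model = dict(model)  # work on a copy, leave the caller's model intact
    for _ in range(iterations):
        for start in range(0, len(samples), window_size):
            window = samples[start:start + window_size]
            counts = Counter(tok for doc in window for tok in doc)
            total = sum(counts.values())
            if total == 0:
                continue
            # Candidate features: tokens not yet in the model, ordered by
            # descending frequency (ties broken alphabetically).
            candidates = sorted(
                (tok for tok in counts if tok not in model),
                key=lambda tok: (-counts[tok], tok),
            )
            for tok in candidates[:features_per_step]:
                model[tok] = counts[tok] / total
    return model
```

With each pass fewer unseen tokens remain, which mirrors the observation that the features the ontology library still needs decrease as learning proceeds.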

The text filtering module 12 extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold. Specifically, the text filtering module 12 further comprises: a preprocessing module 120, a feature word extraction module 121, an entity relation extraction module 122, and a similarity calculation module 123. The preprocessing module 120 performs preprocessing operations such as stop-word removal on the text to be filtered. The feature word extraction module 121 extracts from the preprocessed text a feature vector expressing its content. The entity relation extraction module 122 first identifies entities according to the feature vector of the extracted page and obtains the contextual features of the entities based on heuristic rules; it then builds feature vectors of the contextual feature words, quantifies the feature items using a feature frequency function, clusters the entity pairs using a k-means-based joint clustering algorithm, and finally labels the relations of the entity pairs, so that the text to be filtered is represented by a vector of the labeled entity pairs and their relations. The similarity calculation module 123 calculates the similarity between the text to be filtered and the filtering model. Because the text to be filtered is represented by a feature vector of entity pairs and relations, and the filtering model is also a feature vector, the cosine of the angle between the two feature vectors can, following the vector space model, represent their similarity. The similarity Sim_j between the text to be filtered and the filtering model can thus be calculated, and texts exceeding the set threshold are filtered out.
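The work of the preprocessing module 120 and the feature word extraction module 121 can be illustrated with a short sketch. It rests on several assumptions not found in the patent: the stop-word list `STOP_WORDS` is a hypothetical placeholder, tokenization uses a simple regular expression suited to space-delimited languages (Chinese text would need a word segmenter), and features are weighted by relative term frequency.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def feature_vector(tokens: List[str]) -> Dict[str, float]:
    """Weight each remaining token by its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}
```

For instance, `preprocess("The casino is in the city")` keeps only `casino` and `city`, and `feature_vector` then assigns each of them a weight of 0.5.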

Fig. 2 is a flow chart of the steps of the text filtering method of the present invention. As shown in Fig. 2, the text filtering method of the present invention comprises the following steps:

Step 201: establish a filtering model according to the user's filtering requirements. Specifically, first determine, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; then collect and analyze information within that domain, identify the key concepts and the relations between them, and express them in precise terms; finally, establish the ontology framework. In a preferred embodiment of the present invention, the ontology is represented by the triple Topic(C, P, S), where: C denotes the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes of concepts and relations; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with a vector space model (VSM) using two-tuples C_i(Key_i, Weight_i), where Key_i denotes a keyword and Weight_i denotes the weight of that keyword.

Step 202: train on a set of filtering samples to form an ontology library that approximates the user's filtering requirements. In a preferred embodiment of the present invention, the set of filtering samples is trained using an incremental iterative method: a fixed value m is set as the window size for observing the number of newly appearing documents that need to be filtered, m is set flexibly according to the evaluation metric parameter n, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added each time must be determined so as to avoid introducing too much noise. From the effective feature values obtained, a certain number are selected and added to the existing ontology library, enriching the user's filtering requirement model. Thus, with continued learning, the ontology library comes ever closer to the user's filtering requirements, and the features it still needs gradually decrease.

Step 203: extract the feature words of the text to be filtered, identify the entities among the feature words, extract entity relations to form an entity-relation-pair vector for the text to be filtered, calculate the similarity between the filtering model and the text to be filtered, and filter out texts whose similarity exceeds the similarity threshold.

Fig. 3 is a detailed flow chart of step 203 of Fig. 2 in a preferred embodiment of the present invention. As shown in Fig. 3, step 203 further comprises:

Step 301: perform preprocessing operations such as stop-word removal on the text to be filtered;

Step 302: extract from the preprocessed text a feature vector expressing its content;

Step 303: extract entity relations to form an entity-relation-pair vector for the text to be filtered; and

Step 304: calculate the similarity between the text to be filtered and the filtering model, and filter out texts whose similarity exceeds the similarity threshold. Because the text to be filtered is represented by a feature vector of entity pairs and relations, and the filtering model is also a feature vector, the cosine of the angle between the two feature vectors can, following the vector space model, represent their similarity. The similarity Sim_j between the text to be filtered and the filtering model can thus be calculated, and texts exceeding the set threshold are filtered out.
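The cosine similarity and threshold test of step 304 can be sketched as follows. This is a minimal illustration that assumes both the text and the filtering model are sparse keyword-weight dictionaries; the function names `cosine_similarity` and `should_filter` and the default threshold of 0.5 are hypothetical, since the patent does not state a threshold value.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def should_filter(
    text_vec: Dict[str, float],
    model_vec: Dict[str, float],
    threshold: float = 0.5,  # hypothetical value, not given in the patent
) -> bool:
    """Return True when the text's similarity Sim_j exceeds the threshold."""
    return cosine_similarity(text_vec, model_vec) > threshold
```

A text whose vector points in nearly the same direction as the filtering model scores close to 1 and is filtered out, while a text sharing no features with the model scores 0 and passes.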

Preferably, step 303 further comprises the following steps:

a. identify entities according to the feature vector of the extracted page;

b. obtain the contextual features of the entities based on heuristic rules;

c. build feature vectors of the contextual feature words, and quantify the feature items using a feature frequency function;

d. cluster the entity pairs using a k-means-based joint clustering algorithm; and

e. label the relations of the entity pairs. The text to be filtered is thus represented by a vector of the labeled entity pairs and their relations.
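Steps c to e above can be illustrated with a small sketch. The patent does not detail its "joint clustering algorithm of k-means", so plain k-means with a deterministic initialization (the first k points) is used here as a stand-in; entity recognition and the heuristic context rules of steps a and b are assumed to have already produced the `contexts` input. Labeling each cluster with its most frequent context word is likewise an assumption, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def context_vectors(contexts: Dict[Pair, List[str]]) -> Dict[Pair, List[float]]:
    """Step c: quantify context words by relative frequency."""
    vocab = sorted({w for words in contexts.values() for w in words})
    vectors: Dict[Pair, List[float]] = {}
    for pair, words in contexts.items():
        counts = Counter(words)
        total = len(words) or 1
        vectors[pair] = [counts[w] / total for w in vocab]
    return vectors


def kmeans(points: List[List[float]], k: int, iterations: int = 10) -> List[int]:
    """Step d: plain k-means, initialized from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def label_relations(contexts: Dict[Pair, List[str]], k: int) -> Dict[Pair, str]:
    """Step e: name each cluster after its most frequent context word."""
    vectors = context_vectors(contexts)
    pairs = list(vectors)
    labels = kmeans([vectors[p] for p in pairs], k)
    result: Dict[Pair, str] = {}
    for c in set(labels):
        members = [p for p, lab in zip(pairs, labels) if lab == c]
        counts = Counter(w for p in members for w in contexts[p])
        relation = min(counts, key=lambda w: (-counts[w], w))
        for p in members:
            result[p] = relation
    return result
```

The resulting mapping from entity pairs to relation labels is the kind of labeled entity-pair representation that the similarity calculation would then consume.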

As can be seen, the text filtering system and method of the present invention establishes the filtering model using an ontology and, at the filtering stage, uses entity relation extraction to label the entity relations among the feature words of the text to be filtered, so that the text to be filtered can be expressed more accurately. The similarity between the text to be filtered and the filtering model is then calculated, and texts above the threshold are filtered out. Because the invention can precisely express the user's filtering requirements, it achieves higher filtering accuracy.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the invention. Therefore, the scope of protection of the present invention shall be as set out in the claims.

Claims

1. A text filtering system, comprising at least:

a text filtering module, which extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold;

the text filtering module further comprising:

a preprocessing module, which performs stop-word removal preprocessing on the text to be filtered;

2. The text filtering system as claimed in claim 1, characterized in that: the filtering model establishing module first determines, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework.

3. The text filtering system as claimed in claim 2, characterized in that: the ontology is represented by the triple Topic(C, P, S), where C denotes the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure, and is represented with a vector space model; P describes the attributes of concepts and relations; and S denotes the structural relations between classes.

4. The text filtering system as claimed in claim 1, characterized in that: the adaptive learning module trains on the set of filtering samples using an incremental iterative method.

5. The text filtering system as claimed in claim 1, characterized in that: the similarity calculation module, following the vector space model, uses the cosine of the angle between two feature vectors to represent their similarity, calculates the similarity between the text to be filtered and the filtering model, and filters out texts that exceed the set threshold.

6. A text filtering method, comprising the following steps:

Step 1: establish a filtering model according to the user's filtering requirements;

Step 3: perform stop-word removal preprocessing on the text to be filtered; extract from the preprocessed text a feature vector expressing its content; identify entities according to the feature vector of the extracted page; obtain the contextual features of the entities based on heuristic rules; build feature vectors of the contextual feature words, and quantify the feature items using a feature frequency function; cluster the entity pairs using a k-means-based joint clustering algorithm; label the relations of the entity pairs to form an entity-relation-pair vector for the text to be filtered; calculate the similarity between the filtering model and the text to be filtered; and filter out texts whose similarity exceeds the similarity threshold.

7. The text filtering method as claimed in claim 6, characterized in that step 1 also comprises the following steps:

establishing the ontology framework.