CN103034726B - Text filtering system and method - Google Patents

Publication number
CN103034726B
Authority
CN
China
Prior art keywords
text
entity
filtering
filtered
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210553556.4A
Other languages
Chinese (zh)
Other versions
CN103034726A (en)
Inventor
闫俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloud open source data technology (Shanghai) Co., Ltd.
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention discloses a text filtering system and method. The method comprises the following steps: establishing a filtering model according to the user's filtering requirements; training on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and extracting the feature words of the text to be filtered, identifying the entities among the feature words, extracting the entity relations to form an entity-relation-pair vector for the text to be filtered, calculating the similarity between the filtering model and the text to be filtered, and filtering out texts whose similarity exceeds the similarity threshold. Based on the established user filtering model, the invention uses entity relation extraction to express the features of the text to be filtered accurately, which can improve filtering accuracy.

Description

Text filtering system and method
Technical field
The present invention relates to a text filtering system and method, and in particular to a text filtering system and method based on entity relation extraction.
Background art
Text filtering has attracted growing attention for many years and has promising applications in fields such as information retrieval and filtering. Among current text filtering methods, some adopt fuzzy clustering based on a genetic algorithm: each individual in the population is clustered directly using a fuzzy similarity matrix, and the fitness of the population is then evaluated with a proposed fitness function according to the clustering result. The filtering precision of this approach depends on the quality of the clustering, and it cannot express the user's filtering requirements well. Some methods filter objectionable text with an improved classification algorithm, improving the traditional KNN algorithm at the data layer, but they likewise express the user's requirements with insufficient accuracy. Some filtering methods use an ontology to express the user's filtering requirements, but the method used to build the ontology library is not accurate enough, which greatly affects filtering accuracy. Some filtering algorithms use adaptive learning, which can learn the user's filtering profile adaptively and adjust the filtering model, but they still rely on feature vectors, which cannot express the user's filtering requirements accurately.
Summary of the invention
To overcome the above deficiencies of the prior art, the object of the present invention is to provide a text filtering system and method that, based on an established user filtering model, uses entity relation extraction to express the features of the text to be filtered accurately, and can thereby improve filtering accuracy.
To achieve the above and other objects, the present invention proposes a text filtering system, comprising at least:
A filtering model establishing module, for establishing a filtering model according to the user's filtering requirements;
An adaptive learning module, which trains on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and
A text filtering module, which extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts the entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold.
Further, the filtering model establishing module first determines, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework.
Further, the ontology is represented as a triple Topic(C, P, S), where C is the set of concept classes abstracted from the noun concepts of the filtering domain, each class sharing the same attributes and behavioral structure, and is represented with a vector space model; P describes the attributes of the concepts and relations; and S represents the structural relations between classes.
Further, the adaptive learning module trains on the set of filtering samples using an incremental iterative method.
Further, the text filtering module also comprises:
A preprocessing module, which performs preprocessing operations such as stop-word removal on the text to be filtered;
A feature word extraction module, which extracts from the preprocessed text a feature vector expressing the content of the text;
An entity relation extraction module, which first identifies entities according to the feature vector of the extracted page and obtains the contextual features of the entities based on heuristic rules, then builds a feature vector of the contextual feature words, quantifies the feature items with a feature frequency function, clusters the entity pairs with a k-means-based joint clustering algorithm, and finally labels the relations of the entity pairs; and
A similarity calculation module, which calculates the similarity between the text to be filtered and the filtering model and filters out texts whose similarity exceeds the similarity threshold.
Further, the similarity calculation module, following the vector space model, uses the cosine of the angle between two feature vectors to represent their similarity, calculates the similarity between the text to be filtered and the filtering model, and filters out texts that exceed the set threshold.
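The cosine measure named above can be written out explicitly. The formula below is the standard vector space model cosine; the symbols w_i (weight of feature i in the filtering model), d_ij (weight of feature i in the j-th text to be filtered) and n (number of features) are notation assumed here for illustration, while Sim_j is the similarity symbol used later in the description.

```latex
\mathrm{Sim}_j = \cos\theta_j
  = \frac{\sum_{i=1}^{n} w_i \, d_{ij}}
         {\sqrt{\sum_{i=1}^{n} w_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_{ij}^{2}}}
```

With non-negative weights, Sim_j lies between 0 and 1, and a text is filtered out when Sim_j exceeds the set threshold.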
To achieve the above and other objects, the present invention also provides a text filtering method, comprising the following steps:
Step 1: establish a filtering model according to the user's filtering requirements;
Step 2: train on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and
Step 3: extract the feature words of the text to be filtered, identify the entities among the feature words, extract the entity relations to form an entity-relation-pair vector for the text to be filtered, calculate the similarity between the filtering model and the text to be filtered, and filter out texts whose similarity exceeds the similarity threshold.
Further, step 3 comprises the following steps:
Perform preprocessing operations such as stop-word removal on the text to be filtered;
Extract from the preprocessed text a feature vector expressing the content of the text;
Extract the entity relations to form an entity-relation-pair vector for the text to be filtered; and
Calculate the similarity between the text to be filtered and the filtering model, and filter out texts whose similarity exceeds the similarity threshold.
Further, the entity relation extraction step comprises the following steps:
Identify entities according to the feature vector of the extracted page;
Obtain the contextual features of the entities based on heuristic rules;
Build a feature vector of the contextual feature words, and quantify the feature items with a feature frequency function;
Cluster the entity pairs with a k-means-based joint clustering algorithm; and
Label the relations of the entity pairs. The text to be filtered is thus represented by a vector of labeled entity pairs and their relations.
Further, step 1 comprises the following steps:
According to the user's filtering requirements, determine the domain and scope that the ontology to be built will cover;
Within that domain, collect and analyze information, identify the key concepts and the relations between them, and express them in precise terms; and
Establish the ontology framework.
Compared with the prior art, the text filtering system and method of the present invention establish the filtering model with an ontology and, at the filtering stage, use entity relation extraction to label the entity relations among the feature words of the text to be filtered, so that the text can be expressed more accurately. The similarity between the text to be filtered and the filtering model is then calculated, and texts above the threshold are filtered out. Because the invention can express the user's filtering requirements precisely, it achieves higher filtering accuracy.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the text filtering system of the present invention;
Fig. 2 is a flow chart of the steps of the text filtering method of the present invention;
Fig. 3 is a detailed flow chart of step 203 of Fig. 2 in the preferred embodiment of the present invention.
Detailed description of the invention
The embodiments of the present invention are described below through specific examples with reference to the accompanying drawings; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other specific examples, and the details in this specification may be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.
Fig. 1 is a system architecture diagram of the text filtering system of the present invention. As shown in Fig. 1, the text filtering system comprises at least: a filtering model establishing module 10, an adaptive learning module 11, and a text filtering module 12.
The filtering model establishing module 10 establishes a filtering model according to the user's filtering requirements. It first determines, according to those requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework. In the preferred embodiment of the present invention, the ontology is represented as a triple Topic(C, P, S), where: C is the set of concept classes abstracted from the noun concepts of the filtering domain, each class sharing the same attributes and behavioral structure; P describes the attributes of the concepts and relations; and S represents the structural relations between classes, such as parent class and subclass. C is represented with a vector space model (VSM) as two-tuples C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword.
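One way to picture the triple Topic(C, P, S) is as a small data structure. The sketch below is illustrative only: the class names `Concept` and `Topic`, the field names, the example keywords, and the choice to keep the larger weight when a keyword appears in several concepts are all assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept class in C, stored as Key_i -> Weight_i pairs."""
    name: str
    keywords: dict[str, float] = field(default_factory=dict)


@dataclass
class Topic:
    """Ontology triple Topic(C, P, S)."""
    concepts: dict[str, Concept] = field(default_factory=dict)           # C
    properties: dict[str, dict[str, str]] = field(default_factory=dict)  # P
    structure: list[tuple[str, str]] = field(default_factory=list)       # S: (parent, child)

    def to_vector(self) -> dict[str, float]:
        """Flatten all concept keywords into one weighted VSM vector.

        Assumption: a keyword shared by several concepts keeps its largest weight.
        """
        vector: dict[str, float] = {}
        for concept in self.concepts.values():
            for key, weight in concept.keywords.items():
                vector[key] = max(vector.get(key, 0.0), weight)
        return vector


# Hypothetical example: a tiny filtering ontology with one parent/child relation.
topic = Topic()
topic.concepts["spam"] = Concept("spam", {"lottery": 0.9, "prize": 0.7})
topic.concepts["scam"] = Concept("scam", {"prize": 0.8, "wire": 0.5})
topic.structure.append(("spam", "scam"))
profile = topic.to_vector()
```

Flattening C into a single keyword-weight vector is what lets the filtering model be compared with a text vector by cosine similarity later on.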
The adaptive learning module 11 trains on a set of filtering samples to form an ontology library that approximates the user's filtering requirements. In the preferred embodiment of the present invention, the adaptive learning module 11 trains on the set of filtering samples using an incremental iterative method. A fixed value m is set as the window size for observing the number of newly appearing documents that need to be filtered; it is set flexibly according to the parameter n of the evaluation metric, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each iteration must be determined so as to avoid introducing too much noise. From the newly found effective feature values, a certain number are selected and added to the existing ontology library, enriching the model of the user's filtering requirements. As learning continues, the ontology library therefore comes closer and closer to the user's filtering requirements, and the necessary features of the ontology library gradually decrease.
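The incremental training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the function name `incremental_train`, the parameter `features_per_round`, and the use of relative term frequency as the weight of a newly added feature are hypothetical. Only the window size m and the five iterations come from the description.

```python
from collections import Counter


def incremental_train(profile, samples, m=2, iterations=5, features_per_round=1):
    """Grow a keyword -> weight profile from filtering samples, one window at a time.

    profile: dict of existing ontology keywords and weights.
    samples: list of tokenized filtering samples (lists of words).
    m: window size, i.e. how many new samples are observed per iteration.
    features_per_round: cap on features added per iteration, to limit noise.
    """
    profile = dict(profile)  # work on a copy, leave the caller's profile untouched
    for i in range(iterations):
        window = samples[i * m:(i + 1) * m]
        if not window:
            break  # no more samples to learn from
        counts = Counter(tok for doc in window for tok in doc)
        total = sum(counts.values())
        # Candidate features: terms not yet in the profile, most frequent first.
        candidates = [(t, c) for t, c in counts.most_common() if t not in profile]
        for term, count in candidates[:features_per_round]:
            profile[term] = count / total  # assumed weighting: relative frequency
    return profile


# Hypothetical samples, already tokenized and stripped of stop words.
samples = [
    ["prize", "prize", "win"],
    ["prize", "cash"],
    ["cash", "cash", "cash", "loan"],
    ["loan"],
]
learned = incremental_train({"lottery": 1.0}, samples, m=2)
```

Capping the number of features added per window is one simple way to honor the requirement that each iteration should not introduce too much noise.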
The text filtering module 12 extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts the entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold. Specifically, the text filtering module 12 further comprises a preprocessing module 120, a feature word extraction module 121, an entity relation extraction module 122, and a similarity calculation module 123. The preprocessing module 120 performs preprocessing operations such as stop-word removal on the text to be filtered. The feature word extraction module 121 extracts from the preprocessed text a feature vector expressing the content of the text. The entity relation extraction module 122 first identifies entities according to the feature vector of the extracted page and obtains the contextual features of the entities based on heuristic rules; it then builds a feature vector of the contextual feature words, quantifies the feature items with a feature frequency function, clusters the entity pairs with a k-means-based joint clustering algorithm, and finally labels the relations of the entity pairs, so that the text to be filtered is represented by a vector of labeled entity pairs and their relations. The similarity calculation module 123 calculates the similarity between the text to be filtered and the filtering model. Because the text to be filtered is represented by a feature vector of entity pairs and relations, and the filtering model is also a feature vector, the cosine of the angle between the two feature vectors can represent their similarity under the vector space model. The similarity Sim_j between the text to be filtered and the filtering model can thus be calculated, and texts that exceed the set threshold are filtered out.
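The first two submodules, preprocessing and feature word extraction, can be illustrated with a short sketch. Everything specific here is an assumption: the stop-word list, the English tokenizer (the patent does not describe word segmentation), the `top_k` cutoff, and relative term frequency as the feature weight.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize on letters and digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def feature_vector(tokens, top_k=5):
    """Keep the top_k most frequent terms, weighted by relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.most_common(top_k)}


tokens = preprocess("The prize is in the mail, claim the prize")
vector = feature_vector(tokens)
```

The resulting dictionary plays the role of the feature vector that the later entity relation extraction and similarity steps consume.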
Fig. 2 is a flow chart of the steps of the text filtering method of the present invention. As shown in Fig. 2, the text filtering method of the present invention comprises the following steps:
Step 201: establish a filtering model according to the user's filtering requirements. Specifically, first determine, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; then collect and analyze information within that domain, identify the key concepts and the relations between them, and express them in precise terms; finally, establish the ontology framework. In the preferred embodiment of the present invention, the ontology is represented as a triple Topic(C, P, S), where: C is the set of concept classes abstracted from the noun concepts of the filtering domain, each class sharing the same attributes and behavioral structure; P describes the attributes of the concepts and relations; and S represents the structural relations between classes, such as parent class and subclass. C is represented with a vector space model (VSM) as two-tuples C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword.
Step 202: train on a set of filtering samples to form an ontology library that approximates the user's filtering requirements. In the preferred embodiment of the present invention, the set of filtering samples is trained on using an incremental iterative method. A fixed value m is set as the window size for observing the number of newly appearing documents that need to be filtered; it is set flexibly according to the parameter n of the evaluation metric, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each iteration must be determined so as to avoid introducing too much noise. From the newly found effective feature values, a certain number are selected and added to the existing ontology library, enriching the model of the user's filtering requirements. As learning continues, the ontology library therefore comes closer and closer to the user's filtering requirements, and the necessary features of the ontology library gradually decrease.
Step 203: extract the feature words of the text to be filtered, identify the entities among the feature words, extract the entity relations to form an entity-relation-pair vector for the text to be filtered, calculate the similarity between the filtering model and the text to be filtered, and filter out texts whose similarity exceeds the similarity threshold.
Fig. 3 is a detailed flow chart of step 203 of Fig. 2 in the preferred embodiment of the present invention. As shown in Fig. 3, step 203 further comprises:
Step 301: perform preprocessing operations such as stop-word removal on the text to be filtered;
Step 302: extract from the preprocessed text a feature vector expressing the content of the text;
Step 303: extract the entity relations to form an entity-relation-pair vector for the text to be filtered; and
Step 304: calculate the similarity between the text to be filtered and the filtering model, and filter out texts whose similarity exceeds the similarity threshold. Because the text to be filtered is represented by a feature vector of entity pairs and relations, and the filtering model is also a feature vector, the cosine of the angle between the two feature vectors can represent their similarity under the vector space model. The similarity Sim_j between the text to be filtered and the filtering model can thus be calculated, and texts that exceed the set threshold are filtered out.
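Step 304 can be sketched in a few lines. The cosine computation itself is standard; the sparse dictionary representation, the function names `cosine_similarity` and `filter_texts`, and the example threshold of 0.5 are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_texts(profile, documents, threshold):
    """Split documents into (kept, removed); removed means similarity > threshold."""
    kept, removed = [], []
    for doc_id, vector in documents.items():
        if cosine_similarity(profile, vector) > threshold:
            removed.append(doc_id)
        else:
            kept.append(doc_id)
    return kept, removed


# Hypothetical filtering model and two candidate texts.
profile = {"prize": 1.0, "lottery": 1.0}
documents = {
    "d1": {"prize": 1.0, "lottery": 1.0},
    "d2": {"weather": 1.0},
}
kept, removed = filter_texts(profile, documents, threshold=0.5)
```

Here `d1` points in the same direction as the filtering model and is removed, while `d2` shares no features with it and is kept.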
Preferably, step 303 further comprises the following steps:
A. identify entities according to the feature vector of the extracted page;
B. obtain the contextual features of the entities based on heuristic rules;
C. build a feature vector of the contextual feature words, and quantify the feature items with a feature frequency function;
D. cluster the entity pairs with a k-means-based joint clustering algorithm; and
E. label the relations of the entity pairs. The text to be filtered is thus represented by a vector of labeled entity pairs and their relations.
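Steps A to E above can be sketched end to end on toy data. This is a simplified stand-in, not the patent's algorithm: entities come from a hypothetical lexicon, the heuristic context is simply the words between two entities in a sentence, the quantification is plain term frequency, the clustering is ordinary k-means (seeded with the first k vectors) rather than the joint variant the patent mentions, and each cluster is labeled with its most frequent context word.

```python
from collections import Counter
from itertools import combinations

ENTITY_LEXICON = {"alice", "bob", "acme", "globex"}  # hypothetical entity list


def extract_pairs(sentences):
    """Steps A-B: find entity pairs and the words between them (heuristic context)."""
    pairs = []
    for tokens in sentences:
        positions = [i for i, t in enumerate(tokens) if t in ENTITY_LEXICON]
        for i, j in combinations(positions, 2):
            pairs.append(((tokens[i], tokens[j]), tokens[i + 1:j]))
    return pairs


def quantify(contexts):
    """Step C: term-frequency vector per context over a shared vocabulary."""
    vocab = sorted({w for ctx in contexts for w in ctx})
    vectors = []
    for ctx in contexts:
        counts = Counter(ctx)
        vectors.append([counts[w] / max(len(ctx), 1) for w in vocab])
    return vocab, vectors


def kmeans(vectors, k, rounds=10):
    """Step D: plain k-means, using the first k vectors as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        for n, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[n] = dists.index(min(dists))
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def label_relations(pairs, labels):
    """Step E: name each cluster after its most frequent context word."""
    by_cluster = {}
    for (_, ctx), lab in zip(pairs, labels):
        by_cluster.setdefault(lab, Counter()).update(ctx)
    names = {lab: c.most_common(1)[0][0] for lab, c in by_cluster.items() if c}
    return [(pair, names.get(lab, "unknown")) for (pair, _), lab in zip(pairs, labels)]


sentences = [
    ["alice", "works", "at", "acme"],
    ["acme", "acquired", "globex"],
    ["bob", "works", "for", "globex"],
]
pairs = extract_pairs(sentences)
vocab, vectors = quantify([ctx for _, ctx in pairs])
labels = kmeans(vectors, k=2)
relations = label_relations(pairs, labels)
```

The labeled pairs in `relations` correspond to the representation the description ends with: entity pairs together with their labeled relation.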
As can be seen, the text filtering system and method of the present invention establish the filtering model with an ontology and, at the filtering stage, use entity relation extraction to label the entity relations among the feature words of the text to be filtered, so that the text can be expressed more accurately. The similarity between the text to be filtered and the filtering model is then calculated, and texts above the threshold are filtered out. Because the invention can express the user's filtering requirements precisely, it achieves higher filtering accuracy.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the invention. The scope of protection of the invention shall therefore be as set out in the claims.

Claims (7)

1. A text filtering system, comprising at least:
A filtering model establishing module, for establishing a filtering model according to the user's filtering requirements;
An adaptive learning module, which trains on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and
A text filtering module, which extracts the feature words of the text to be filtered, identifies the entities among the feature words, extracts the entity relations to form an entity-relation-pair vector for the text to be filtered, calculates the similarity between the filtering model and the text to be filtered, and filters out texts whose similarity exceeds the similarity threshold;
The text filtering module further comprises:
A preprocessing module, which performs the preprocessing operation of stop-word removal on the text to be filtered;
A feature word extraction module, which extracts from the preprocessed text a feature vector expressing the content of the text;
An entity relation extraction module, which first identifies entities according to the feature vector of the extracted page and obtains the contextual features of the entities based on heuristic rules, then builds a feature vector of the contextual feature words, quantifies the feature items with a feature frequency function, clusters the entity pairs with a k-means-based joint clustering algorithm, and finally labels the relations of the entity pairs; and
A similarity calculation module, which calculates the similarity between the text to be filtered and the filtering model and filters out texts whose similarity exceeds the similarity threshold.
2. The text filtering system as claimed in claim 1, characterized in that: the filtering model establishing module first determines, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover; it then collects and analyzes information within that domain, identifies the key concepts and the relations between them, and expresses them in precise terms; finally, it establishes the ontology framework.
3. The text filtering system as claimed in claim 2, characterized in that: the ontology is represented as a triple Topic(C, P, S), where C is the set of concept classes abstracted from the noun concepts of the filtering domain, each class sharing the same attributes and behavioral structure, and is represented with a vector space model; P describes the attributes of the concepts and relations; and S represents the structural relations between classes.
4. The text filtering system as claimed in claim 1, characterized in that: the adaptive learning module trains on the set of filtering samples using an incremental iterative method.
5. The text filtering system as claimed in claim 1, characterized in that: the similarity calculation module, following the vector space model, uses the cosine of the angle between two feature vectors to represent their similarity, calculates the similarity between the text to be filtered and the filtering model, and filters out texts that exceed the set threshold.
6. A text filtering method, comprising the following steps:
Step 1: establish a filtering model according to the user's filtering requirements;
Step 2: train on a set of filtering samples to form an ontology library that approximates the user's filtering requirements; and
Step 3: perform the preprocessing operation of stop-word removal on the text to be filtered; extract from the preprocessed text a feature vector expressing the content of the text; identify entities according to the feature vector of the extracted page; obtain the contextual features of the entities based on heuristic rules; build a feature vector of the contextual feature words, and quantify the feature items with a feature frequency function; cluster the entity pairs with a k-means-based joint clustering algorithm; label the relations of the entity pairs to form an entity-relation-pair vector for the text to be filtered; calculate the similarity between the filtering model and the text to be filtered; and filter out texts whose similarity exceeds the similarity threshold.
7. The text filtering method as claimed in claim 6, characterized in that step 1 further comprises the following steps:
According to the user's filtering requirements, determine the domain and scope that the ontology to be built will cover;
Within that domain, collect and analyze information, identify the key concepts and the relations between them, and express them in precise terms; and
Establish the ontology framework.
CN201210553556.4A 2012-12-18 2012-12-18 Text filtering system and method Active CN103034726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210553556.4A CN103034726B (en) 2012-12-18 2012-12-18 Text filtering system and method

Publications (2)

Publication Number Publication Date
CN103034726A CN103034726A (en) 2013-04-10
CN103034726B true CN103034726B (en) 2016-05-25

Family

ID=48021620

Country Status (1)

Country Link
CN (1) CN103034726B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224569B (en) 2014-06-30 2018-09-07 华为技术有限公司 Data filtering method, and method and device for constructing a data filter
CN104809176B (en) * 2015-04-13 2018-08-07 中央民族大学 Tibetan language entity relation extraction method
CN111611786B (en) * 2017-04-07 2023-03-21 创新先进技术有限公司 Text similarity calculation method and device
CN107329949B (en) * 2017-05-24 2021-01-01 北京捷通华声科技股份有限公司 Semantic matching method and system
CN110019771B (en) * 2017-07-28 2021-08-13 北京国双科技有限公司 Text processing method and device
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 Spoken retelling scoring method and system
CN108845988B (en) * 2018-06-07 2022-06-10 苏州大学 Entity identification method, device, equipment and computer readable storage medium
CN109325108B (en) * 2018-08-13 2022-05-27 北京百度网讯科技有限公司 Query processing method, device, server and storage medium
CN110188330B (en) * 2019-05-31 2021-07-16 腾讯科技(深圳)有限公司 Method and device for determining similar text information, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521402A (en) * 2011-12-23 2012-06-27 上海电机学院 Text filtering system and method
CN102591862A (en) * 2011-01-05 2012-07-18 华东师范大学 Control method and device of Chinese entity relationship extraction based on word co-occurrence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5392227B2 (en) * 2010-10-14 2014-01-22 株式会社Jvcケンウッド Filtering apparatus and filtering method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171016

Address after: 201306 116A26 room, No. 99 main building, West Road, West Lake, Nanhui, Pudong New Area, Shanghai

Patentee after: Cloud open source data technology (Shanghai) Co., Ltd.

Address before: 200240 Jiangchuan Road, Shanghai, No. 690, No.

Patentee before: Shanghai Dianji University