CN102521402B - Text filtering system and method - Google Patents

Info

Publication number
CN102521402B
Authority
CN
China
Prior art keywords
text
filtered
module
filtering
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110440801.6A
Other languages
Chinese (zh)
Other versions
CN102521402A (en)
Inventor
闫俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN201110440801.6A
Publication of CN102521402A
Application granted
Publication of CN102521402B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text filtering system and a text filtering method. The system at least comprises an ontology base construction module, an adaptive learning module and a text filtering module, wherein the ontology base construction module is used for constructing an ontology base according to the filtering requirements of a user; the adaptive learning module dynamically regulates the ontology base constructed by the ontology base construction module by performing training and learning on a group of filtering samples to make the ontology base gradually meet the filtering requirements of the user; and the text filtering module performs preprocessing, characteristic word set extraction and similarity matching on a text to be filtered to obtain relevance between the text to be filtered and an ontology, and filters the text to be filtered according to the relevance. By the system and the method, a filtering model for the user can be accurately expressed; and in the filtration, the filtering model expressed by the ontology for the user can be regulated by automatic learning, and a filtering threshold value can be dynamically regulated to achieve a good filtering effect.

Description

Text filtering system and method
Technical field
The present invention relates to a text filtering system and method, and in particular to an adaptive, ontology-based text filtering system and method.
Background art
Text filtering has long been an active research topic in information retrieval and filtering, and the literature, both in China and abroad, already describes many different ways of implementing it.
Current text filtering methods mainly include fuzzy-clustering text filtering based on genetic algorithms, text filtering using improved classification algorithms, text filtering using adaptive-learning filtering algorithms, and text filtering that relies on an ontology alone. The genetic-algorithm-based fuzzy clustering method directly clusters each individual in the population using a fuzzy similarity matrix and then evaluates the fitness of the population with a proposed fitness function according to the clustering result; its filtering precision depends on the quality of the clustering, and it cannot express the user's filtering requirements well. The improved-classification method filters out undesirable text by improving the traditional KNN algorithm at the data level; it likewise does not express the user's requirements accurately enough. The adaptive-learning method learns from a set of training samples and can adjust the filtering model, but its expression of the user's filtering requirements is also not accurate enough. For the ontology-only method, filtering precision depends on how the ontology is built: if the ontology base is constructed inaccurately, the precision of text filtering suffers greatly.
In summary, the known prior-art text filtering methods either express the user's requirements inaccurately or lose filtering precision because the ontology base is constructed inaccurately. An improved technique is therefore needed to solve these problems.
Summary of the invention
To overcome the above deficiencies of the prior art, the main object of the present invention is to provide a text filtering system and method that not only expresses the user's filtering model accurately, but also learns autonomously during filtering to adjust the user filtering model expressed by the ontology, and can dynamically adjust the filtering threshold to achieve a better filtering effect.
To achieve the above and other objects, the present invention provides a text filtering system, comprising at least:
an ontology base construction module, for constructing an ontology base according to the user's filtering requirements;
an adaptive learning module, which dynamically adjusts the ontology base constructed by the ontology base construction module by training on a group of filtering samples, so that the ontology base gradually approaches the user's filtering requirements; and
a text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance.
Further, the ontology base construction module comprises at least:
a domain determination module, for determining, according to the user's filtering requirements, the domain and scope that the ontology to be built covers;
a collection and analysis module, for collecting and analyzing information in the domain covered by the ontology, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
an ontology framework construction module, for building the ontology framework according to the result of the collection and analysis.
Further, the ontology is represented by the triple Topic (C, P, S), where C is the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S denotes the structural relations between classes, such as parent class and subclass.
Further, the adaptive learning module uses an incremental iterative method to train on a group of filtering samples and thereby dynamically adjust the ontology base constructed by the ontology base construction module.
Further, the text filtering module comprises at least:
a preprocessing module, for removing stop words from the text to be filtered;
a feature word set extraction module, for extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
a similarity calculation module, which calculates the relevance between the text to be filtered and the ontology according to the vector space model; and
a filtering module, which filters the text to be filtered according to the relevance and a set threshold.
Further, the filtering module filters out those texts to be filtered whose relevance is below the threshold.
To achieve the above and other objects, the present invention also provides a text filtering method, comprising at least the following steps:
constructing an ontology base according to the user's filtering requirements;
training on a group of filtering samples to dynamically adjust the constructed ontology base, so that it gradually approaches the user's filtering requirements; and
performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance.
Further, the step of constructing an ontology base according to the user's filtering requirements comprises at least the following steps:
determining, according to the user's filtering requirements, the domain and scope that the ontology to be built covers;
collecting and analyzing information in the domain covered by the ontology, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
building the ontology framework according to the result of the collection and analysis.
Further, the dynamic adjustment of the ontology base is implemented with an incremental iterative method.
Further, the step of filtering the text to be filtered comprises at least the following steps:
removing stop words from the text to be filtered;
extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
filtering the text to be filtered according to the relation between a set threshold and the relevance.
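Under the vector space model, relevance is commonly computed as the cosine of the angle between two weight vectors, which is also how the embodiment describes it. As a sketch only: writing w_k for the weight of the k-th feature word in the text to be filtered and o_k for the weight of the same word in the ontology (both symbols are notation assumed here for illustration, not taken from the patent), the relevance could be expressed as:

```latex
\mathrm{Sim} = \cos\theta
  = \frac{\sum_{k=1}^{n} w_k \, o_k}
         {\sqrt{\sum_{k=1}^{n} w_k^{2}} \; \sqrt{\sum_{k=1}^{n} o_k^{2}}}
```

With non-negative weights, Sim lies between 0 and 1, and a text whose Sim falls below the set threshold is filtered out.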
Compared with the prior art, the text filtering system and method of the present invention express the user's filtering requirements more accurately by constructing an ontology base. To bring the ontology base still closer to those requirements, the invention uses adaptive learning: by training on a group of samples it dynamically adjusts part of the ontology base, overcoming the shortcoming of the traditional feature-vector method and of conventional ontology-base construction, which express the user's requirements imprecisely and therefore yield low filtering precision. In addition, in the filtering stage the invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology base and filters out texts below the threshold; the filtering threshold can be dynamically adjusted to achieve a better filtering effect. Practice has shown that the adaptive, ontology-based text filtering method of the invention achieves higher filtering precision.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the text filtering system of the present invention;
Fig. 2 is a flow chart of the steps of the text filtering method of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described below through specific examples with reference to the accompanying drawings; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other specific examples, and the details in this specification may be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.
Fig. 1 is a system architecture diagram of the text filtering system of the present invention. As shown in Fig. 1, the text filtering system of the present invention comprises at least an ontology base construction module 10, an adaptive learning module 11 and a text filtering module 12.
The ontology base construction module 10 constructs the ontology base according to the user's filtering requirements and comprises at least a domain determination module 101, a collection and analysis module 102 and an ontology framework construction module 103. The domain determination module 101 first determines, according to the user's filtering requirements, the domain and scope that the ontology to be built covers. The collection and analysis module 102 collects and analyzes information in the domain covered by the ontology, defines the key concepts and the relations between concepts, and expresses them in precise terms. For example, in a preferred embodiment of the invention the ontology is represented by the triple Topic (C, P, S), where C is the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with the vector space model (VSM) as two-tuples C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword. The ontology framework construction module 103 builds the ontology framework from the collection and analysis result of the collection and analysis module 102.
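The triple Topic (C, P, S) and the two-tuples C_i(Key_i, Weight_i) can be pictured as a small data structure. The following is a minimal sketch, not the patent's implementation: the class names `Concept` and `Topic`, the field layout, and the helper `keyword_vector` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Concept:
    """One concept class; keywords maps Key_i -> Weight_i (the VSM two-tuples)."""
    name: str
    keywords: Dict[str, float] = field(default_factory=dict)


@dataclass
class Topic:
    """Hypothetical container for the ontology triple Topic(C, P, S)."""
    concepts: Dict[str, Concept] = field(default_factory=dict)           # C
    properties: Dict[str, Dict[str, str]] = field(default_factory=dict)  # P
    structure: Dict[str, str] = field(default_factory=dict)              # S: subclass -> parent

    def add_concept(self, concept: Concept, parent: Optional[str] = None, **props: str) -> None:
        """Register a concept, its attributes, and (optionally) its parent class."""
        self.concepts[concept.name] = concept
        self.properties[concept.name] = dict(props)
        if parent is not None:
            self.structure[concept.name] = parent

    def keyword_vector(self) -> Dict[str, float]:
        """Flatten all concepts into one weight vector, summing shared keywords."""
        vector: Dict[str, float] = {}
        for concept in self.concepts.values():
            for key, weight in concept.keywords.items():
                vector[key] = vector.get(key, 0.0) + weight
        return vector


topic = Topic()
topic.add_concept(Concept("sport", {"match": 0.5, "team": 0.5}))
topic.add_concept(Concept("football", {"goal": 1.0, "team": 0.25}),
                  parent="sport", domain="ball games")
print(topic.keyword_vector())
```

Flattening the concepts into a single keyword vector is one possible way to obtain the ontology-side vector that the similarity calculation compares against; the patent does not say how the vector is assembled.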
The adaptive learning module 11 trains on a group of filtering samples to dynamically adjust the ontology base constructed by the ontology base construction module 10, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the invention, the adaptive learning module 11 trains on the group of filtering samples with an incremental iterative method. A fixed value m is set as the window size for observing the number of newly arriving documents to be filtered; it is set flexibly according to the parameter n of the evaluation metric, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each iteration must be determined so as to avoid introducing excessive noise; according to the effective feature values added, a certain number of features are selected and added to the existing ontology base, enriching the model of the user's filtering requirements. As learning continues, the ontology base therefore comes ever closer to the user's filtering requirements, and the number of features the ontology base still needs gradually decreases.
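The incremental iteration can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: the function name `adapt_ontology`, the rule of picking the candidate words with the highest accumulated weight, and the averaging of a new word's weight over the window are all hypothetical. Only the fixed window, the cap on features added per iteration, and the default of 5 iterations come from the description.

```python
from collections import Counter


def adapt_ontology(ontology, samples, iterations=5, window=2, per_round=1):
    """Return a copy of `ontology` enriched from relevant training samples.

    ontology : dict mapping keyword -> weight
    samples  : list of (feature_dict, is_relevant) pairs
    window   : number of samples observed per iteration (stands in for m)
    per_round: maximum number of new feature words added per iteration,
               kept small to limit noise
    """
    updated = dict(ontology)
    for round_no in range(iterations):
        batch = samples[round_no * window:(round_no + 1) * window]
        if not batch:
            break  # no more samples to observe
        candidates = Counter()
        for features, is_relevant in batch:
            if not is_relevant:
                continue  # assumed: only relevant samples contribute features
            for word, weight in features.items():
                if word not in updated:
                    candidates[word] += weight
        # Assumed selection rule: keep the strongest new words of this window.
        for word, weight in candidates.most_common(per_round):
            updated[word] = weight / len(batch)
    return updated


seed = {"team": 0.5}
samples = [
    ({"goal": 0.8, "weather": 0.1}, True),
    ({"stock": 0.9}, False),
    ({"penalty": 0.6, "goal": 0.4}, True),
]
adapted = adapt_ontology(seed, samples)
print(sorted(adapted))
```

Capping `per_round` mirrors the description's concern that adding too many feature items in one iteration introduces noise; in this toy run the weak word "weather" and the irrelevant word "stock" never enter the ontology.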
The text filtering module 12 performs preprocessing, feature word set extraction and similarity matching on the text to be filtered, and filters it according to its relevance to the ontology. It comprises at least a preprocessing module 121, a feature word set extraction module 122, a similarity calculation module 123 and a filtering module 124. The preprocessing module 121 applies preprocessing operations such as stop-word removal to the text to be filtered. The feature word set extraction module 122 extracts the feature words that express the content of the text, assigns each a weight according to its position and frequency, and sums the weights of identical feature words to form the text feature word set T_i = {(Word_1k, Weight_1k)}, so that the text to be filtered is represented as a feature vector. The similarity calculation module 123 works from the vector space model, in which the cosine of the angle between two feature vectors represents their relevance, and thus calculates the relevance Sim_j between a text to be filtered and the ontology. The filtering module 124 then filters the text to be filtered according to the relevance Sim_j and the set threshold: texts whose relevance is below the threshold are filtered out.
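The preprocessing and feature word set extraction steps can be sketched as below. The stop-word list, the position weights (title counted three times as much as body) and the regular-expression tokenizer are assumptions chosen for a short English example; the patent gives no concrete values, and Chinese text would need a word segmenter instead.

```python
import re
from collections import defaultdict

# Illustrative values only; neither list nor weights come from the patent.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def extract_feature_words(sections):
    """Build a feature word set {word: weight} from {position: text}.

    Each occurrence of a word contributes the weight of the position it
    appears in, so frequency and position both raise the total, and the
    weights of identical words are summed.
    """
    features = defaultdict(float)
    for position, text in sections.items():
        position_weight = POSITION_WEIGHTS.get(position, 1.0)
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in STOP_WORDS:
                continue  # preprocessing: drop stop words
            features[word] += position_weight
    return dict(features)


doc = {"title": "Football final",
       "body": "The final of the football cup is in May"}
features = extract_feature_words(doc)
print(features)
```

The resulting dictionary plays the role of the feature vector T_i that is compared with the ontology in the similarity step.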
Fig. 2 is a flow chart of the steps of the text filtering method of the present invention. As shown in Fig. 2, the text filtering method of the present invention comprises at least the following steps:
Step 201: construct an ontology base according to the user's filtering requirements. In this step, the domain and scope that the ontology to be built covers are first determined according to the user's filtering requirements; information in the domain covered by the ontology is then collected and analyzed, the key concepts and the relations between concepts are defined, and they are expressed in precise terms; finally, the ontology framework is built. In a preferred embodiment of the invention, the ontology is represented by the triple Topic (C, P, S), where C is the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with the vector space model (VSM) as two-tuples C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword.
Step 202: train on a group of filtering samples to dynamically adjust the constructed ontology base, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the invention, the group of filtering samples is trained on with an incremental iterative method. A fixed value m is set as the window size for observing the number of newly arriving documents to be filtered; it is set flexibly according to the parameter n of the evaluation metric, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each iteration must be determined so as to avoid introducing excessive noise; according to the effective feature values added, a certain number of features are selected and added to the existing ontology base, enriching the model of the user's filtering requirements. As learning continues, the ontology base therefore comes ever closer to the user's filtering requirements, and the number of features the ontology base still needs gradually decreases.
Step 203: perform preprocessing, feature word set extraction and similarity matching on the text to be filtered, and filter it according to its relevance to the ontology. The detailed process is as follows. First, preprocessing operations such as stop-word removal are applied to the text to be filtered. Next, the feature words that express the content of the text are extracted, each is assigned a weight according to its position and frequency, and the weights of identical feature words are summed to form the text feature word set T_i = {(Word_1k, Weight_1k)}, so that the text to be filtered is represented as a feature vector. Then, according to the vector space model, the cosine of the angle between two feature vectors represents their relevance, from which the relevance Sim_j between a text to be filtered and the ontology is calculated. Finally, the text to be filtered is filtered according to the relation between the set threshold and the relevance Sim_j: texts whose relevance is below the threshold are filtered out.
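The similarity matching and threshold test of step 203 can be sketched as follows, assuming both the text and the ontology are already available as keyword-to-weight dictionaries. The function names, the sample vectors and the threshold of 0.5 are hypothetical; keeping a text whose relevance exactly equals the threshold is also an assumption, since the description only says that texts below the threshold are filtered out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def filter_texts(texts, ontology, threshold):
    """Return the names of texts whose relevance is not below the threshold."""
    kept = []
    for name, features in texts.items():
        if cosine_similarity(features, ontology) >= threshold:
            kept.append(name)
    return kept


ontology_vector = {"football": 1.0, "goal": 1.0}
texts = {
    "sports": {"football": 2.0, "goal": 1.0},
    "finance": {"stock": 3.0, "goal": 0.5},
}
kept = filter_texts(texts, ontology_vector, threshold=0.5)
print(kept)
```

Because the threshold is an ordinary parameter, it can be changed between runs, which is one simple way to realize the dynamically adjustable filtering threshold the description mentions.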
As can be seen, because an ontology gives explicit definitions of domain concepts and of the relations between concepts, the text filtering system and method of the present invention express the user's filtering requirements more accurately by constructing an ontology base. To bring the ontology base still closer to those requirements, the invention uses adaptive learning: by training on a group of samples it dynamically adjusts part of the ontology base, overcoming the shortcoming of the traditional feature-vector method and of conventional ontology-base construction, which express the user's requirements imprecisely and therefore yield low filtering precision. In addition, in the filtering stage the invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology base and filters out texts below the threshold, and the filtering threshold can be dynamically adjusted to achieve a better filtering effect. Practice has shown that the adaptive, ontology-based text filtering method of the invention achieves higher filtering precision.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore as set out in the claims.

Claims (8)

1. A text filtering system, comprising at least:
an ontology base construction module, for constructing an ontology base according to the user's filtering requirements;
an adaptive learning module, which dynamically adjusts the ontology base constructed by the ontology base construction module by training on a group of filtering samples, so that the ontology base gradually approaches the user's filtering requirements; and
a text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance;
wherein the text filtering module comprises at least:
a preprocessing module, for removing stop words from the text to be filtered;
a feature word set extraction module, for extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
a similarity calculation module, which calculates the relevance between the text to be filtered and the ontology according to the vector space model; and
a filtering module, which filters the text to be filtered according to the relevance and a set threshold.
2. The text filtering system as claimed in claim 1, characterized in that the ontology base construction module comprises at least:
a domain determination module, for determining, according to the user's filtering requirements, the domain and scope that the ontology to be built covers;
a collection and analysis module, for collecting and analyzing information in the domain covered by the ontology, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
an ontology framework construction module, for building the ontology framework according to the result of the collection and analysis.
3. The text filtering system as claimed in claim 2, characterized in that the ontology is represented by the triple Topic (C, P, S), where C is the set of concept classes, abstracted from the noun concepts of the filtering domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S denotes the structural relations between classes.
4. The text filtering system as claimed in claim 1, characterized in that the adaptive learning module uses an incremental iterative method to train on a group of filtering samples and thereby dynamically adjust the ontology base constructed by the ontology base construction module.
5. The text filtering system as claimed in claim 1, characterized in that the filtering module filters out those texts to be filtered whose relevance is below the threshold.
6. A text filtering method, comprising at least the following steps:
constructing an ontology base according to the user's filtering requirements;
training on a group of filtering samples to dynamically adjust the constructed ontology base, so that it gradually approaches the user's filtering requirements; and
performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance;
wherein the step of filtering the text to be filtered comprises at least the following steps:
removing stop words from the text to be filtered;
extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
filtering the text to be filtered according to the relation between a set threshold and the relevance.
7. The text filtering method as claimed in claim 6, characterized in that the step of constructing an ontology base according to the user's filtering requirements comprises at least the following steps:
determining, according to the user's filtering requirements, the domain and scope that the ontology to be built covers;
collecting and analyzing information in the domain covered by the ontology, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
building the ontology framework according to the result of the collection and analysis.
8. The text filtering method as claimed in claim 6, characterized in that the dynamic adjustment of the ontology base is implemented with an incremental iterative method.
CN201110440801.6A 2011-12-23 2011-12-23 Text filtering system and method Expired - Fee Related CN102521402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110440801.6A CN102521402B (en) 2011-12-23 2011-12-23 Text filtering system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110440801.6A CN102521402B (en) 2011-12-23 2011-12-23 Text filtering system and method

Publications (2)

Publication Number Publication Date
CN102521402A CN102521402A (en) 2012-06-27
CN102521402B true CN102521402B (en) 2014-02-19

Family

ID=46292315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110440801.6A Expired - Fee Related CN102521402B (en) 2011-12-23 2011-12-23 Text filtering system and method

Country Status (1)

Country Link
CN (1) CN102521402B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880636A (en) * 2012-08-03 2013-01-16 深圳证券信息有限公司 Bad information detection method and server
CN103034726B (en) * 2012-12-18 2016-05-25 上海电机学院 Text filtering system and method
CN103902619B (en) * 2012-12-28 2018-10-23 中国移动通信集团公司 A kind of network public-opinion monitoring method and system
CN105224569B (en) 2014-06-30 2018-09-07 华为技术有限公司 A kind of data filtering, the method and device for constructing data filter
CN104615714B (en) * 2015-02-05 2019-05-24 北京中搜云商网络技术有限公司 Blog article rearrangement based on text similarity and microblog channel feature
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751409A (en) * 2008-11-28 2010-06-23 上海电机学院 Application of immune system in search engine
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages
CN101901247A (en) * 2010-03-29 2010-12-01 北京师范大学 Vertical engine searching method and system for domain body restraint

Also Published As

Publication number Publication date
CN102521402A (en) 2012-06-27

Similar Documents

Publication Publication Date Title
CN103034726B (en) Text filtering system and method
CN102521402B (en) Text filtering system and method
CN103117060B (en) For modeling method, the modeling of the acoustic model of speech recognition
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN108932950B (en) Sound scene identification method based on label amplification and multi-spectral diagram fusion
CN107861939A (en) A kind of domain entities disambiguation method for merging term vector and topic model
CN108287858A (en) The semantic extracting method and device of natural language
CN108509425A (en) A kind of Chinese new word discovery method based on novel degree
CN106528532A (en) Text error correction method and device and terminal
CN109508379A (en) A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN106095791B (en) A kind of abstract sample information searching system based on context
CN102236639B (en) Update the system and method for language model
CN110717332B (en) News and case similarity calculation method based on asymmetric twin network
CN101127042A (en) Sensibility classification method based on language model
CN104008166A (en) Dialogue short text clustering method based on form and semantic similarity
CN108280164B (en) Short text filtering and classifying method based on category related words
CN102289522A (en) Method of intelligently classifying texts
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
US10387805B2 (en) System and method for ranking news feeds
CN104679738A (en) Method and device for mining Internet hot words
CN106844786A (en) A kind of public sentiment region focus based on text similarity finds method
CN103810162A (en) Method and system for recommending network information
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN106682123A (en) Hot event acquiring method and device
CN105956158B (en) The method that network neologisms based on massive micro-blog text and user information automatically extract

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140219

Termination date: 20161223