CN102521402B - Text filtering system and method - Google Patents
- Publication number
- CN102521402B (application number CN201110440801.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- filtered
- module
- filtering
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text filtering system and a text filtering method. The system at least comprises an ontology base construction module, an adaptive learning module and a text filtering module, wherein the ontology base construction module is used for constructing an ontology base according to the filtering requirements of a user; the adaptive learning module dynamically regulates the ontology base constructed by the ontology base construction module by performing training and learning on a group of filtering samples to make the ontology base gradually meet the filtering requirements of the user; and the text filtering module performs preprocessing, characteristic word set extraction and similarity matching on a text to be filtered to obtain relevance between the text to be filtered and an ontology, and filters the text to be filtered according to the relevance. By the system and the method, a filtering model for the user can be accurately expressed; and in the filtration, the filtering model expressed by the ontology for the user can be regulated by automatic learning, and a filtering threshold value can be dynamically regulated to achieve a good filtering effect.
Description
Technical field
The present invention relates to a text filtering system and method, and in particular to an adaptive, ontology-based text filtering system and method.
Background art
In the field of information retrieval and filtering, text filtering has long been a research hotspot. Many different methods for text filtering have been reported in the domestic and international literature.
Current text filtering methods mainly include fuzzy-clustering text filtering based on genetic algorithms, text filtering using improved classification algorithms, text filtering using adaptive-learning filtering algorithms, and text filtering that relies on an ontology alone. The genetic-algorithm-based fuzzy clustering method directly clusters the fuzzy similarity matrix for each individual in the population and then, based on the clustering result, evaluates the fitness of the population with a proposed fitness function; however, its filtering precision depends on the quality of the clustering, and it cannot express the user's filtering requirements well. Text filtering methods that use an improved classification algorithm to filter undesirable text information improve the traditional KNN algorithm from the data-layer perspective; their drawback is likewise that they do not express the user's requirements accurately enough. Text filtering methods that use an adaptive-learning filtering algorithm can learn adaptively by training on a sample set and can adjust the filtering model, but they too express the user's filtering requirements insufficiently accurately. For text filtering methods that rely on an ontology alone, the filtering precision depends on how the ontology is built; if the ontology library is not constructed accurately, the precision of text filtering suffers greatly.
In summary, known prior-art text filtering methods either express the user's requirements inaccurately or suffer from inaccurately constructed ontology libraries that degrade filtering precision. An improved technical means is therefore needed to solve this problem.
Summary of the invention
To overcome the above deficiencies of the prior art, the main object of the present invention is to provide a text filtering system and method that can not only express the user's filtering model accurately but also learn autonomously during filtering, adjusting the user filtering model expressed by the ontology and dynamically adjusting the filtering threshold to achieve a better filtering effect.
To achieve the above and other objects, the present invention provides a text filtering system comprising at least:
An ontology library construction module, for constructing an ontology library according to the user's filtering requirements;
An adaptive learning module, which dynamically adjusts the ontology library constructed by the ontology library construction module by training and learning on a group of filtering samples, so that the ontology library gradually approaches the user's filtering requirements; and
A text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance.
Further, the ontology library construction module comprises at least:
A domain determination module, for clarifying, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover;
A collection and analysis module, for collecting and analyzing information in the domain the ontology relates to, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
An ontology framework construction module, for building the ontology framework from the collection and analysis results.
Further, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that share the same attributes and behavioral structure, abstracted from the noun concepts of the filtering domain; P describes the attributes and relations of the concepts; and S represents the structural relations between classes, such as parent class and subclass.
Further, the adaptive learning module uses an incremental iterative method to train and learn on a group of filtering samples in order to dynamically adjust the ontology library constructed by the ontology library construction module.
Further, the text filtering module comprises at least:
A preprocessing module, for removing stop words from the text to be filtered;
A feature word set extraction module, for extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
A similarity calculation module, which calculates the relevance between the text to be filtered and the ontology according to the vector space model; and
A filtering module, which filters the text to be filtered according to the relevance and a set threshold.
Further, the filtering module filters out those texts to be filtered whose relevance is below the threshold.
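By way of illustration only, the preprocessing and feature word set extraction modules described above might be sketched as follows. The stop-word list, the position weights (`POSITION_WEIGHTS`), the whitespace tokenization and the function name `extract_feature_word_set` are all assumptions made for this sketch; the specification does not fix any of them, and Chinese text would additionally require word segmentation.

```python
from collections import defaultdict

# Hypothetical stop-word list and position weights; not specified in the patent.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def extract_feature_word_set(sections):
    """Build a {word: weight} feature word set from {position: text} sections.

    Every occurrence of a word contributes the weight of the position it
    appears in, so both position and frequency raise the final weight.
    Weights of identical words are summed.
    """
    features = defaultdict(float)
    for position, text in sections.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word and word not in STOP_WORDS:  # preprocessing: drop stop words
                features[word] += weight
    return dict(features)


doc = {"title": "Ontology filtering", "body": "The filtering of text in an ontology."}
feature_set = extract_feature_word_set(doc)
```

In this sketch a word that occurs once in the title and once in the body receives the sum of both position weights, which is one simple reading of "summing the weights of identical feature words".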
To achieve the above and other objects, the present invention also provides a text filtering method comprising at least the following steps:
Constructing an ontology library according to the user's filtering requirements;
Training and learning on a group of filtering samples to dynamically adjust the constructed ontology library, so that it gradually approaches the user's filtering requirements; and
Performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance.
Further, the step of constructing the ontology library according to the user's filtering requirements comprises at least the following steps:
Clarifying, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover;
Collecting and analyzing information in the domain the ontology relates to, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
Building the ontology framework from the collection and analysis results.
Further, the dynamic adjustment of the ontology library is carried out by an incremental iterative method.
Further, the step of filtering the text to be filtered comprises at least the following steps:
Removing stop words from the text to be filtered;
Extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
Calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
Filtering the text to be filtered according to the relation between a set threshold and the relevance.
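The last two steps can be sketched as follows, purely as an illustration. The specification states only that relevance is computed according to the vector space model; cosine similarity is assumed here as the usual measure in that model, and the function names, sample vectors and threshold value are hypothetical.

```python
import math


def cosine_similarity(doc_vec, onto_vec):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    shared = set(doc_vec) & set(onto_vec)
    dot = sum(doc_vec[t] * onto_vec[t] for t in shared)
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_onto = math.sqrt(sum(w * w for w in onto_vec.values()))
    if norm_doc == 0.0 or norm_onto == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_doc * norm_onto)


def filter_texts(docs, onto_vec, threshold):
    """Return the names of documents whose relevance reaches the threshold."""
    return [name for name, vec in docs.items()
            if cosine_similarity(vec, onto_vec) >= threshold]


# Hypothetical ontology concept vector and document feature word sets.
ontology_concepts = {"ontology": 1.0, "filtering": 1.0}
docs = {
    "relevant": {"ontology": 4.0, "filtering": 4.0, "text": 1.0},
    "unrelated": {"weather": 2.0, "rain": 1.0},
}
kept = filter_texts(docs, ontology_concepts, threshold=0.5)
```

Because the threshold is an ordinary parameter of `filter_texts`, it can be changed between calls, which is one way the dynamic threshold adjustment mentioned in this specification could be accommodated.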
Compared with the prior art, the text filtering system and method of the present invention express the user's filtering requirements more accurately by constructing an ontology library. To bring the ontology library still closer to the user's filtering requirements, the invention adopts adaptive learning: by training and learning on a group of samples, it dynamically adjusts parts of the ontology library. This overcomes the shortcoming of the traditional feature vector method and of conventional ontology library construction, which express the user's requirements imprecisely and therefore yield low filtering precision. In addition, at the filtering stage the invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology library, filters out texts whose similarity is below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that this adaptive, ontology-based text filtering method achieves higher filtering precision.
Brief description of the drawings
Fig. 1 is a system architecture diagram of a text filtering system of the present invention;
Fig. 2 is a flow chart of the steps of a text filtering method of the present invention.
Embodiments
Embodiments of the present invention are described below through specific examples with reference to the accompanying drawings; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific examples, and the details in this specification can be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.
Fig. 1 is a system architecture diagram of a text filtering system of the present invention. As shown in Fig. 1, the text filtering system comprises at least: an ontology library construction module 10, an adaptive learning module 11 and a text filtering module 12.
The ontology library construction module 10 constructs an ontology library according to the user's filtering requirements, and comprises at least a domain determination module 101, a collection and analysis module 102 and an ontology framework construction module 103. The domain determination module 101 first clarifies, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover. The collection and analysis module 102 collects and analyzes information in the domain the ontology relates to, defines the key concepts and the relations between concepts, and expresses them in precise terms. For example, in a preferred embodiment of the present invention, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that share the same attributes and behavioral structure, abstracted from the noun concepts of the filtering domain; P describes the attributes and relations of the concepts; and S represents the structural relations between classes, such as parent class and subclass. C is represented using the vector space model (VSM) as two-tuples $C_i(Key_i, Weight_i)$, where $Key_i$ denotes a keyword and $Weight_i$ denotes the weight of that keyword. The ontology framework construction module 103 builds the ontology framework from the collection and analysis results of the collection and analysis module 102.
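As a minimal sketch, the triple Topic(C, P, S) and the two-tuples that make up C could be held in a data structure such as the one below. The class names, field names and sample values are hypothetical; the specification defines only the meaning of C, P and S, not a concrete storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One two-tuple of C: a keyword and its weight."""
    key: str
    weight: float


@dataclass
class Topic:
    """Ontology triple Topic(C, P, S)."""
    concepts: list = field(default_factory=list)    # C: weighted concept keywords
    properties: dict = field(default_factory=dict)  # P: attributes and relations per concept
    structure: dict = field(default_factory=dict)   # S: subclass -> parent class

    def concept_vector(self):
        """Return C as a sparse {keyword: weight} vector for VSM matching."""
        return {c.key: c.weight for c in self.concepts}


# Hypothetical example ontology for a text filtering domain.
topic = Topic(
    concepts=[Concept("filtering", 0.8), Concept("ontology", 0.6)],
    properties={"filtering": {"applies_to": "text"}},
    structure={"text filtering": "filtering"},
)
vector = topic.concept_vector()
```

Exposing C as a keyword-to-weight mapping keeps it directly comparable with a text feature word set under the vector space model.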
The adaptive learning module 11 trains and learns on a group of filtering samples to dynamically adjust the ontology library constructed by the ontology library construction module 10, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the present invention, the adaptive learning module 11 trains on a group of filtering samples using an incremental iterative method: a fixed value m is set as the window size for observing the number of new documents to be filtered, the parameter n of the evaluation metric is set flexibly, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each round must be determined so as to avoid introducing excessive noise; a certain number of the added features are selected according to their effective feature values and added to the existing ontology library, enriching the model of the user's filtering requirements. As learning continues, the ontology library therefore comes closer and closer to the user's filtering requirements, and the features the ontology library still needs gradually decrease.
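The incremental iterative adjustment might be sketched as follows. This is an assumption-laden illustration, not the patented procedure: the scoring rule (average weight across relevant samples), the function name `adapt_ontology` and the parameter `features_per_round` are hypothetical, and the window size m and evaluation parameter n mentioned above are not modeled.

```python
def adapt_ontology(onto_vec, relevant_samples, iterations=5, features_per_round=2):
    """Incrementally add the highest-scoring new feature words to the ontology vector.

    onto_vec and each sample are sparse {word: weight} mappings.
    """
    updated = dict(onto_vec)
    for _ in range(iterations):
        # Score candidate words not yet in the ontology by their average
        # weight across the relevant training samples (assumed scoring rule).
        scores = {}
        for sample in relevant_samples:
            for word, weight in sample.items():
                if word not in updated:
                    scores[word] = scores.get(word, 0.0) + weight / len(relevant_samples)
        if not scores:
            break  # no candidate features remain
        # Cap the number of features added per round to limit noise.
        best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:features_per_round]
        updated.update(best)
    return updated


# Hypothetical training samples judged relevant by the user.
samples = [
    {"ontology": 4.0, "spam": 2.0, "noise": 0.5},
    {"spam": 4.0, "advert": 1.0},
]
adapted = adapt_ontology({"ontology": 1.0}, samples)
```

Each round has fewer unseen candidates than the last, which mirrors the observation that the features still needed by the ontology library decrease as learning continues.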
Fig. 2 is a flow chart of the steps of a text filtering method of the present invention. As shown in Fig. 2, the text filtering method comprises at least the following steps:
It can thus be seen that, because an ontology can give clear definitions of domain concepts and of the relations between concepts, the text filtering system and method of the present invention express the user's filtering requirements more accurately by constructing an ontology library. To bring the ontology library still closer to the user's filtering requirements, the invention adopts adaptive learning: by training and learning on a group of samples, it dynamically adjusts parts of the ontology library. This overcomes the shortcoming of the traditional feature vector method and of conventional ontology library construction, which express the user's requirements imprecisely and therefore yield low filtering precision. In addition, at the filtering stage the invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology library, filters out texts whose similarity is below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that this adaptive, ontology-based text filtering method achieves higher filtering precision.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the invention. The scope of protection of the present invention is therefore as set out in the claims.
Claims (8)
1. A text filtering system, comprising at least:
An ontology library construction module, for constructing an ontology library according to the user's filtering requirements;
An adaptive learning module, which dynamically adjusts the ontology library constructed by the ontology library construction module by training and learning on a group of filtering samples, so that the ontology library gradually approaches the user's filtering requirements; and
A text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance;
wherein the text filtering module comprises at least:
A preprocessing module, for removing stop words from the text to be filtered;
A feature word set extraction module, for extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
A similarity calculation module, which calculates the relevance between the text to be filtered and the ontology according to the vector space model; and
A filtering module, which filters the text to be filtered according to the relevance and a set threshold.
2. The text filtering system as claimed in claim 1, characterized in that the ontology library construction module comprises at least:
A domain determination module, for clarifying, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover;
A collection and analysis module, for collecting and analyzing information in the domain the ontology relates to, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
An ontology framework construction module, for building the ontology framework from the collection and analysis results.
3. The text filtering system as claimed in claim 2, characterized in that the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that share the same attributes and behavioral structure, abstracted from the noun concepts of the filtering domain; P describes the attributes and relations of the concepts; and S represents the structural relations between classes.
4. The text filtering system as claimed in claim 1, characterized in that the adaptive learning module uses an incremental iterative method to train and learn on a group of filtering samples in order to dynamically adjust the ontology library constructed by the ontology library construction module.
5. The text filtering system as claimed in claim 1, characterized in that the filtering module filters out those texts to be filtered whose relevance is below the threshold.
6. A text filtering method, comprising at least the following steps:
Constructing an ontology library according to the user's filtering requirements;
Training and learning on a group of filtering samples to dynamically adjust the constructed ontology library, so that it gradually approaches the user's filtering requirements; and
Performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance;
wherein the step of filtering the text to be filtered comprises at least the following steps:
Removing stop words from the text to be filtered;
Extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;
Calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
Filtering the text to be filtered according to the relation between a set threshold and the relevance.
7. The text filtering method as claimed in claim 6, characterized in that the step of constructing the ontology library according to the user's filtering requirements comprises at least the following steps:
Clarifying, according to the user's filtering requirements, the domain and scope that the ontology to be built will cover;
Collecting and analyzing information in the domain the ontology relates to, defining the key concepts and the relations between concepts, and expressing them in precise terms; and
Building the ontology framework from the collection and analysis results.
8. The text filtering method as claimed in claim 6, characterized in that the dynamic adjustment of the ontology library is carried out by an incremental iterative method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110440801.6A CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110440801.6A CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102521402A CN102521402A (en) | 2012-06-27 |
CN102521402B (en) | 2014-02-19 |
Family ID=46292315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110440801.6A Expired - Fee Related CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102521402B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880636A (en) * | 2012-08-03 | 2013-01-16 | 深圳证券信息有限公司 | Bad information detection method and server |
CN103034726B (en) * | 2012-12-18 | 2016-05-25 | 上海电机学院 | Text filtering system and method |
CN103902619B (en) * | 2012-12-28 | 2018-10-23 | 中国移动通信集团公司 | A kind of network public-opinion monitoring method and system |
CN105224569B (en) | 2014-06-30 | 2018-09-07 | 华为技术有限公司 | A kind of data filtering, the method and device for constructing data filter |
CN104615714B (en) * | 2015-02-05 | 2019-05-24 | 北京中搜云商网络技术有限公司 | Blog article rearrangement based on text similarity and microblog channel feature |
CN108428382A (en) * | 2018-02-14 | 2018-08-21 | 广东外语外贸大学 | It is a kind of spoken to repeat methods of marking and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751409A (en) * | 2008-11-28 | 2010-06-23 | 上海电机学院 | Application of immune system in search engine |
CN101794311A (en) * | 2010-03-05 | 2010-08-04 | 南京邮电大学 | Fuzzy data mining based automatic classification method of Chinese web pages |
CN101901247A (en) * | 2010-03-29 | 2010-12-01 | 北京师范大学 | Vertical engine searching method and system for domain body restraint |
Also Published As
Publication number | Publication date |
---|---|
CN102521402A (en) | 2012-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103034726B (en) | Text filtering system and method | |
CN102521402B (en) | Text filtering system and method | |
CN103117060B (en) | For modeling method, the modeling of the acoustic model of speech recognition | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
CN108932950B (en) | Sound scene identification method based on label amplification and multi-spectral diagram fusion | |
CN107861939A (en) | A kind of domain entities disambiguation method for merging term vector and topic model | |
CN108287858A (en) | The semantic extracting method and device of natural language | |
CN108509425A (en) | A kind of Chinese new word discovery method based on novel degree | |
CN106528532A (en) | Text error correction method and device and terminal | |
CN109508379A (en) | A kind of short text clustering method indicating and combine similarity based on weighted words vector | |
CN106095791B (en) | A kind of abstract sample information searching system based on context | |
CN102236639B (en) | Update the system and method for language model | |
CN110717332B (en) | News and case similarity calculation method based on asymmetric twin network | |
CN101127042A (en) | Sensibility classification method based on language model | |
CN104008166A (en) | Dialogue short text clustering method based on form and semantic similarity | |
CN108280164B (en) | Short text filtering and classifying method based on category related words | |
CN102289522A (en) | Method of intelligently classifying texts | |
CN107688576B (en) | Construction and tendency classification method of CNN-SVM model | |
US10387805B2 (en) | System and method for ranking news feeds | |
CN104679738A (en) | Method and device for mining Internet hot words | |
CN106844786A (en) | A kind of public sentiment region focus based on text similarity finds method | |
CN103810162A (en) | Method and system for recommending network information | |
CN109918648B (en) | Rumor depth detection method based on dynamic sliding window feature score | |
CN106682123A (en) | Hot event acquiring method and device | |
CN105956158B (en) | The method that network neologisms based on massive micro-blog text and user information automatically extract |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140219 Termination date: 20161223 |