CN102521402A - Text filtering system and method - Google Patents
Text filtering system and method
- Publication number
- CN102521402A (application CN201110440801.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- filtered
- module
- filtering
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text filtering system and a text filtering method. The system at least comprises an ontology base construction module, an adaptive learning module and a text filtering module, wherein the ontology base construction module is used for constructing an ontology base according to the filtering requirements of a user; the adaptive learning module dynamically regulates the ontology base constructed by the ontology base construction module by performing training and learning on a group of filtering samples to make the ontology base gradually meet the filtering requirements of the user; and the text filtering module performs preprocessing, characteristic word set extraction and similarity matching on a text to be filtered to obtain relevance between the text to be filtered and an ontology, and filters the text to be filtered according to the relevance. By the system and the method, a filtering model for the user can be accurately expressed; and in the filtration, the filtering model expressed by the ontology for the user can be regulated by automatic learning, and a filtering threshold value can be dynamically regulated to achieve a good filtering effect.
Description
Technical field
The present invention relates to a text filtering system and method, and in particular to an ontology-based adaptive text filtering system and method.
Background
Text filtering has long been a research focus in the field of information retrieval and filtering. Many different methods for text filtering have been reported in the literature, both in China and abroad.
Existing text filtering methods mainly include fuzzy clustering methods based on genetic algorithms, methods using improved classification algorithms, methods using adaptive learning filtering algorithms, and methods that rely on an ontology alone. The genetic-algorithm-based fuzzy clustering method directly clusters the fuzzy similarity matrix for each individual in the population and then uses a proposed fitness function to evaluate the fitness of the population from the clustering result; however, its filtering precision depends on the quality of the clustering, and it cannot express the user's filtering requirements well. Methods using an improved classification algorithm filter out undesirable text by improving the traditional KNN algorithm at the data level; their shortcoming is likewise that the user's requirements are not expressed accurately enough. Methods using an adaptive learning filtering algorithm can learn adaptively and adjust the filtering model by training on a sample set, but they too express the user's filtering requirements insufficiently accurately. For methods that rely on an ontology alone, filtering precision depends on how the ontology is built; if the ontology base is not built accurately enough, the precision of text filtering suffers greatly.
In summary, the text filtering methods of the prior art either express the user's requirements insufficiently accurately or lose filtering precision because the ontology base is not built accurately enough. An improved technical means is therefore needed to solve this problem.
Summary of the invention
To overcome the above deficiencies of the prior art, the main object of the present invention is to provide a text filtering system and method that not only expresses the user's filtering model accurately, but also learns autonomously during filtering to adjust the user filtering model expressed by the ontology, and can dynamically adjust the filtering threshold, so as to achieve a better filtering effect.
To achieve the above and other objects, the present invention provides a text filtering system comprising at least:
An ontology base construction module, configured to construct an ontology base according to the user's filtering requirements;
An adaptive learning module, which trains on a group of filtering samples to dynamically adjust the ontology base constructed by the ontology base construction module, so that it gradually approaches the user's filtering requirements; and
A text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance.
Further, the ontology base construction module comprises at least:
A domain determination module, configured to determine, according to the user's filtering requirements, the domain and scope to be covered by the ontology under construction;
A collection and analysis module, configured to collect and analyze information in the domain covered by the ontology, identify the key concepts and the relationships among them, and express them in precise terms; and
An ontology framework construction module, configured to construct the ontology framework according to the result of the collection and analysis.
Further, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes abstracted from the noun concepts of the filtering domain that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S represents the structural relations between classes, such as parent class and subclass.
Further, the adaptive learning module uses an incremental iterative method to train on a group of filtering samples and thereby dynamically adjust the ontology base constructed by the ontology base construction module.
Further, the text filtering module comprises at least:
A preprocessing module, configured to remove stop words from the text to be filtered;
A feature word set extraction module, configured to extract from the text to be filtered the feature words that express its content, assign each feature word a weight according to its position and frequency, and add together the weights of identical feature words to form the text feature word set;
A similarity calculation module, configured to calculate the relevance between the text to be filtered and the ontology according to the vector space model; and
A filtering module, configured to filter the text to be filtered according to the relevance and a preset threshold.
Further, the filtering module filters out those texts to be filtered whose relevance is lower than the threshold.
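The preprocessing and feature word set extraction described above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the stop-word list, the position weights (`title` counted three times as much as `body`) and the whitespace tokenization are all assumptions, since the text specifies none of them, and Chinese text would additionally need a word segmentation step.

```python
from collections import defaultdict

# Hypothetical stop-word list and position weights; the patent specifies neither.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "with"}
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def extract_feature_word_set(sections):
    """Build a {feature word: weight} mapping from {position: text} sections.

    Every occurrence contributes the weight of the position it appears in,
    so the summed weight reflects both position and frequency.
    """
    features = defaultdict(float)
    for position, text in sections.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in text.lower().split():
            word = token.strip(".,;:!?()\"'")
            if word and word not in STOP_WORDS:
                # Identical feature words have their weights added together.
                features[word] += weight
    return dict(features)


doc = {"title": "Ontology filtering", "body": "Filtering of text with an ontology."}
feature_set = extract_feature_word_set(doc)
```

Here `feature_set` maps each remaining word to its accumulated weight, which is the form a vector space model comparison expects.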
To achieve the above and other objects, the present invention further provides a text filtering method comprising at least the following steps:
Constructing an ontology base according to the user's filtering requirements;
Training on a group of filtering samples to dynamically adjust the constructed ontology base, so that it gradually approaches the user's filtering requirements; and
Performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance.
Further, the step of constructing an ontology base according to the user's filtering requirements further comprises at least the following steps:
Determining, according to the user's filtering requirements, the domain and scope to be covered by the ontology under construction;
Collecting and analyzing information in the domain covered by the ontology, identifying the key concepts and the relationships among them, and expressing them in precise terms; and
Constructing the ontology framework according to the result of the collection and analysis.
Further, the dynamic adjustment of the ontology base is carried out using an incremental iterative method.
Further, the step of filtering the text to be filtered further comprises at least the following steps:
Removing stop words from the text to be filtered;
Extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and adding together the weights of identical feature words to form the text feature word set;
Calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
Filtering the text to be filtered according to the relation between a preset threshold and the relevance.
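The last two steps can be illustrated with a short sketch. The method only states that relevance is calculated "according to the vector space model"; cosine similarity is assumed here because it is the usual similarity measure in that model, and the function names, sample vectors and threshold value are hypothetical.

```python
import math


def cosine_similarity(text_features, ontology_concepts):
    """Cosine of the angle between two sparse {keyword: weight} vectors."""
    shared = set(text_features) & set(ontology_concepts)
    dot = sum(text_features[k] * ontology_concepts[k] for k in shared)
    norm_t = math.sqrt(sum(w * w for w in text_features.values()))
    norm_o = math.sqrt(sum(w * w for w in ontology_concepts.values()))
    if norm_t == 0.0 or norm_o == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_t * norm_o)


def passes_filter(text_features, ontology_concepts, threshold):
    """Keep a text only if its relevance reaches the threshold."""
    return cosine_similarity(text_features, ontology_concepts) >= threshold


ontology = {"ontology": 0.8, "filtering": 0.6}   # hypothetical concept weights
relevant = {"ontology": 4.0, "filtering": 3.0}   # points the same way as the ontology
unrelated = {"football": 2.0}                     # shares no keyword with it
```

With these sample vectors, `passes_filter(relevant, ontology, 0.5)` keeps the text and `passes_filter(unrelated, ontology, 0.5)` filters it out. Because the threshold is an ordinary argument, it can be changed between calls, which is one simple way to support a dynamically adjusted threshold.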
Compared with the prior art, the text filtering system and method of the present invention can express the user's filtering requirements more accurately by constructing an ontology base. To further ensure that the ontology base approaches the user's filtering requirements, the present invention also adopts adaptive learning: by training on a group of samples, it dynamically adjusts parts of the ontology base, overcoming the low filtering precision that results when the traditional feature vector method and the conventional method of constructing an ontology base express the user's requirements inaccurately. In addition, in the filtering stage the present invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology base, filters out texts whose similarity is below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that the ontology-based adaptive text filtering method of the present invention achieves higher filtering precision.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the text filtering system of the present invention;
Fig. 2 is a flowchart of the steps of the text filtering method of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described below by way of specific examples with reference to the accompanying drawings; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other, different specific examples, and the details in this specification may be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the present invention.
Fig. 1 is a system architecture diagram of the text filtering system of the present invention. As shown in Fig. 1, the text filtering system of the present invention comprises at least an ontology base construction module 10, an adaptive learning module 11 and a text filtering module 12.
The ontology base construction module 10 constructs an ontology base according to the user's filtering requirements, and comprises at least a domain determination module 101, a collection and analysis module 102 and an ontology framework construction module 103. The domain determination module 101 first determines, according to the user's filtering requirements, the domain and scope to be covered by the ontology under construction. The collection and analysis module 102 collects and analyzes information in the domain covered by the ontology, identifies the key concepts and the relationships among them, and expresses them in precise terms. For example, in a preferred embodiment of the present invention, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes abstracted from the noun concepts of the filtering domain that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S represents the structural relations between classes, such as parent class and subclass. C is represented using the vector space model (VSM) as pairs $C_i(Key_i, Weight_i)$, where $Key_i$ denotes a keyword and $Weight_i$ denotes the weight of that keyword. The ontology framework construction module 103 constructs the ontology framework according to the collection and analysis result of the collection and analysis module 102.
The adaptive learning module 11 trains on a group of filtering samples to dynamically adjust the ontology base constructed by the ontology base construction module 10, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the present invention, the adaptive learning module 11 trains on a group of filtering samples using an incremental iterative method: a fixed value m is set as the window size for observing the number of newly arriving documents to be filtered, the parameter n is set flexibly according to the evaluation metric, and the number of training iterations is set to 5. During incremental iterative training, the number of feature items added in each iteration must be determined so as to avoid introducing excessive noise; from the effective feature values gained, a certain number are selected and added to the existing ontology base to enrich the model of the user's filtering requirements. As learning continues, the ontology base therefore comes ever closer to the user's filtering requirements, and the number of features it still requires gradually decreases.
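One way to picture the incremental iterative training is the sketch below. It is a simplified reading under stated assumptions, not the patented procedure: the selection rule (highest accumulated weight first), the weight given to a newly added word (its average over the samples) and the names `adapt_ontology` and `max_new_features` are hypothetical, and the window size m and parameter n are not modeled. Only the default of 5 iterations and the per-iteration cap on added features come from the description.

```python
def adapt_ontology(ontology, samples, iterations=5, max_new_features=2):
    """Incrementally add the strongest unseen feature words to the ontology.

    ontology: {keyword: weight}.
    samples: list of {feature word: weight} dicts taken from texts known to
    match the user's filtering requirements.
    """
    updated = dict(ontology)
    for _ in range(iterations):
        # Accumulate the weights of candidate words not yet in the ontology.
        candidates = {}
        for features in samples:
            for word, weight in features.items():
                if word not in updated:
                    candidates[word] = candidates.get(word, 0.0) + weight
        if not candidates:
            break  # nothing left to learn from this sample set
        # Cap the number of additions per iteration to limit noise.
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        for word, total in ranked[:max_new_features]:
            updated[word] = total / len(samples)
    return updated


base = {"ontology": 0.8}
samples = [
    {"ontology": 4.0, "filtering": 3.0, "semantic": 1.0},
    {"filtering": 2.0, "threshold": 1.0, "noise": 0.5},
]
learned = adapt_ontology(base, samples, iterations=2, max_new_features=1)
```

In this run the first iteration adds `filtering` and the second adds `semantic`; the low-weight `noise` term is never admitted, which mirrors the stated goal of limiting noise. Once every sample word is in the ontology, later iterations add nothing.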
Fig. 2 is a flowchart of the steps of the text filtering method of the present invention. As shown in Fig. 2, the text filtering method of the present invention comprises at least the following steps:
It can thus be seen that, because an ontology can give explicit definitions of domain concepts and of the relationships between concepts, the text filtering system and method of the present invention can express the user's filtering requirements more accurately by constructing an ontology base. To further ensure that the ontology base approaches the user's filtering requirements, the present invention also adopts adaptive learning: by training on a group of samples, it dynamically adjusts parts of the ontology base, overcoming the low filtering precision that results when the traditional feature vector method and the conventional method of constructing an ontology base express the user's requirements inaccurately. In addition, in the filtering stage the present invention uses the vector space model to calculate the similarity between the text to be filtered and the ontology base, filters out texts whose similarity is below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that the ontology-based adaptive text filtering method of the present invention achieves higher filtering precision.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify and vary the above embodiments without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall therefore be as set out in the claims.
Claims (10)
1. A text filtering system, comprising at least:
An ontology base construction module, configured to construct an ontology base according to the user's filtering requirements;
An adaptive learning module, which trains on a group of filtering samples to dynamically adjust the ontology base constructed by the ontology base construction module, so that it gradually approaches the user's filtering requirements; and
A text filtering module, which performs preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance.
2. The text filtering system as claimed in claim 1, characterized in that the ontology base construction module comprises at least:
A domain determination module, configured to determine, according to the user's filtering requirements, the domain and scope to be covered by the ontology under construction;
A collection and analysis module, configured to collect and analyze information in the domain covered by the ontology, identify the key concepts and the relationships among them, and express them in precise terms; and
An ontology framework construction module, configured to construct the ontology framework according to the result of the collection and analysis.
3. The text filtering system as claimed in claim 2, characterized in that the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes abstracted from the noun concepts of the filtering domain that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; and S represents the structural relations between classes, such as parent class and subclass.
4. The text filtering system as claimed in claim 1, characterized in that the adaptive learning module uses an incremental iterative method to train on a group of filtering samples and thereby dynamically adjust the ontology base constructed by the ontology base construction module.
5. The text filtering system as claimed in claim 1, characterized in that the text filtering module comprises at least:
A preprocessing module, configured to remove stop words from the text to be filtered;
A feature word set extraction module, configured to extract from the text to be filtered the feature words that express its content, assign each feature word a weight according to its position and frequency, and add together the weights of identical feature words to form the text feature word set;
A similarity calculation module, configured to calculate the relevance between the text to be filtered and the ontology according to the vector space model; and
A filtering module, configured to filter the text to be filtered according to the relevance and a preset threshold.
6. The text filtering system as claimed in claim 5, characterized in that the filtering module filters out those texts to be filtered whose relevance is lower than the threshold.
7. A text filtering method, comprising at least the following steps:
Constructing an ontology base according to the user's filtering requirements;
Training on a group of filtering samples to dynamically adjust the constructed ontology base, so that it gradually approaches the user's filtering requirements; and
Performing preprocessing, feature word set extraction and similarity matching on a text to be filtered to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance.
8. The text filtering method as claimed in claim 7, characterized in that the step of constructing an ontology base according to the user's filtering requirements further comprises at least the following steps:
Determining, according to the user's filtering requirements, the domain and scope to be covered by the ontology under construction;
Collecting and analyzing information in the domain covered by the ontology, identifying the key concepts and the relationships among them, and expressing them in precise terms; and
Constructing the ontology framework according to the result of the collection and analysis.
9. The text filtering method as claimed in claim 7, characterized in that the dynamic adjustment of the ontology base is carried out using an incremental iterative method.
10. The text filtering method as claimed in claim 7, characterized in that the step of filtering the text to be filtered further comprises at least the following steps:
Removing stop words from the text to be filtered;
Extracting from the text to be filtered the feature words that express its content, assigning each feature word a weight according to its position and frequency, and adding together the weights of identical feature words to form the text feature word set;
Calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
Filtering the text to be filtered according to the relation between a preset threshold and the relevance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110440801.6A CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110440801.6A CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102521402A (en) | 2012-06-27 |
CN102521402B (en) | 2014-02-19 |
Family
ID=46292315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110440801.6A Expired - Fee Related CN102521402B (en) | 2011-12-23 | 2011-12-23 | Text filtering system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102521402B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880636A (en) * | 2012-08-03 | 2013-01-16 | 深圳证券信息有限公司 | Bad information detection method and server |
CN103034726A (en) * | 2012-12-18 | 2013-04-10 | 上海电机学院 | Text filtering system and method |
CN103902619A (en) * | 2012-12-28 | 2014-07-02 | 中国移动通信集团公司 | Internet public opinion monitoring method and system |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
US9755616B2 (en) | 2014-06-30 | 2017-09-05 | Huawei Technologies Co., Ltd. | Method and apparatus for data filtering, and method and apparatus for constructing data filter |
CN108428382A (en) * | 2018-02-14 | 2018-08-21 | 广东外语外贸大学 | It is a kind of spoken to repeat methods of marking and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751409A (en) * | 2008-11-28 | 2010-06-23 | 上海电机学院 | Application of immune system in search engine |
CN101794311A (en) * | 2010-03-05 | 2010-08-04 | 南京邮电大学 | Fuzzy data mining based automatic classification method of Chinese web pages |
CN101901247A (en) * | 2010-03-29 | 2010-12-01 | 北京师范大学 | Vertical engine searching method and system for domain body restraint |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880636A (en) * | 2012-08-03 | 2013-01-16 | 深圳证券信息有限公司 | Bad information detection method and server |
CN103034726A (en) * | 2012-12-18 | 2013-04-10 | 上海电机学院 | Text filtering system and method |
CN103034726B (en) * | 2012-12-18 | 2016-05-25 | 上海电机学院 | Text filtering system and method |
CN103902619A (en) * | 2012-12-28 | 2014-07-02 | 中国移动通信集团公司 | Internet public opinion monitoring method and system |
CN103902619B (en) * | 2012-12-28 | 2018-10-23 | 中国移动通信集团公司 | A kind of network public-opinion monitoring method and system |
US9755616B2 (en) | 2014-06-30 | 2017-09-05 | Huawei Technologies Co., Ltd. | Method and apparatus for data filtering, and method and apparatus for constructing data filter |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN104615714B (en) * | 2015-02-05 | 2019-05-24 | 北京中搜云商网络技术有限公司 | Blog article rearrangement based on text similarity and microblog channel feature |
CN108428382A (en) * | 2018-02-14 | 2018-08-21 | 广东外语外贸大学 | It is a kind of spoken to repeat methods of marking and system |
Also Published As
Publication number | Publication date |
---|---|
CN102521402B (en) | 2014-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103034726B (en) | Text filtering system and method | |
CN107102989B (en) | Entity disambiguation method based on word vector and convolutional neural network | |
CN103117060B (en) | For modeling method, the modeling of the acoustic model of speech recognition | |
CN106528532B (en) | Text error correction method, device and terminal | |
CN110717332B (en) | News and case similarity calculation method based on asymmetric twin network | |
CN102521402B (en) | Text filtering system and method | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN104679738B (en) | Internet hot words mining method and device | |
CN107153658A (en) | A kind of public sentiment hot word based on weighted keyword algorithm finds method | |
CN102236639B (en) | Update the system and method for language model | |
CN107291886A (en) | A kind of microblog topic detecting method and system based on incremental clustering algorithm | |
CN105608200A (en) | Network public opinion tendency prediction analysis method | |
CN108280164B (en) | Short text filtering and classifying method based on category related words | |
CN103870474A (en) | News topic organizing method and device | |
CN102929861A (en) | Method and system for calculating text emotion index | |
CN104462286A (en) | Microblog topic finding method based on modified LDA | |
CN106844786A (en) | A kind of public sentiment region focus based on text similarity finds method | |
CN102968410A (en) | Text classification method based on RBF (Radial Basis Function) neural network algorithm and semantic feature selection | |
CN103870001A (en) | Input method candidate item generating method and electronic device | |
CN103810162A (en) | Method and system for recommending network information | |
CN109657058A (en) | A kind of abstracting method of notice information | |
CN106682123A (en) | Hot event acquiring method and device | |
CN106095791A (en) | A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof | |
CN108363784A (en) | A kind of public sentiment trend estimate method based on text machine learning | |
CN109472021A (en) | Critical sentence screening technique and device in medical literature based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20140219 Termination date: 20161223 |