CN102521402A - Text filtering system and method - Google Patents

Text filtering system and method

Info

Publication number
CN102521402A
CN102521402A CN2011104408016A CN201110440801A CN102521402A CN 102521402 A CN102521402 A CN 102521402A CN 2011104408016 A CN2011104408016 A CN 2011104408016A CN 201110440801 A CN201110440801 A CN 201110440801A CN 102521402 A CN102521402 A CN 102521402A
Authority
CN
China
Prior art keywords
text
filtering
ontology
filtered
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104408016A
Other languages
Chinese (zh)
Other versions
CN102521402B (en)
Inventor
闫俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN201110440801.6A
Publication of CN102521402A
Application granted
Publication of CN102521402B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text filtering system and method. The system comprises at least: an ontology library building module, which builds an ontology library according to the user's filtering requirements; an adaptive learning module, which trains on a set of filtering samples to dynamically adjust the ontology library built by the ontology library building module, so that it gradually approaches the user's filtering requirements; and a text filtering module, which preprocesses the text to be filtered, extracts its feature word set and performs similarity matching to obtain the relevance between the text to be filtered and the ontology, and filters the text according to that relevance. The invention can accurately express the user's filtering model, learn autonomously during filtering to adjust the ontology-based user filtering model, and dynamically adjust the filtering threshold to achieve a better filtering effect.

Description

Text filtering system and method

Technical field

The invention relates to a text filtering system and method, and in particular to an ontology-based adaptive text filtering system and method.

Background art

In the field of information retrieval and filtering, text filtering has long been a research hotspot. Many publications, both in China and abroad, have proposed different methods for implementing text filtering.

Current text filtering methods mainly include fuzzy-clustering text filtering based on genetic algorithms, text filtering using improved classification algorithms, text filtering using adaptive learning filtering algorithms, and text filtering that uses an ontology alone. The genetic-algorithm-based fuzzy clustering method directly clusters each individual in the population using a fuzzy similarity matrix and then evaluates the fitness of the population with a proposed fitness function according to the clustering result; however, its filtering accuracy depends on the quality of the clustering, and it cannot express the user's filtering requirements well. The method using an improved classification algorithm filters undesirable text information by improving the traditional KNN algorithm at the data level; it likewise expresses the user's requirements imprecisely. The method using an adaptive learning filtering algorithm can learn adaptively from a training sample set and adjust the filtering model, but its expression of the user's filtering requirements is also not precise enough. For the ontology-only method, filtering accuracy depends on how the ontology is built: if the ontology library is not constructed precisely enough, the accuracy of text filtering suffers greatly.

In summary, prior-art text filtering methods either express the user's requirements imprecisely or build the ontology library imprecisely, and in both cases filtering accuracy is degraded. It is therefore necessary to propose improved technical means to solve this problem.

Summary of the invention

To overcome the above deficiencies of the prior art, the main purpose of the present invention is to provide a text filtering system and method that can accurately express the user's filtering model, learn autonomously during filtering to adjust the ontology-based user filtering model, and dynamically adjust the filtering threshold to achieve a better filtering effect.

To achieve the above and other purposes, the present invention provides a text filtering system comprising at least:

an ontology library building module, which builds an ontology library according to the user's filtering requirements;

an adaptive learning module, which trains on a set of filtering samples to dynamically adjust the ontology library built by the ontology library building module, so that it gradually approaches the user's filtering requirements; and

a text filtering module, which preprocesses the text to be filtered, extracts its feature word set and performs similarity matching to obtain the relevance between the text to be filtered and the ontology, and filters the text to be filtered according to that relevance.

Further, the ontology library building module comprises at least:

a domain determination module, which identifies, according to the user's filtering requirements, the domain and scope to be covered by the ontology to be built;

a collection and analysis module, which collects and analyzes information within the domain covered by the ontology, identifies the key concepts and the relations between them, and expresses them in precise terms; and

an ontology framework building module, which builds the ontology framework from the results of collection and analysis.

Further, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that are abstracted from noun concepts in the filtering domain and share the same attribute and behavior structure; P describes the attributes of concepts and relations; and S denotes the structural relations between classes, such as parent class and subclass.

Further, the adaptive learning module uses an incremental iterative method to train on a set of filtering samples in order to dynamically adjust the ontology library built by the ontology library building module.

Further, the text filtering module comprises at least:

a preprocessing module, which removes stop words from the text to be filtered;

a feature word set extraction module, which extracts from the text to be filtered the feature words that express its content, assigns each a weight according to its position and frequency, and sums the weights of identical feature words to form the text feature word set;

a similarity calculation module, which computes the relevance between the text to be filtered and the ontology according to the vector space model; and

a filtering module, which filters the text to be filtered according to the relevance and a set threshold.

Further, the filtering module filters out those texts to be filtered whose relevance is below the threshold.

To achieve the above and other purposes, the present invention also provides a text filtering method comprising at least the following steps:

building an ontology library according to the user's filtering requirements;

training on a set of filtering samples to dynamically adjust the ontology library thus built, so that it gradually approaches the user's filtering requirements; and

preprocessing the text to be filtered, extracting its feature word set and performing similarity matching to obtain the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to that relevance.

Further, the step of building an ontology library according to the user's filtering requirements comprises at least the following steps:

identifying, according to the user's filtering requirements, the domain and scope to be covered by the ontology to be built;

collecting and analyzing information within the domain covered by the ontology, identifying the key concepts and the relations between them, and expressing them in precise terms; and

building the ontology framework from the results of collection and analysis.

Further, the dynamic adjustment of the ontology library is implemented with an incremental iterative method.

Further, the step of filtering the text to be filtered comprises at least the following steps:

removing stop words from the text to be filtered;

extracting from the text to be filtered the feature words that express its content, assigning each a weight according to its position and frequency, and summing the weights of identical feature words to form the text feature word set;

computing the relevance between the text to be filtered and the ontology according to the vector space model; and

filtering the text to be filtered according to the relationship between a set threshold and the relevance.
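The relevance computation named in the steps above can be written out explicitly. The detailed description states that, under the vector space model, the cosine of the angle between two feature vectors expresses their relevance. The formula below is a sketch of that computation; the symbols w_{ik} (weight of the k-th feature word in text T_i), v_{jk} (weight of the same word in ontology concept C_j), n (vocabulary size) and \theta (threshold) are notation assumed here for illustration, and treating a relevance exactly equal to the threshold as "keep" is likewise an assumption.

```latex
\mathrm{Sim}(T_i, C_j)
  = \frac{\sum_{k=1}^{n} w_{ik}\, v_{jk}}
         {\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} v_{jk}^{2}}},
\qquad
\text{keep } T_i \iff \mathrm{Sim}(T_i, C_j) \ge \theta
```

Because all weights are non-negative, the value lies between 0 (no shared feature words) and 1 (identical direction), which makes a single fixed or adjustable threshold meaningful across texts of different lengths.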

Compared with the prior art, the text filtering system and method of the present invention can express the user's filtering requirements fairly precisely by building an ontology library. To bring the ontology library still closer to the user's filtering requirements, the invention uses adaptive learning: by training on a set of samples, it dynamically adjusts part of the ontology library. This overcomes the low filtering accuracy that results when traditional feature-vector methods and general ontology-building methods express user requirements imprecisely. In addition, in the filtering stage the invention uses the vector space model to compute the similarity between the text to be filtered and the ontology library, filters out texts below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that this ontology-based adaptive text filtering method achieves relatively high filtering accuracy.

Brief description of the drawings

Fig. 1 is a system architecture diagram of a text filtering system of the present invention;

Fig. 2 is a flow chart of the steps of a text filtering method of the present invention.

Detailed description

Embodiments of the present invention are described below through specific examples and with reference to the accompanying drawings. Those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific examples, and the details in this specification can be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.

Fig. 1 is a system architecture diagram of a text filtering system of the present invention. As shown in Fig. 1, the text filtering system comprises at least an ontology library building module 10, an adaptive learning module 11 and a text filtering module 12.

The ontology library building module 10 builds an ontology library according to the user's filtering requirements. It comprises at least a domain determination module 101, a collection and analysis module 102 and an ontology framework building module 103. The domain determination module 101 first identifies, according to the user's filtering requirements, the domain and scope to be covered by the ontology to be built. The collection and analysis module 102 collects and analyzes information within the domain covered by the ontology, identifies the key concepts and the relations between them, and expresses them in precise terms. For example, in a preferred embodiment of the invention, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that are abstracted from noun concepts in the filtering domain and share the same attribute and behavior structure; P describes the attributes of concepts and relations; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with the vector space model (VSM) as pairs C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword. The ontology framework building module 103 builds the ontology framework from the collection and analysis results of the collection and analysis module 102.
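One way to picture the Topic(C, P, S) triple is as a small data structure. The sketch below is illustrative only: the class names `Concept` and `Topic`, the field layout, the `vector` helper and the sample keywords are all assumptions made here, since the patent describes the triple only abstractly.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Concept:
    """One concept class C_i, stored as weighted keywords (Key_i -> Weight_i)."""
    name: str
    keywords: Dict[str, float] = field(default_factory=dict)


@dataclass
class Topic:
    """Hypothetical container for the ontology triple Topic(C, P, S)."""
    concepts: Dict[str, Concept] = field(default_factory=dict)            # C
    properties: Dict[str, Dict[str, str]] = field(default_factory=dict)   # P
    structure: Dict[str, str] = field(default_factory=dict)               # S: child -> parent

    def vector(self) -> Dict[str, float]:
        """Flatten all concept keywords into one weight vector, summing duplicates."""
        merged: Dict[str, float] = {}
        for concept in self.concepts.values():
            for key, weight in concept.keywords.items():
                merged[key] = merged.get(key, 0.0) + weight
        return merged


# Illustrative data only; these keywords and weights are not from the patent.
topic = Topic()
topic.concepts["spam"] = Concept("spam", {"lottery": 0.9, "prize": 0.6})
topic.concepts["phishing"] = Concept("phishing", {"password": 0.8, "prize": 0.2})
topic.structure["phishing"] = "spam"  # "phishing" is a subclass of "spam"
```

Flattening C into a single keyword-to-weight mapping is a design choice of this sketch: it gives the ontology the same shape as a text feature vector, so the two can be compared directly in the vector space model.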

The adaptive learning module 11 trains on a set of filtering samples to dynamically adjust the ontology library built by the ontology library building module 10, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the invention, the adaptive learning module 11 trains on the set of filtering samples using an incremental iterative method. A fixed value m is set as the size of the window used to observe how many new documents that need to be filtered out appear; it is set flexibly according to the parameter n of the evaluation metric. The number of training iterations is set to 5. During incremental iterative training, the number of feature terms added in each iteration must be determined so as to avoid introducing more noise; according to the effective feature values gained, a certain number of terms are selected and added to the existing ontology library, enriching the user's filtering requirement model. Thus, with continued learning, the ontology library comes ever closer to the user's filtering requirements, and the features the ontology library requires gradually decrease in number.
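The incremental iteration described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's algorithm: the sample format (feature weights plus a relevance label), the scoring of candidate terms by summed weight, the `terms_per_round` cap and the weight given to a newly added term are all hypothetical. The window size m and metric parameter n are not modeled, because the source does not say how they are used.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed sample format: (feature word -> weight, is_relevant label)
Sample = Tuple[Dict[str, float], bool]


def adapt_ontology(
    ontology: Dict[str, float],
    samples: List[Sample],
    iterations: int = 5,        # the preferred embodiment uses 5 iterations
    terms_per_round: int = 2,   # assumed cap that limits noise per iteration
) -> Dict[str, float]:
    """Return a copy of `ontology` enriched with terms from relevant samples."""
    adapted = dict(ontology)
    for _ in range(iterations):
        # Score terms that occur in relevant samples but are not yet in the ontology.
        candidates: Counter = Counter()
        for features, is_relevant in samples:
            if not is_relevant:
                continue
            for word, weight in features.items():
                if word not in adapted:
                    candidates[word] += weight
        if not candidates:
            break  # nothing left to learn from this sample set
        # Add only the strongest few candidates in each round.
        for word, score in candidates.most_common(terms_per_round):
            adapted[word] = score / len(samples)  # assumed weighting rule
    return adapted
```

Capping the number of terms added per round mirrors the text's concern about noise: each iteration admits only the best-supported candidates, and the loop stops early once the samples offer no new terms.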

The text filtering module 12 preprocesses the text to be filtered, extracts its feature word set and performs similarity matching, and then filters the text according to its relevance to the ontology. It comprises at least a preprocessing module 121, a feature word set extraction module 122, a similarity calculation module 123 and a filtering module 124. The preprocessing module 121 applies preprocessing operations such as stop-word removal to the text to be filtered. The feature word set extraction module 122 extracts the feature words that express the content of the text, assigns each a weight according to its position and frequency, and sums the weights of identical feature words to form the text feature word set T_i = {(Word_1k, Weight_1k)}, so that the text to be filtered is represented as a feature vector. The similarity calculation module 123 relies on the vector space model, in which the cosine of the angle between two feature vectors expresses their relevance, to compute the relevance Sim_j between a text to be filtered and the ontology. The filtering module 124 then filters the text according to the relevance Sim_j and the set threshold; that is, texts whose relevance is below the threshold are filtered out.
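The preprocessing and feature word set extraction just described can be sketched in a few lines. Everything specific here is an assumption for illustration: the two positions (title and body), their weights, the tiny stop-word list and whitespace tokenization. The patent says only that weights depend on position and frequency; Chinese text would also need a word segmenter in place of `split()`.

```python
from typing import Dict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative list
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}                  # assumed values


def extract_feature_words(title: str, body: str) -> Dict[str, float]:
    """Build a feature word set: word -> summed position-based weight."""
    features: Dict[str, float] = {}
    for position, text in (("title", title), ("body", body)):
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if not word or word in STOP_WORDS:
                continue  # preprocessing: drop stop words and empty tokens
            # Each occurrence contributes its position weight, so frequency
            # and position are both reflected in the summed weight.
            features[word] = features.get(word, 0.0) + POSITION_WEIGHTS[position]
    return features


features = extract_feature_words("Win a prize", "The prize is a lottery prize.")
```

In this example "prize" appears once in the title and twice in the body, so its summed weight is 3.0 + 1.0 + 1.0 = 5.0, while "win" gets 3.0 and "lottery" gets 1.0.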

Fig. 2 is a flow chart of the steps of a text filtering method of the present invention. As shown in Fig. 2, the text filtering method comprises at least the following steps:

Step 201: build an ontology library according to the user's filtering requirements. In this step, the domain and scope to be covered by the ontology are first identified according to the user's filtering requirements. Information is then collected and analyzed within the domain covered by the ontology, and the key concepts and the relations between them are identified and expressed in precise terms. Finally, the ontology framework is built. In a preferred embodiment of the invention, the ontology is represented by a triple Topic(C, P, S), where C is the set of concept classes that are abstracted from noun concepts in the filtering domain and share the same attribute and behavior structure; P describes the attributes of concepts and relations; and S denotes the structural relations between classes, such as parent class and subclass. C is represented with the vector space model (VSM) as pairs C_i(Key_i, Weight_i), where Key_i is a keyword and Weight_i is the weight of that keyword.

Step 202: train on a set of filtering samples to dynamically adjust the ontology library thus built, so that it gradually approaches the user's filtering requirements. In a preferred embodiment of the invention, the set of filtering samples is trained on using an incremental iterative method. A fixed value m is set as the size of the window used to observe how many new documents that need to be filtered out appear; it is set flexibly according to the parameter n of the evaluation metric. The number of training iterations is set to 5. During incremental iterative training, the number of feature terms added in each iteration must be determined so as to avoid introducing more noise; according to the effective feature values gained, a certain number of terms are selected and added to the existing ontology library, enriching the user's filtering requirement model. Thus, with continued learning, the ontology library comes ever closer to the user's filtering requirements, and the features the ontology library requires gradually decrease in number.

Step 203: preprocess the text to be filtered, extract its feature word set and perform similarity matching, then filter the text according to its relevance to the ontology. The process is as follows. First, preprocessing operations such as stop-word removal are applied to the text to be filtered. Next, the feature words that express the content of the text are extracted and weighted according to their position and frequency, and the weights of identical feature words are summed to form the text feature word set T_i = {(Word_1k, Weight_1k)}, so that the text to be filtered is represented as a feature vector. Then, because under the vector space model the cosine of the angle between two feature vectors expresses their relevance, the relevance Sim_j between a text to be filtered and the ontology is computed. Finally, the text is filtered according to the relationship between the set threshold and the relevance Sim_j; that is, texts whose relevance is below the threshold are filtered out.
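The last two stages of step 203, cosine relevance and threshold filtering, can be illustrated as below. This is a sketch under assumptions: texts and the ontology are both given as word-to-weight dictionaries, the function names are hypothetical, and a text whose relevance equals the threshold is kept (the source says only that texts below the threshold are filtered out).

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def filter_texts(
    texts: List[Tuple[str, Dict[str, float]]],
    ontology: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep (text_id, relevance) pairs whose relevance reaches the threshold."""
    kept = []
    for text_id, features in texts:
        relevance = cosine_similarity(features, ontology)
        if relevance >= threshold:  # texts below the threshold are filtered out
            kept.append((text_id, relevance))
    return kept


ontology = {"lottery": 1.0, "prize": 1.0}
texts = [("a", {"lottery": 2.0, "prize": 2.0}), ("b", {"weather": 1.0})]
kept = filter_texts(texts, ontology, threshold=0.5)
```

Text "a" points in the same direction as the ontology vector, so its relevance is approximately 1 and it is kept; text "b" shares no words with the ontology, scores 0 and is filtered out. Passing `threshold` as an argument leaves room for the dynamic threshold adjustment the patent mentions, which is not specified in enough detail to sketch here.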

It can be seen that, because an ontology can define domain concepts and the relations between them explicitly, the text filtering system and method of the present invention can express the user's filtering requirements fairly precisely by building an ontology library. To bring the ontology library still closer to the user's filtering requirements, the invention uses adaptive learning: by training on a set of samples, it dynamically adjusts part of the ontology library. This overcomes the low filtering accuracy that results when traditional feature-vector methods and general ontology-building methods express user requirements imprecisely. In addition, in the filtering stage the invention uses the vector space model to compute the similarity between the text to be filtered and the ontology library, filters out texts below the threshold, and can dynamically adjust the filtering threshold to achieve a better filtering effect. Practice has shown that this ontology-based adaptive text filtering method achieves relatively high filtering accuracy.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore as set out in the claims.

Claims (10)

1.一种文本过滤系统,至少包括:1. A text filtering system comprising at least: 本体库建立模组,用于根据用户的过滤需求建立本体库;Ontology library building module, used to build ontology library according to user's filtering requirements; 自适应学习模组,通过对一组过滤样本进行训练学习以对该本体库建立模组建立的本体库动态调整,使其逐渐接近于用户的过滤需求;以及The self-adaptive learning module dynamically adjusts the ontology library established by the ontology library building module by training and learning a group of filtering samples, making it gradually approach the user's filtering requirements; and 文本过滤模组,通过对待过滤文本进行预处理、抽取特征词集及相似度匹配处理后,获得该待过滤文本与本体的相关度,并根据该相关度对该待过滤文本进行过滤。The text filtering module obtains the degree of correlation between the text to be filtered and the ontology after preprocessing the text to be filtered, extracting the feature word set and matching the similarity, and filters the text to be filtered according to the degree of correlation. 2.如权利要求1所述的文本过滤系统,其特征在于,该本体库建立模组至少包括:2. text filter system as claimed in claim 1, is characterized in that, this ontology storehouse builds module to comprise at least: 领域确定模组,用于根据用户的过滤需求,明确要构建的本体所覆盖的领域和范围以确定本体的领域与范围;The domain determination module is used to clarify the domain and scope covered by the ontology to be built according to the user's filtering requirements, so as to determine the domain and scope of the ontology; 收集分析模组,用于在本体所涉及的领域范围内进行信息的收集和分析,明确重点概念和概念之间的关系,并且用精确的术语表达;以及The collection and analysis module is used to collect and analyze information within the scope of the ontology, clarify key concepts and the relationship between concepts, and express them in precise terms; and 本体框架建立模组,用于根据收集分析结果建立本体框架。The ontology frame building module is used to build the ontology frame according to the collection and analysis results. 3.如权利要求2所述的文本过滤系统,其特征在于:该本体采取三元组Topic(C,P,S)来表示,其中,C表示由过滤领域内的名词概念抽象出来,具有相同属性和行为结构的概念类的集合;P描述概念和关系的属性;S表示类之间的结构关系,如父类、子类等。3. 
The text filtering system as claimed in claim 2, characterized in that: the ontology is represented by a triple Topic (C, P, S), wherein C represents the set of concept classes abstracted from the noun concepts of the filtering domain, each class having the same attributes and behavioral structure; P describes the attributes of the concepts and their relationships; and S represents the structural relationships between classes, such as parent class and subclass.

4. The text filtering system as claimed in claim 1, characterized in that: the adaptive learning module uses an incremental iterative method to train on a set of filtering samples so as to dynamically adjust the ontology library established by the ontology library building module.

5. The text filtering system as claimed in claim 1, characterized in that the text filtering module comprises at least:
a preprocessing module, for removing stop words from the text to be filtered;
a feature word set extraction module, for extracting from the text to be filtered the feature words that express the text content, assigning corresponding weights according to the position and frequency of each feature word, and summing the weight values of identical feature words to form a text feature word set;
a similarity calculation module, which calculates the relevance between the text to be filtered and the ontology according to the vector space model; and
a filtering module, which filters the text to be filtered according to the relevance and a set threshold.

6. The text filtering system as claimed in claim 5, characterized in that: the filtering module filters out those texts to be filtered whose relevance is below the threshold.

7. A text filtering method, comprising at least the following steps:
building an ontology library according to the user's filtering requirements;
training on a set of filtering samples to dynamically adjust the established ontology library, so that it gradually approaches the user's filtering requirements; and
after preprocessing the text to be filtered, extracting its feature word set, and performing similarity matching, obtaining the relevance between the text to be filtered and the ontology, and filtering the text to be filtered according to the relevance.

8. The text filtering method as claimed in claim 7, characterized in that the step of building an ontology library according to the user's filtering requirements further comprises at least the following steps:
determining, according to the user's filtering requirements, the domain and scope to be covered by the ontology to be constructed;
collecting and analyzing information within the domain covered by the ontology, identifying the key concepts and the relationships between them, and expressing them in precise terms; and
establishing an ontology framework based on the results of the collection and analysis.

9. The text filtering method as claimed in claim 7, characterized in that: the dynamic adjustment of the ontology library is implemented by an incremental iterative method.

10. The text filtering method as claimed in claim 7, characterized in that the step of filtering the text to be filtered further comprises at least the following steps:
removing stop words from the text to be filtered;
extracting the feature words that express the text content from the text to be filtered, assigning corresponding weights according to the position and frequency of each feature word, and summing the weight values of identical feature words to form a text feature word set;
calculating the relevance between the text to be filtered and the ontology according to the vector space model; and
filtering the text to be filtered according to the relationship between a set threshold and the relevance.
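The filtering steps recited in claims 5 and 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, the position weights, the flat term-weight vector standing in for the ontology, the choice of cosine similarity as the vector space model measure, and all function names are assumptions made for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical values: the claims specify neither a stop-word list
# nor concrete position weights.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}


def extract_features(sections):
    """Build a weighted feature-word set from a {position: text} mapping.

    Every occurrence of a word contributes the weight of the position it
    appears in, and the weights of identical words are summed, so both
    position and frequency influence the final weight.
    """
    features = defaultdict(float)
    for position, text in sections.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:  # stop-word removal
                features[word] += POSITION_WEIGHT.get(position, 1.0)
    return dict(features)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_filter(sections, ontology_vector, threshold):
    """Return (keep, relevance); keep is False when relevance < threshold."""
    relevance = cosine_similarity(extract_features(sections), ontology_vector)
    return relevance >= threshold, relevance


# Usage: an assumed ontology vector and a two-section document.
ontology_vector = {"ontology": 2.0, "filtering": 2.0, "text": 1.0}
document = {"title": "Text filtering", "body": "Ontology based filtering of text"}
keep, relevance = passes_filter(document, ontology_vector, threshold=0.5)
```

In this sketch a word found in the title counts three times as much as one in the body, which is one simple way to let position affect the weight. An adaptive variant would adjust `ontology_vector` and `threshold` as training samples arrive, but the claims do not fix that update rule, so it is left out here.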
CN201110440801.6A 2011-12-23 2011-12-23 Text filtering system and method Expired - Fee Related CN102521402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110440801.6A CN102521402B (en) 2011-12-23 2011-12-23 Text filtering system and method

Publications (2)

Publication Number Publication Date
CN102521402A (en) 2012-06-27
CN102521402B (en) 2014-02-19

Family

ID=46292315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110440801.6A Expired - Fee Related CN102521402B (en) 2011-12-23 2011-12-23 Text filtering system and method

Country Status (1)

Country Link
CN (1) CN102521402B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880636A (en) * 2012-08-03 2013-01-16 深圳证券信息有限公司 Bad information detection method and server
CN103034726A (en) * 2012-12-18 2013-04-10 上海电机学院 Text filtering system and method
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
US9755616B2 (en) 2014-06-30 2017-09-05 Huawei Technologies Co., Ltd. Method and apparatus for data filtering, and method and apparatus for constructing data filter
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751409A (en) * 2008-11-28 2010-06-23 上海电机学院 Application of immune system in search engine
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages
CN101901247A (en) * 2010-03-29 2010-12-01 北京师范大学 A vertical search engine method and system constrained by domain ontology

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880636A (en) * 2012-08-03 2013-01-16 深圳证券信息有限公司 Bad information detection method and server
CN103034726A (en) * 2012-12-18 2013-04-10 上海电机学院 Text filtering system and method
CN103034726B (en) * 2012-12-18 2016-05-25 上海电机学院 Text filtering system and method
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
CN103902619B (en) * 2012-12-28 2018-10-23 中国移动通信集团公司 A kind of network public-opinion monitoring method and system
US9755616B2 (en) 2014-06-30 2017-09-05 Huawei Technologies Co., Ltd. Method and apparatus for data filtering, and method and apparatus for constructing data filter
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
CN104615714B (en) * 2015-02-05 2019-05-24 北京中搜云商网络技术有限公司 Blog article rearrangement based on text similarity and microblog channel feature
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system

Also Published As

Publication number Publication date
CN102521402B (en) 2014-02-19

Similar Documents

Publication Publication Date Title
CN103034726B (en) Text filtering system and method
CN102289522B (en) Method of intelligently classifying texts
CN109902289B (en) News video theme segmentation method oriented to fuzzy text mining
CN106960025B (en) A personalized document recommendation method based on domain knowledge graph
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN107766324A (en) A kind of text coherence analysis method based on deep neural network
CN107423339A (en) Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
CN104866558B (en) A kind of social networks account mapping model training method and mapping method and system
CN102521402A (en) Text filtering system and method
CN107291886A (en) A kind of microblog topic detecting method and system based on incremental clustering algorithm
CN102929861A (en) Method and system for calculating text emotion index
CN102779510A (en) Speech emotion recognition method based on feature space self-adaptive projection
CN111597328B (en) New event theme extraction method
CN107679031B (en) Advertisement and blog identification method based on stacking noise reduction self-coding machine
CN112132096B (en) Behavior modal identification method of random configuration network for dynamically updating output weight
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN112347761B (en) BERT-based drug relation extraction method
CN108804651A (en) A kind of Social behaviors detection method based on reinforcing Bayes's classification
CN108710609A (en) A kind of analysis method of social platform user information based on multi-feature fusion
CN110457711A (en) A topic recognition method for social media events based on keywords
CN103778206A (en) Method for providing network service resources
CN109697288A (en) A kind of example alignment schemes based on deep learning
CN112489689A (en) Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140219

Termination date: 20161223
