TWI509434B - Methods and apparatus for classification - Google Patents

Methods and apparatus for classification

Info

Publication number
TWI509434B
TWI509434B (application TW099112962A)
Authority
TW
Taiwan
Prior art keywords
vocabulary
category
vector
distance
normalized
Prior art date
Application number
TW099112962A
Other languages
Chinese (zh)
Other versions
TW201137641A (en)
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to TW099112962A
Publication of TW201137641A
Application granted
Publication of TWI509434B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

Method and apparatus for text classification

The present application relates to the fields of computers and communications, and in particular to a method and apparatus for text classification.

Text classification is an important part of text mining. It refers to assigning a category, drawn from a set of predefined topic categories, to each document in a document collection. Categorizing documents with an automatic text classification system helps people find the information and knowledge they need. Classification is one of the most basic ways in which people make sense of information. Traditional document classification research has produced rich results and reached a considerable level of practicality. However, with the rapid growth of textual information, and in particular the explosion of online text on the Internet, automatic text classification has become a key technology for processing and organizing large volumes of documents. Text classification is now widely applied in many fields. As the amount of information keeps increasing, users demand ever higher precision and recall from content search, so the demand for text classification technology has grown greatly, and how to build an effective text classification system remains a major research direction in text mining.

In the field of natural language processing, text is mainly represented with the vector space model (VSM). In this approach, each text is regarded as containing a set of independent attributes, expressed by concept words, that reveal its content, and each attribute can be viewed as one dimension of a concept space. These independent attributes are called text features, and a text can then be represented as a collection of such features. The similarity between feature vectors is usually measured by the cosine of the angle between them, and the category of a text is determined by how close its vector is to the feature vectors of the candidate categories.

In the prior art, the similarity between each text vector and every feature vector of the candidate categories must be computed, and each computation uses the angle cosine, so the amount of computation is very large. Moreover, the prior art places no constraint on the semantics of the text, so its classification accuracy is not very good.

Embodiments of the present application provide a method and apparatus for text classification, which are used to implement text classification, simplify the classification operation, and improve the accuracy of text classification.

A method for text classification includes the following steps: segmenting the obtained text content to obtain a plurality of words; for each of the obtained words, determining the word vector of that word in a spherical space model, where the word vector of a word consists of the normalized word frequency values obtained by normalizing the word's frequency values over the categories; the spherical space model is a multi-dimensional sphere whose radius is a unit length, the dimensionality of the spherical space equals the number of categories, and each category corresponds to a category vector in the spherical space; for each category, determining the distance from the sum of the word vectors of the obtained words to the category vector of that category; and classifying the text into the category corresponding to the shortest distance.
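
As a concrete illustration of the flow just summarized, the following is a minimal Python sketch of the four steps. The word segmenter, the table of normalized word-frequency vectors, and the category names are hypothetical placeholders introduced only for this example; they are not part of the patent.

```python
import numpy as np

# Hypothetical word segmenter and vocabulary table (one N-dimensional vector of
# normalized word frequency values per word, N = number of categories).
def segment(text):
    return text.split()

categories = ["electronics", "sports", "books"]
vocab_vectors = {
    "camera": np.array([0.90, 0.30, 0.3162]),   # sum of squares ~= 1
    "lens":   np.array([0.80, 0.20, 0.5657]),
}

def classify(text):
    words = segment(text)                                              # step 1: segmentation
    vectors = [vocab_vectors[w] for w in words if w in vocab_vectors]  # step 2: word vectors
    doc = np.sum(vectors, axis=0)                                      # step 3: sum of word vectors
    eye = np.eye(len(categories))                                      # one-hot category vectors
    dists = np.linalg.norm(doc - eye, axis=1)                          # step 4: distance to each category
    return categories[int(np.argmin(dists))]                           # shortest distance wins

print(classify("camera lens"))   # -> "electronics"
```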

An apparatus for text classification includes: a word segmentation module, configured to segment the obtained text content to obtain a plurality of words; a query module, configured to determine, for each of the obtained words, the word vector of that word in a spherical space model, where the word vector of a word consists of the normalized word frequency values obtained by normalizing the word's frequency values over the categories, the spherical space model is a multi-dimensional sphere whose radius is a unit length, the dimensionality of the spherical space equals the number of categories, and each category corresponds to a category vector in the spherical space; a calculation module, configured to determine, for each category, the distance from the sum of the word vectors of the obtained words to the category vector of that category; and a classification module, configured to classify the text into the category corresponding to the shortest distance.

Embodiments of the present application construct a spherical space model in advance and classify text based on that model. During classification, the sum of the word vectors of the words in the text and its distances to the category vectors are computed, which determines the category the text should be assigned to. The embodiments implement text classification, and compared with the angle-cosine algorithm of the prior art, the amount of computation is noticeably reduced. Moreover, since the spherical space model has a unit-length radius, the sum of squares of a word's normalized frequency values over the categories also equals the unit length, which is equivalent to equating the amount of semantic information of every word to one unit length. This constrains the amount of semantic information and therefore improves classification accuracy compared with the prior art.


Referring to FIG. 1, the apparatus for text classification in this embodiment includes a word segmentation module 101, a query module 102, a calculation module 103, and a classification module 104.

The word segmentation module 101 is configured to segment the obtained text content to obtain a plurality of words.

The query module 102 is configured to determine, for each of the obtained words, the word vector of that word in the spherical space model. The word vector of a word consists of the normalized word frequency values obtained by normalizing the word's frequency values over the categories. The spherical space model is a multi-dimensional sphere whose radius is a unit length; the dimensionality of the spherical space equals the number of categories, and each category corresponds to a category vector in the spherical space. The unit length can be any constant; for ease of computation, the radius of the spherical space model in this embodiment is 1. The distance from the sum of the word vectors of the text to each category vector may be a straight-line distance or a spherical distance.

The calculation module 103 is configured to determine, for each category, the distance from the sum of the word vectors of the words obtained by segmenting the text to that category's vector.

The classification module 104 is configured to classify the text into the category corresponding to the shortest distance.

When computing the distances from the sum of the word vectors of the text to the category vectors, the calculation module 103 may accumulate, for each category, the normalized word frequency values of the word vectors on that category, obtaining the sum of the normalized word vectors. The classification module 104 then classifies the text into the category corresponding to the largest component of that sum.

The apparatus further includes an interface module 105, a filter module 106, a construction module 107, and a storage module 108, as shown in FIG. 2.

The interface module 105 is configured to obtain the text to be classified from outside the apparatus.

The filter module 106 is configured to filter the words obtained from segmentation and keep those that satisfy a filtering condition. There are various filtering conditions. For example, the coefficient of variation of a word can be computed from its word frequency values over the categories, and only words whose coefficient of variation exceeds a preset threshold (for example 0.5) are kept. The coefficient of variation filters out words whose frequency values hardly change across categories (for example "you" and "I", whose frequency values are roughly the same in every category) while retaining words whose frequency values vary noticeably across categories (for example technical terms, whose frequency values in the categories related to their field are clearly higher than in other categories). A word whose frequency values vary noticeably across categories appears mainly in one or a few categories, and such words contribute more to the accuracy of text classification. This embodiment regards them as good, discriminative words and uses filtering to select them. Other filtering conditions are possible and are not enumerated here.
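
A minimal sketch of this coefficient-of-variation filter, assuming the per-category word frequency values are already available; the 0.5 threshold and the frequency table below are illustrative only.

```python
import numpy as np

def coefficient_of_variation(freqs):
    # Standard deviation of the per-category frequencies divided by their mean.
    freqs = np.asarray(freqs, dtype=float)
    mean = freqs.mean()
    return freqs.std() / mean if mean > 0 else 0.0

def filter_words(word_freqs, threshold=0.5):
    # Keep only words whose frequency varies noticeably across categories.
    return [w for w, f in word_freqs.items()
            if coefficient_of_variation(f) > threshold]

word_freqs = {
    "the":    [120, 118, 121],   # nearly uniform across categories -> dropped
    "camera": [90, 5, 3],        # concentrated in one category     -> kept
}
print(filter_words(word_freqs))  # -> ['camera']
```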

The construction module 107 is configured to construct the spherical space model.

The storage module 108 is configured to store the constructed spherical space model and to store the classified texts by category.

The construction module 107 constructs the spherical space model as follows:

Let the multi-dimensional spherical space be S, whose dimensionality equals the total number of categories. A category Ci is an endpoint on the sphere and corresponds to a category vector in the spherical space, Ci = {0, ..., 0, 1, 0, ..., 0}, i.e. a vector from the sphere's center (the origin) to a point on the sphere whose i-th component is 1 and whose other components are 0. In this embodiment, assuming that the probabilities of any word appearing in two categories Ci and Cj are independent, Ci and Cj must be mutually perpendicular in S; generalizing, all category vectors {Ci} are pairwise perpendicular.
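
In other words, the category vectors are simply the standard basis (one-hot) vectors of the N-dimensional space. A short sketch, with N chosen arbitrarily:

```python
import numpy as np

N = 5                             # hypothetical number of categories
category_vectors = np.eye(N)      # row i is Ci = {0, ..., 0, 1, 0, ..., 0}

# Pairwise perpendicular and unit length: Ci . Cj = 0 for i != j, Ci . Ci = 1.
assert np.allclose(category_vectors @ category_vectors.T, np.eye(N))
```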

In this embodiment, the word vector Wm of the m-th word is a vector in S, m = 1...M, where M is the total number of words. Wm = {V1, V2, ..., VN}, where Vi is the normalized word frequency value on category Ci, i = 1...N, and N is the total number of categories. Since each normalized word frequency value points from the sphere's center toward the sphere's surface along a category axis, it can be expressed as a coordinate on category Ci. A schematic of word vectors and category vectors is shown in FIG. 3, where Ci, Cj and Ck denote three category vectors and O denotes the center of the sphere, which is also the origin (coordinates {0, 0, ..., 0}).

In this embodiment, the amount of semantic information of every word is assumed to be the same constant. The amount of semantic information refers to the logical meaning of the mode of existence and state of motion of the things perceived or expressed by a cognitive subject; it is the informational part of the intrinsic meaning of a word. Defining this constant as the unit length, the length of a word vector in S (i.e., the distance from the endpoint of the word vector to the origin O) is also this constant; for ease of computation, the constant is set to 1. The distance from the endpoint of a word vector to the origin O can therefore be expressed as |Wm − O| = 1 (Formula 1), and from Wm = {V1, V2, ..., VN} it follows that ΣVi² = 1 (Formula 2). Formula 1 shows that the endpoints of the word vectors Wm all lie on the sphere. Since the endpoints of both the word vectors Wm and the category vectors Ci lie on the sphere, how closely the semantics of a word match a category can be expressed by the distance between Wm and Ci: the shorter the distance, the closer the match. The distance between Wm and Ci can be computed as a straight-line distance or a spherical distance.

Since the amount of semantic information of every word is defined to be the same constant, the normalized word frequency values are obtained by normalizing the raw word frequency values, so Σ(Fi×k)² = 1, where Fi is the word's frequency value on category Ci and k is the normalization coefficient. From Σ(Fi×k)² = 1 it follows that k = 1/√(ΣFi²) (Formula 3). Then, with Vi = Fi×k (Formula 4), the conversion function (or quantization function) between word frequency values and the word vector is Wm = δ(Fi) = {Fi}×k (Formula 5).
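
A small sketch of Formulas 3 to 5: given a word's raw frequency values Fi over the categories, compute k and the normalized word vector. The example frequencies are made up.

```python
import numpy as np

def word_vector(freqs):
    # k = 1 / sqrt(sum(Fi^2))  (Formula 3);  Vi = Fi * k  (Formula 4);
    # so Wm = {Fi} * k          (Formula 5) and sum(Vi^2) = 1.
    freqs = np.asarray(freqs, dtype=float)
    k = 1.0 / np.sqrt(np.sum(freqs ** 2))
    return freqs * k

Wm = word_vector([90, 5, 3])        # hypothetical per-category frequencies
print(np.sum(Wm ** 2))              # -> 1.0: the endpoint of Wm lies on the sphere
```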

Through the above, the construction module 107 constructs a spherical space centered at the origin with a radius of unit length 1, on whose surface the endpoints of the word vectors Wm and the category vectors Ci lie. The spherical space model is then trained and refined on training samples to obtain a model that can be applied directly. The sample training process is similar to the classification process described herein, and can also be carried out by other pattern-recognition or manual methods.

For a text D, D = ΣWm, where Wm is the word vector of the m-th word in the text. The calculation module 103 computes the distance between ΣWm and each category vector Ci; the category corresponding to the shortest distance is the category the text should be assigned to. Since ΣWm does not necessarily lie on the sphere, for ease of computation the calculation module 103 may also normalize D by multiplying it by the normalization coefficient k before computing the distances to the category vectors Ci.
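
A sketch of this distance-based assignment, including the optional step of normalizing the summed vector back onto the sphere before measuring straight-line distances; it assumes the word vectors of the text are stacked as rows of a NumPy array.

```python
import numpy as np

def classify_by_distance(word_vectors, normalize=True):
    # Sum the word vectors of the text: D = sum of Wm.
    D = np.sum(word_vectors, axis=0)
    if normalize:
        D = D / np.linalg.norm(D)                   # optional: put D back on the unit sphere
    N = word_vectors.shape[1]                       # one dimension per category
    dists = np.linalg.norm(D - np.eye(N), axis=1)   # distance to each category vector Ci
    return int(np.argmin(dists))                    # index of the nearest category
```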

Since a shorter distance between the sum of the word vectors Wm and a category vector Ci indicates greater similarity between them, the computation can be simplified by setting P = {Pi} = {ΣVmi} (Formula 6), where Pi denotes the weight component of the i-th category. The larger Pi is, the shorter the distance to the category vector Ci; equivalently, a larger ΣVmi corresponds to a shorter distance to Ci. Therefore, instead of computing the distances, the calculation module 103 may accumulate the normalized component values of the obtained words on each category to obtain that category's weight value, and the classification module 104 classifies the text into the category corresponding to the largest weight value.

The reason a larger Pi corresponds to a shorter distance to the category vector Ci is as follows. Since D = ΣWm and Wm = {V1, V2, ..., VN}, D = {ΣVm1, ΣVm2, ..., ΣVmi, ..., ΣVmn}, where ΣVmi is the sum of the normalized word frequency values of all words of the document on the i-th category. Letting Pi = ΣVmi, we have D = {Pi}. The distance from D (normalized by the coefficient k) to Ci can then be expressed as:

|D − Ci| = |{P1, P2, ..., Pi, ..., Pn}×k − {0, 0, ..., 0, 1, 0, ..., 0}|
= k × |{P1, P2, ..., Pi, ..., Pn} − {0, 0, ..., 0, 1/k, 0, ..., 0}|
= k × sqrt((P1−0)² + (P2−0)² + ... + (Pi−1/k)² + ... + (Pn−0)²)
= k × sqrt(P1² + P2² + ... + (Pi² − 2Pi/k + 1/k²) + ... + Pn²)
= k × sqrt(Σ(Pi²) − 2Pi/k + 1/k²)
= sqrt(Σ((Pi×k)²) − 2k×Pi + 1)

Since Σ((Pi×k)²) = 1, we have sqrt(Σ((Pi×k)²) − 2k×Pi + 1) = sqrt(1 − 2k×Pi + 1) = sqrt(2×(1 − k×Pi)). The distance from D to Ci is therefore inversely related to Pi: the larger Pi, the shorter the distance, so the category with the largest Pi is the closest category. Here sqrt denotes the square root (√).
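
The derivation above justifies a shortcut that avoids computing any distance at all: accumulate the normalized word frequency values per category and take the category with the largest component. A minimal sketch:

```python
import numpy as np

def classify_by_max_component(word_vectors):
    # Pi = sum over the text's words of Vmi (Formula 6); the category with the
    # largest Pi is the one at the shortest distance, so just take the arg-max.
    P = np.sum(word_vectors, axis=0)
    return int(np.argmax(P))

# Agrees with the distance-based version sketched earlier, e.g.:
# W = np.array([[0.90, 0.30, 0.3162], [0.80, 0.20, 0.5657]])
# classify_by_max_component(W) == classify_by_distance(W)   # both -> 0
```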

The apparatus may reside in a single computer device, or its modules may be implemented by different computer devices that cooperate to perform the apparatus's functions. Each module in the apparatus may be implemented in software, in hardware, or in a combination of software and hardware.

The internal structure and functions of the text classification apparatus have been described above; the implementation of text classification is described below.

Referring to FIG. 4, the main flow of the text classification method in this embodiment is as follows. Step 401: Segment the obtained text content to obtain a plurality of words.

Step 402: For each of the obtained words, determine the word vector of that word in the spherical space model. The word vector of a word consists of the normalized word frequency values obtained by normalizing the word's frequency values over the categories; the spherical space model is a multi-dimensional sphere whose radius is a unit length, the dimensionality of the spherical space equals the number of categories, and each category corresponds to a category vector in the spherical space.

Step 403: For each category, determine the distance from the sum of the word vectors of the obtained words to the category vector of that category.

Step 404: Classify the text into the category corresponding to the shortest distance.

In this embodiment, the text can be classified either by the distances or by the sum of the word vectors of the text. The classification process is described in detail below for both cases.

Referring to FIG. 5, the flow of classifying a text by the distances in this embodiment is as follows. Step 501: Segment the obtained text content to obtain a plurality of words.

Step 502: Filter the obtained words to keep those that satisfy a filtering condition. The filter module 106 may filter words according to their word frequency values. There are various filtering conditions, for example keeping words whose mean frequency over all categories exceeds a preset value, or keeping words whose largest normalized word vector component (i.e., largest normalized word frequency value) exceeds a preset word frequency threshold; they are not enumerated here.
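
For instance, the second condition mentioned above (largest normalized component exceeding a threshold) could look like the following sketch; the threshold value is illustrative only.

```python
import numpy as np

def keep_word(normalized_vector, freq_threshold=0.7):
    # Keep a word only if its largest normalized word frequency value is high,
    # i.e. the word is strongly concentrated in some category.
    return float(np.max(normalized_vector)) > freq_threshold
```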

Step 503: For each word that satisfies the filtering condition, look up the word's normalized word frequency values on the categories. The normalized word frequency values of all words on the categories are stored in advance; if the stored vocabulary is incomplete and a word cannot be found, its normalized word frequency values on all categories are taken as 0. If what is stored is not the normalized word frequency values but the raw word frequency values, the query module 102 may look up the raw values and normalize them to obtain the normalized values (see Formula 4). This step filters out interfering words (such as rare words and common words) and keeps as many domain-specific words as possible, improving classification accuracy.
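
A sketch of this lookup with the all-zero fallback for words missing from the stored table; the table contents are hypothetical.

```python
import numpy as np

def lookup(word, table, num_categories):
    # Return the stored normalized frequency vector for the word, or a zero
    # vector if the vocabulary table does not contain it (such a word then
    # contributes nothing to the sum of word vectors).
    return np.asarray(table.get(word, np.zeros(num_categories)), dtype=float)

table = {"camera": [0.90, 0.30, 0.3162]}
print(lookup("unknownword", table, 3))   # -> [0. 0. 0.]
```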

Step 504: For each category, determine the distance from the sum of the word vectors of the words that satisfy the filtering condition to that category's vector. The distance from the summed word vector to a category vector may be a straight-line distance or a spherical distance.

Before step 504, the sum of the word vectors may also be normalized so that it falls on the spherical space; the distance from the normalized sum to each category vector is then determined in step 504.

Step 505: Classify the text into the category corresponding to the shortest distance.

The text may further be stored in a database under its category.

Referring to FIG. 6, the flow of classifying a text by the sum of the normalized word vectors in this embodiment is as follows. Step 601: Segment the obtained text content to obtain a plurality of words.

Step 602: Filter the obtained words to keep those that satisfy a filtering condition.

Step 603: For each word that satisfies the filtering condition, look up the word's normalized word frequency values on the categories. The normalized word frequency values of all words on the categories are stored in advance.

Step 604: For each category, accumulate the normalized word frequency values of the obtained words on that category to obtain the sum of the normalized word vectors (see Formula 6).

Step 605: Classify the text into the category corresponding to the largest component of the sum of the normalized word vectors.

The software implementing the embodiments of the present application may be stored on storage media such as floppy disks, hard disks, optical discs and flash memory.

The embodiments of the present application improve on the VSM by constructing a spherical space model in advance and classifying text based on that model. During classification, the distances between the sum of the word vectors of the text and the category vectors are computed, which determines the category the text should be assigned to. The embodiments implement text classification, and compared with the angle-cosine algorithm of the prior art, the amount of computation is noticeably reduced. Moreover, since the spherical space model has a unit-length radius, the sum of squares of a word's normalized word frequency values over the categories also equals the unit length, which is equivalent to equating the amount of semantic information of every word to one unit length. This constrains the amount of semantic information and therefore improves classification accuracy compared with the prior art.

With more accurate text classification, the accuracy of classified storage and of classified retrieval (or search) of texts can also be improved.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to cover them as well.

101 ... Word segmentation module

102 ... Query module

103 ... Calculation module

104 ... Classification module

105 ... Interface module

106 ... Filter module

107 ... Construction module

108 ... Storage module

FIG. 1 is a diagram of the main structure of the apparatus in an embodiment of the present application;

FIG. 2 is a detailed structural diagram of the apparatus in an embodiment of the present application;

FIG. 3 is a schematic diagram of the spherical space in an embodiment of the present application;

FIG. 4 is a flowchart of the main text classification method in an embodiment of the present application;

FIG. 5 is a flowchart of the method of classifying a text by the distances in an embodiment of the present application;

FIG. 6 is a flowchart of the method of classifying a text by the sum of the word vectors in an embodiment of the present application.

Claims (10)

1. A method for text classification, comprising the following steps: segmenting the obtained text content to obtain a plurality of words; filtering the obtained words to obtain a plurality of words that satisfy a filtering condition, the filtering comprising: computing a coefficient of variation of a word from the word's frequency values over the categories, and selecting the words whose coefficient of variation is greater than a preset coefficient-of-variation threshold; for each of the obtained words, determining the word vector of the word in a spherical space model, wherein the dimensionality of the space of the spherical model equals the number of categories, and each category corresponds to a category vector in the spherical space; for each category, determining a distance from the sum of the word vectors of the obtained words to the category vector of the category; and classifying the text into the category corresponding to the shortest distance according to the determined distances.

2. The method of claim 1, wherein the distance from a word vector to a category vector is a straight-line distance or a spherical distance.

3. The method of claim 1, wherein the word vector of a word comprises the normalized word frequency values obtained by normalizing the word's frequency values over the categories; and the spherical space model is a multi-dimensional sphere whose radius is a unit length.

4. The method of claim 3, wherein the unit length is 1.

5. The method of claim 1, wherein the step of determining the distances from the sum of the word vectors of the obtained words to the category vectors comprises: storing in advance the normalized word frequency values of all words over the categories, and accumulating, for each category, the normalized word frequency values of the word vectors of the obtained words on that category to obtain a sum of normalized word vectors; and the step of classifying the text into the category corresponding to the shortest distance comprises: classifying the text into the category corresponding to the largest component of the sum of normalized word vectors.

6. An apparatus for text classification, comprising: a word segmentation module, configured to segment the obtained text content to obtain a plurality of words; a filter module, configured to filter the obtained words to obtain a plurality of words that satisfy a filtering condition, by computing a coefficient of variation of a word from the word's frequency values over the categories and selecting the words whose coefficient of variation is greater than a preset coefficient-of-variation threshold; a query module, configured to determine, for each of the obtained words, the word vector of the word in a spherical space model, wherein the dimensionality of the spherical space model equals the number of categories, and each category corresponds to a category vector in the spherical space; a calculation module, configured to determine, for each category, a distance from the sum of the word vectors of the obtained words to the category vector of the category; and a classification module, configured to classify the text into the category corresponding to the shortest distance according to the determined distances.

7. The apparatus of claim 6, wherein the distance from a word vector to a category is a straight-line distance or a spherical distance.

8. The apparatus of claim 6, wherein the word vector of a word comprises the normalized word frequency values obtained by normalizing the word's frequency values over the categories; and the spherical space model is a multi-dimensional sphere whose radius is a unit length.

9. The apparatus of claim 8, wherein the unit length is 1.

10. The apparatus of claim 6, wherein the calculation module stores in advance the normalized word frequency values of all words over the categories and accumulates, for each category, the normalized word frequency values of the word vectors of the obtained words on that category to obtain a sum of normalized word vectors; and the classification module classifies the text into the category corresponding to the largest component of the sum of normalized word vectors.
TW099112962A 2010-04-23 2010-04-23 Methods and apparatus for classification TWI509434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW099112962A TWI509434B (en) 2010-04-23 2010-04-23 Methods and apparatus for classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW099112962A TWI509434B (en) 2010-04-23 2010-04-23 Methods and apparatus for classification

Publications (2)

Publication Number Publication Date
TW201137641A TW201137641A (en) 2011-11-01
TWI509434B true TWI509434B (en) 2015-11-21

Family

ID=46759593

Family Applications (1)

Application Number Title Priority Date Filing Date
TW099112962A TWI509434B (en) 2010-04-23 2010-04-23 Methods and apparatus for classification

Country Status (1)

Country Link
TW (1) TWI509434B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050228783A1 (en) * 2004-04-12 2005-10-13 Shanahan James G Method and apparatus for adjusting the model threshold of a support vector machine for text classification and filtering
TW200612262A (en) * 2004-10-05 2006-04-16 Microsoft Corp Systems, methods, and interfaces for providing personalized search and information access
TW200729003A (en) * 2006-01-25 2007-08-01 Bridgewell Inc Conceptual keyword function generation method, adjustment method, system, search engine, and calculation method for keyword related value
US20080183471A1 (en) * 2000-11-02 2008-07-31 At&T Corp. System and method of pattern recognition in very high dimensional space

Also Published As

Publication number Publication date
TW201137641A (en) 2011-11-01

Similar Documents

Publication Publication Date Title
CN107122352B (en) Method for extracting keywords based on K-MEANS and WORD2VEC
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
WO2017210949A1 (en) Cross-media retrieval method
US9208220B2 (en) Method and apparatus of text classification
CN111414393A (en) Semantic similar case retrieval method and equipment based on medical knowledge graph
CN111460201B (en) Cross-modal retrieval method for modal consistency based on generative countermeasure network
CN103473327A (en) Image retrieval method and image retrieval system
CN110688452B (en) Text semantic similarity evaluation method, system, medium and device
WO2018176913A1 (en) Search method and apparatus, and non-temporary computer-readable storage medium
Zhan et al. Comprehensive distance-preserving autoencoders for cross-modal retrieval
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
WO2020151152A1 (en) User profile-based clustering method, electronic device, and storage medium
US11755668B1 (en) Apparatus and method of performance matching
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN115080764A (en) Medical similar entity classification method and system based on knowledge graph and clustering algorithm
Zhang et al. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback
CN112364937A (en) User category determination method and device, recommended content determination method and electronic equipment
Karamti et al. Vector space model adaptation and pseudo relevance feedback for content-based image retrieval
CN113486670B (en) Text classification method, device, equipment and storage medium based on target semantics
CN112579783B (en) Short text clustering method based on Laplace atlas
Conilione et al. Fuzzy approach for semantic face image retrieval
US20230178073A1 (en) Systems and methods for parsing and correlating solicitation video content
TWI509434B (en) Methods and apparatus for classification
WO2023177723A1 (en) Apparatuses and methods for querying and transcribing video resumes
CN107423294A (en) A kind of community image search method and system