CN103345471B

CN103345471B - Accessible text presentation method based on multi-manifold association matrix factorization

Info

Publication number: CN103345471B
Application number: CN201310217406.0A
Authority: CN
Inventors: 卜佳俊; 李平; 陈纯; 王北斗; 高珊
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2013-06-03
Filing date: 2013-06-03
Publication date: 2016-08-10
Anticipated expiration: 2033-06-03
Also published as: CN103345471A

Abstract

An accessible text presentation method based on multi-manifold association matrix factorization. After web texts are crawled from the Internet, the texts are processed as follows. First, word segmentation is performed on the texts and their statistical features, including term frequency and inverse document frequency, are extracted to form a TF-IDF vector representation of each text. Next, several text manifolds and word manifolds are constructed, and association matrix factorization based on the multiple manifolds, which takes into account the duality between texts and words, yields low-dimensional representations of the texts and of the words. Finally, the low-dimensional text representations are clustered so that texts on the same or similar topics fall into one group, and the text information is presented again by group. The advantage of the method is that it helps users with disabilities browse text on the Internet topic by topic and quickly presents the set of web texts on a given topic, improving the user experience.

Description

Accessible text presentation method based on multi-manifold association matrix factorization

Technical field

The present invention relates to the technical field of accessible text presentation, and in particular to an accessible text presentation method based on multi-manifold association matrix factorization.

Background art

China has a large population whose composition is increasingly diverse. People with disabilities form an important group within it: their number has reached 85 million, they are an important force in building a harmonious society and developing the national economy, and they are a group that governments at all levels and organizations of all kinds prioritize for assistance. Statistical reports from Chinese disability organizations show that the number of people with disabilities of all kinds has risen year by year over the past several decades. In the big-data-driven information age, more and more people with disabilities use the Internet to obtain information for daily study and life, and they have become a significant group of Internet users. On the Internet, this huge information-sharing platform, text accounts for the overwhelming share of information: topical news, sports reports, book and film reviews and most other content reach users with disabilities in text form. Compared with other users, many people with disabilities find it hard to browse the web information they need because of physical or psychological impairments, while the volume of text on the Internet is overwhelming. An accessible text presentation method is therefore urgently needed to make it easier for people with disabilities to read text on the Internet.

As is well known, the web information provided by websites is loosely organized and lacks centralized classification, while users with disabilities are interested only in web texts on particular topics. This creates a conflict between the abundance and disorder of text information and the difficulty people with disabilities have in reading the pages that interest them. In particular, for people with hearing impairments or limb disabilities, searching for and reading web text is time-consuming and easily causes fatigue and low spirits. If the texts of web pages could be quickly grouped into small sets by topic and then presented to users with disabilities topic by topic, the burden of reading web text would be eased, and both reading efficiency and user experience would improve.

In information retrieval and data mining, texts are mainly clustered on the basis of the cosine similarity between web texts, forming a text set for each topic. After TF-IDF features are extracted from the web documents and vectorized, clustering algorithms from data mining such as k-means can, based on the interdependence between texts and words, divide the web texts into subsets by topic and present them to the user.

Summary of the invention

To help users with disabilities browse web texts on the same topic quickly and easily, and thereby improve the reading experience, the present invention proposes an accessible text presentation method based on multi-manifold association matrix factorization. The method comprises the following steps:

1. After web texts are crawled from the Internet, the following operations are performed on the texts:

1) Perform word segmentation on the texts and extract their statistical features, including term frequency and inverse document frequency, to form a TF-IDF vector representation of each text;

2) Construct several text manifolds and word manifolds, and perform association matrix factorization based on the multiple manifolds, taking into account the duality between texts and words, to obtain low-dimensional representations of the texts and of the words;

3) Cluster the low-dimensional text representations so that texts on the same or similar topics fall into one group, and then present the text information again by group.
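As an illustration of step 3), the following minimal sketch clusters low-dimensional text representations and groups the texts by cluster. Step 3) does not name a clustering algorithm; k-means is used here only because the background section mentions it. The function names `kmeans` and `group_by_topic`, the deterministic initialization from the first k points, and the sample data are all assumptions made for this example.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on a list of equal-length vectors.

    Assumption: centers start at the first k points so the run is repeatable.
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Move every non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_by_topic(titles, labels):
    """Collect text titles under the cluster label they were assigned."""
    groups = {}
    for title, lab in zip(titles, labels):
        groups.setdefault(lab, []).append(title)
    return groups


# Hypothetical low-dimensional text representations (one row per text).
low_dim_texts = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
titles = ["match report", "film review", "league table", "book review"]
labels = kmeans(low_dim_texts, k=2)
groups = group_by_topic(titles, labels)
```

Each entry of `groups` is one topic group that could be presented to the user as a unit.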

2. The specific steps of extracting the statistical features of the texts in step 1) are:

1.1) Each web text is treated as a document, and two statistics are extracted from it: term frequency (TF) and inverse document frequency (IDF). If m distinct words occur in the texts, an m-dimensional TF-IDF vector representation is formed for each text;

1.2) The TF-IDF representations of all texts are normalized in a uniform way.
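Steps 1.1) and 1.2) can be sketched as follows. The text does not fix a particular TF-IDF variant or normalization, so this example assumes relative term frequency, IDF = log(n / document frequency), and l₂ normalization of each vector. The function name `tfidf_vectors` and the sample documents are hypothetical, and word segmentation is assumed to have been done already.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (already segmented).

    Returns (vocab, vectors), where each vector has one entry per vocabulary word.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        # Assumed variant: relative term frequency times log(n / df).
        vec = [(counts[w] / total) * math.log(n / df[w]) for w in vocab]
        # Assumed normalization: scale each vector to unit l2 norm.
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vocab, vectors


docs = [
    ["sports", "match", "team"],
    ["film", "review", "film"],
    ["team", "match", "score"],
]
vocab, X = tfidf_vectors(docs)
```

Here `len(vocab)` plays the role of m and `len(docs)` the role of n; stacking the vectors as columns gives the m × n text data matrix used in the later steps.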

3. The specific steps of constructing the several text manifolds and word manifolds in step 2) are:

2.1) A manifold structure reflects the intrinsic structure of the data and is constructed through a graph Laplacian matrix; the text manifold and the word manifold reflect the intrinsic structure of the text data and of the word data respectively;

2.2) Construct the graph Laplacian matrix L_s of the texts. First obtain n web texts from the Internet; let the feature representation of the i-th text be x_i^s and that of the j-th text be x_j^s. Each text is regarded as a vertex of an undirected graph. If the Euclidean distance between two texts is small, an edge is connected between the corresponding vertices and given an edge weight, which yields an undirected graph reflecting the manifold structure of the text data. The edge weights between texts form an n × n weight matrix W_s. The entries of each column of W_s are summed and placed on the diagonal of a diagonal matrix D_s, whose off-diagonal entries are all 0. The graph Laplacian matrix of the texts is then L_s = D_s - W_s;

2.3) Construct several graph Laplacian matrices L_s of the texts by assigning different weights W_s to the connected edges of the undirected graph, using three weighting schemes: binary weights, cosine similarity and Gaussian kernel weights. If the Euclidean distance between x_i^s and x_j^s is large, i.e. there is no edge between the two vertices, the edge weight of the two texts is 0. If the Euclidean distance between x_i^s and x_j^s is small, i.e. there is an edge between the two vertices, then:

A. for binary weights, the edge weight of the two texts is 1;

B. for cosine similarity, the edge weight of the two texts is (x_i^s)^T x_j^s / (‖x_i^s‖ ‖x_j^s‖), where (·)^T denotes the transpose of a vector or matrix;

C. for Gaussian kernel weights, the edge weight of the two texts is of the form exp(-‖x_i^s - x_j^s‖^2 / (2σ^2)), where ‖·‖ denotes the l₂ norm of a vector and the real parameter σ > 0 is the bandwidth of the Gaussian kernel; different bandwidth values give different Gaussian kernel weights;

2.4) Construct the graph Laplacian matrix L_f of the words. By the duality between texts and words, the feature representation of each word has dimension n; let the feature representation of the i-th word be x_i^f and that of the j-th word be x_j^f. Each word is regarded as a vertex of an undirected graph. If the Euclidean distance between two words is small, an edge is connected between the corresponding vertices and given an edge weight, which yields an undirected graph reflecting the manifold structure of the word data. The edge weights between words form an m × m weight matrix W_f. The entries of each column of W_f are summed and placed on the diagonal of a diagonal matrix D_f, whose off-diagonal entries are all 0. The graph Laplacian matrix of the words is then L_f = D_f - W_f;

2.5) Construct several graph Laplacian matrices L_f of the words, in the same way as the several graph Laplacian matrices L_s of the texts are constructed.
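The construction in steps 2.2) to 2.5) can be sketched as follows. The text only says that vertices are joined when their Euclidean distance is small; this example assumes a symmetric k-nearest-neighbour rule for that decision, and a Gaussian kernel with denominator 2σ², neither of which is confirmed by the text. All function names and the sample points are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_edges(points, k):
    """Assumed closeness rule: join i and j if either is among the other's k nearest."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        for j in others[:k]:
            edges.add((i, j))
            edges.add((j, i))
    return edges


def edge_weight(a, b, scheme, sigma=1.0):
    """Weight of an existing edge under one of the three weighting schemes."""
    if scheme == "binary":
        return 1.0
    if scheme == "cosine":
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0
    if scheme == "gaussian":
        # Assumed kernel form; sigma is the bandwidth.
        return math.exp(-sq_dist(a, b) / (2.0 * sigma ** 2))
    raise ValueError(f"unknown scheme: {scheme}")


def graph_laplacian(points, k=1, scheme="binary", sigma=1.0):
    """Return L = D - W, where D holds the column sums of W on its diagonal."""
    n = len(points)
    edges = knn_edges(points, k)
    W = [[edge_weight(points[i], points[j], scheme, sigma) if (i, j) in edges else 0.0
          for j in range(n)] for i in range(n)]
    D = [sum(W[i][j] for i in range(n)) for j in range(n)]
    return [[(D[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]


texts = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
laplacians = [graph_laplacian(texts, k=1, scheme=s)
              for s in ("binary", "cosine", "gaussian")]
```

Passing the word feature vectors (each of dimension n) instead of the text feature vectors gives the word Laplacians L_f in the same way, and varying `sigma` gives further Gaussian-kernel manifolds.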

4. The specific steps of the multi-manifold association matrix factorization in step 2) are:

3.1) Suppose n texts are obtained from the Internet and that they cover c_s topics. The feature representation of each text is a column vector of a matrix, so all the texts form an m × n data matrix X_s. The texts are composed of m words, which cover c_f topics. The feature representation of each word is a column vector of a matrix, so all the words form an n × m data matrix X_f. Because of the duality between texts and words, X_f = X_s^T. The text and word data matrices are merged into an (n+m) × (n+m) association matrix

R = \begin{pmatrix} O & X_{f} \\ X_{s} & O \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words;

3.2) The text data matrix is factorized into three parts, i.e. X_s ≈ V_f S_f V_s^T, where the m × c_f matrix V_f is the low-dimensional representation of the words, the n × c_s matrix V_s is the low-dimensional representation of the texts, and the c_f × c_s matrix S_f is the compressed representation of the word data. Similarly, the word data matrix is factorized into three parts, i.e. X_f ≈ V_s S_s V_f^T, where the c_s × c_f matrix S_s is the compressed representation of the text data. This gives the (n+m) × (c_f+c_s) association low-dimensional representation matrix

V = \begin{pmatrix} V_{s} & O \\ O & V_{f} \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words and by the numbers of topics involved. It also gives the (c_f+c_s) × (c_f+c_s) association low-dimensional representation matrix

S = \begin{pmatrix} O & S_{f} \\ S_{s} & O \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of topics covered by the texts and the words;

3.3) Construct q text manifolds and q word manifolds with the different weighting schemes, i.e. L_s^1, …, L_s^q and L_f^1, …, L_f^q, and from them build q association manifold matrices of size (n+m) × (n+m). The i-th association manifold matrix is

L_{i} = \begin{pmatrix} L_{s}^{i} & O \\ O & L_{f}^{i} \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words. To better approximate the true data manifold, each manifold is given a weight coefficient μ_i > 0, forming a linear combination of the multiple manifolds,

L = \sum_{i = 1}^{q} \mu_{i} L_{i},

subject to the condition

\sum_{i = 1}^{q} \mu_{i} = 1;

3.4) The multi-manifold association matrix factorization minimizes the regularized objective function

\min_{V} \left\{ \| R - V S V^{T} \|_{F}^{2} + \alpha \, \mathrm{Tr}\left[ V^{T} \left( \sum_{i = 1}^{q} \mu_{i} L_{i} \right) V \right] + \beta \| \mu \|^{2} \right\},

\text{s.t.} \quad \sum_{i = 1}^{q} \mu_{i} = 1, \quad \mu \geq 0, \quad V \geq 0,

where ‖·‖_F is the Frobenius norm of a matrix, ‖·‖ is the l₂ norm of a vector and Tr(·) is the trace of a matrix. The regularization factors α > 0 and β > 0 adjust the contribution of the manifold structure and guard against overfitting, respectively. The low-dimensional text representation obtained by solving this objective function approximates the intrinsic structure of the original text data while preserving the local geometric structure of both the text data and the word data, so that texts on the same topic lie as close together as possible.
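To make the pieces of steps 3.1) to 3.4) concrete, the sketch below builds the association matrix R from a text data matrix, forms block-diagonal association manifold matrices, and evaluates the regularized objective for given V, S and μ. It only evaluates the objective; the text does not describe the solver, and none is implemented here. All function names and the toy values are hypothetical, and matrices are plain lists of lists.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def block_diag(A, B):
    """Place A and B on the diagonal of a larger matrix, zeros elsewhere."""
    ca, cb = len(A[0]), len(B[0])
    return [row + [0.0] * cb for row in A] + [[0.0] * ca + row for row in B]


def association_matrix(X_s):
    """R = [[O, X_f], [X_s, O]] with X_f = X_s^T, where X_s is m x n."""
    m, n = len(X_s), len(X_s[0])
    X_f = transpose(X_s)
    return [[0.0] * n + row for row in X_f] + [row + [0.0] * m for row in X_s]


def objective(R, V, S, manifolds, mu, alpha, beta):
    """||R - V S V^T||_F^2 + alpha * Tr[V^T (sum_i mu_i L_i) V] + beta * ||mu||^2."""
    Vt = transpose(V)
    approx = matmul(matmul(V, S), Vt)
    fit = sum((r - a) ** 2 for rr, ar in zip(R, approx) for r, a in zip(rr, ar))
    size = len(R)
    # Linear combination of the q association manifold matrices.
    L = [[sum(w * Li[i][j] for w, Li in zip(mu, manifolds)) for j in range(size)]
         for i in range(size)]
    VtLV = matmul(matmul(Vt, L), V)
    smooth = sum(VtLV[i][i] for i in range(len(VtLV)))
    return fit + alpha * smooth + beta * sum(w * w for w in mu)


X_s = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # m = 3 words, n = 2 texts
R = association_matrix(X_s)                 # 5 x 5, symmetric

L_text = [[1.0, -1.0], [-1.0, 1.0]]         # toy text Laplacian (n x n)
L_word = [[0.0] * 3 for _ in range(3)]      # toy word Laplacian (m x m)
manifolds = [block_diag(L_text, L_word), [[0.0] * 5 for _ in range(5)]]
```

A solver would search over non-negative V (and over S and μ under the stated constraints) for the arguments that make `objective` as small as possible.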

The accessible text presentation method based on multi-manifold association matrix factorization proposed by the present invention has the following advantages: it exploits the duality between texts and words and clusters the statistical feature representations of the texts, so that similar texts are presented in groups; it is applicable to web text of all types and requires no manual back-end operation; and it can be used to help people with disabilities read web text without barriers as well as to help ordinary users read text more efficiently.

Brief description of the drawings

Fig. 1 is the method flow diagram of the present invention.

Detailed description of the invention

With reference to the accompanying drawing, the invention is further described as follows.

The embodiment is the accessible text presentation method based on multi-manifold association matrix factorization: after texts are crawled from the Internet, steps 1) to 3), together with the specific sub-steps 1.1) to 3.4) and their formulas, are carried out exactly as set out in the Summary of the invention.

The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the invention should not be regarded as limited to the specific forms stated in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive on the basis of the inventive concept.

Claims

1. An accessible text presentation method based on multi-manifold association matrix factorization, characterized in that, after web texts are crawled from the Internet, the following operations are performed on the texts:

1) perform word segmentation on the texts and extract their statistical features, including term frequency and inverse document frequency, to form a TF-IDF vector representation of each text;

2) construct several text manifolds and word manifolds, and perform association matrix factorization based on the multiple manifolds, taking into account the duality between texts and words, to obtain low-dimensional representations of the texts and of the words;

3) cluster the low-dimensional text representations so that texts on the same or similar topics fall into one group, and then present the text information again by group;

the specific steps of extracting the statistical features of the texts in step 1) are:

1.1) each web text is treated as a document, and two statistics are extracted from it: term frequency (TF) and inverse document frequency (IDF); if m distinct words occur in the texts, an m-dimensional TF-IDF vector representation is formed for each text;

1.2) the TF-IDF representations of all texts are normalized in a uniform way;

the specific steps of constructing the several text manifolds and word manifolds in step 2) are:

2.1) a manifold structure reflects the intrinsic structure of the data and is constructed through a graph Laplacian matrix; the text manifold and the word manifold reflect the intrinsic structure of the text data and of the word data respectively;

2.2) construct the graph Laplacian matrix L_s of the texts: first obtain n web texts from the Internet, and let the feature representation of the i-th text be x_i^s and that of the j-th text be x_j^s; each text is regarded as a vertex of an undirected graph; if the Euclidean distance between two texts is small, an edge is connected between the corresponding vertices and given an edge weight, which yields an undirected graph reflecting the manifold structure of the text data; the edge weights between texts form an n × n weight matrix W_s; the entries of each column of W_s are summed and placed on the diagonal of a diagonal matrix D_s, whose off-diagonal entries are all 0; the graph Laplacian matrix of the texts is then L_s = D_s - W_s;

2.3) construct several graph Laplacian matrices L_s of the texts by assigning different weights W_s to the connected edges of the undirected graph, using three weighting schemes: binary weights, cosine similarity and Gaussian kernel weights; if the Euclidean distance between x_i^s and x_j^s is large, i.e. there is no edge between the two vertices, the edge weight of the two texts is 0; if the Euclidean distance between x_i^s and x_j^s is small, i.e. there is an edge between the two vertices, then:

A. for binary weights, the edge weight of the two texts is 1;

B. for cosine similarity, the edge weight of the two texts is (x_i^s)^T x_j^s / (‖x_i^s‖ ‖x_j^s‖), where (·)^T denotes the transpose of a vector or matrix;

C. for Gaussian kernel weights, the edge weight of the two texts is of the form exp(-‖x_i^s - x_j^s‖^2 / (2σ^2)), where ‖·‖ denotes the l₂ norm of a vector and the real parameter σ > 0 is the bandwidth of the Gaussian kernel; different bandwidth values give different Gaussian kernel weights;

2.4) construct the graph Laplacian matrix L_f of the words: by the duality between texts and words, the feature representation of each word has dimension n; let the feature representation of the i-th word be x_i^f and that of the j-th word be x_j^f; each word is regarded as a vertex of an undirected graph; if the Euclidean distance between two words is small, an edge is connected between the corresponding vertices and given an edge weight, which yields an undirected graph reflecting the manifold structure of the word data; the edge weights between words form an m × m weight matrix W_f; the entries of each column of W_f are summed and placed on the diagonal of a diagonal matrix D_f, whose off-diagonal entries are all 0; the graph Laplacian matrix of the words is then L_f = D_f - W_f;

2.5) construct several graph Laplacian matrices L_f of the words, in the same way as the several graph Laplacian matrices L_s of the texts are constructed;

the specific steps of the multi-manifold association matrix factorization in step 2) are:

3.1) suppose n texts are obtained from the Internet and that they cover c_s topics; the feature representation of each text is a column vector of a matrix, so all the texts form an m × n data matrix X_s; the texts are composed of m words, which cover c_f topics; the feature representation of each word is a column vector of a matrix, so all the words form an n × m data matrix X_f; because of the duality between texts and words, X_f = X_s^T; the text and word data matrices are merged into an (n+m) × (n+m) association matrix

R = \begin{pmatrix} O & X_{f} \\ X_{s} & O \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words;

3.2) the text data matrix is factorized into three parts, i.e. X_s ≈ V_f S_f V_s^T, where the m × c_f matrix V_f is the low-dimensional representation of the words, the n × c_s matrix V_s is the low-dimensional representation of the texts, and the c_f × c_s matrix S_f is the compressed representation of the word data; similarly, the word data matrix is factorized into three parts, i.e. X_f ≈ V_s S_s V_f^T, where the c_s × c_f matrix S_s is the compressed representation of the text data; this gives the (n+m) × (c_f+c_s) association low-dimensional representation matrix

V = \begin{pmatrix} V_{s} & O \\ O & V_{f} \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words and by the numbers of topics involved; it also gives the (c_f+c_s) × (c_f+c_s) association low-dimensional representation matrix

S = \begin{pmatrix} O & S_{f} \\ S_{s} & O \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of topics covered by the texts and the words;

3.3) construct q text manifolds and q word manifolds with the different weighting schemes, i.e. L_s^1, …, L_s^q and L_f^1, …, L_f^q, and from them build q association manifold matrices of size (n+m) × (n+m); the i-th association manifold matrix is

L_{i} = \begin{pmatrix} L_{s}^{i} & O \\ O & L_{f}^{i} \end{pmatrix},

where O denotes an all-zero matrix whose dimensions are determined by the numbers of texts and words; to better approximate the true data manifold, each manifold is given a weight coefficient μ_i > 0, forming a linear combination of the multiple manifolds,

L = \sum_{i = 1}^{q} \mu_{i} L_{i},

subject to the condition

\sum_{i = 1}^{q} \mu_{i} = 1;

3.4) the multi-manifold association matrix factorization minimizes the regularized objective function

\min_{V} \left\{ \| R - V S V^{T} \|_{F}^{2} + \alpha \, \mathrm{Tr}\left[ V^{T} \left( \sum_{i = 1}^{q} \mu_{i} L_{i} \right) V \right] + \beta \| \mu \|^{2} \right\},

\text{s.t.} \quad \sum_{i = 1}^{q} \mu_{i} = 1, \quad \mu \geq 0, \quad V \geq 0,

where ‖·‖_F is the Frobenius norm of a matrix, ‖·‖ is the l₂ norm of a vector and Tr(·) is the trace of a matrix; the regularization factors α > 0 and β > 0 adjust the contribution of the manifold structure and guard against overfitting, respectively; the low-dimensional text representation obtained by solving this objective function approximates the intrinsic structure of the original text data while preserving the local geometric structure of both the text data and the word data, so that texts on the same topic lie as close together as possible.