CN103488662A - Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit - Google Patents


Info

Publication number
CN103488662A
Authority
CN
China
Prior art keywords: vector, processing unit, neuron, parallelization, SOM
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310112420.4A
Other languages
Chinese (zh)
Inventor
叶允明 (Ye Yunming)
张金超 (Zhang Jinchao)
黄晓辉 (Huang Xiaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201310112420.4A priority Critical patent/CN103488662A/en
Publication of CN103488662A publication Critical patent/CN103488662A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention relates to a clustering method and system for a parallelized self-organizing map (SOM) neural network based on a graphics processing unit (GPU). Compared with traditional serial clustering methods, the invention clusters large-scale data faster by parallelizing the algorithm and exploiting the GPU's parallel processing capability. The invention mainly covers two aspects: (1) a parallelized SOM clustering method designed around the GPU's highly parallel computing capability, comprising the following steps: counting keyword frequencies in documents in parallel to obtain a word-frequency matrix, computing text feature vectors in parallel to generate the feature matrix of the data set, and obtaining the cluster structure of massive data objects through the parallelized SOM neural network; and (2) a parallelized text clustering system based on a CPU/GPU cooperation framework, which exploits the complementary computing capabilities of the graphics processing unit (GPU) and the central processing unit (CPU).

Description

Self-organizing map neural network clustering method and system based on Graphics Processing Unit
Technical field
The present invention relates to a parallelized self-organizing map neural network clustering method and system, and in particular to a parallelized self-organizing map neural network clustering method and system based on a graphics processing unit.
Background technology
At present, with the popularization of computers, the number of Internet users grows continuously, and users generate a large amount of information on the network every day. At the same time, social media systems with large user bases also accumulate large volumes of new data daily. Data mining and machine learning algorithms provide feasible means of extracting valuable information from these data, but most algorithms have complex learning processes that require iterative training, and processing massive data takes a long time. Even if useful information is eventually extracted, it may no longer be timely; this calls for faster algorithms or higher-performance computing facilities. High-performance machines or CPU clusters can certainly accelerate the computation, but they require enterprises to bear a huge capital investment. Multi-core technology is now relatively mature, and the numerical computing performance of the graphics processing unit (GPU) far exceeds that of the CPU; exploiting the many-core characteristics of the GPU to fully expose the parallelism of algorithms has become a research hotspot in computer science.
In the data mining field, some data mining algorithms have been adapted to run on graphics processing devices, achieving speedups of at least 5-6 times, and in some cases 20-30 times. An important research direction in data mining is the mining of text data, in which text clustering plays an important role. Clustering gathers data into different clusters according to the features of the data and the degree of similarity between data items. According to statistics, 80% of human society's information exists with text as its carrier. Text clustering technology can effectively organize, summarize, and navigate text data.
The SOM (Self-Organizing Map, "SOM" for short) network is an artificial neural network designed by simulating the way the human brain processes external information. It is an unsupervised learning method and is well suited to clustering high-dimensional text data. The SOM network does not require the user to specify the number of clusters; the network clusters adaptively during training, is insensitive to outlier noise data, and therefore has strong noise resistance. SOM clusters according to the distribution of the training samples and is insensitive to the shape of the data. However, existing SOM algorithms converge slowly when processing high-dimensional data, and the clustering time is long.
Text clustering is a data mining technique that divides text document resources into several clusters according to a specified similarity criterion, so that documents within a cluster are as similar as possible while the similarity between different clusters is as small as possible. Text clustering rests on the well-known cluster hypothesis: similar documents have high similarity, and dissimilar documents have low similarity. As an unsupervised machine learning method, clustering needs neither a prior training process nor manual labeling of document categories; it therefore offers flexibility and a high degree of automated processing, has become an important means of effectively organizing, summarizing, and navigating text information, and is receiving attention from a growing number of researchers.
Summary of the invention
The technical problem solved by the present invention is: to build a parallelized self-organizing map neural network clustering method and system based on a graphics processing unit (Graphic Processing Unit, "GPU" for short), overcoming the prior-art problem that large data volumes make computation slow in the text clustering process.
The technical scheme of the present invention provides a parallel self-organizing map neural network clustering method based on a graphics processing unit, comprising the following steps:
Parallel keyword word-frequency statistics: segment the text content into words to obtain the keyword set, count the frequency of each keyword in each document in parallel, and obtain the frequency matrix.
Parallel feature vector calculation: convert the keyword frequency matrix into the corresponding feature vector matrix, in which each feature vector represents one document.
Parallel SOM clustering: design the SOM network structure according to the feature vector matrix and initialize the SOM network; compute in parallel the distances between the input sample and all output-neuron weight vectors; compare these distances to obtain the minimum-distance best-matching neuron J; update the weight vectors of the best neuron and of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron; then compute the network error rate E_t in parallel on the graphics processing unit. If the network error rate E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, the SOM network training ends; otherwise a new round of training starts. Each learning step moves the neighborhood of the best-matching neuron closer to the input data vector, so input feature vectors that are close in distance are gathered into the same cluster; the set of clusters formed is the final clustering result.
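The best-neuron search in the step above can be sketched in a few lines. The following Python/NumPy fragment is an illustrative sketch only (the function name and the use of NumPy are assumptions of this description, not the patent's GPU kernels); the per-neuron independence that NumPy vectorizes here is exactly what the patent maps onto one GPU thread per neuron:

```python
import numpy as np

def best_matching_neuron(x, weights):
    """Find the best-matching neuron J for input vector x.

    weights: (num_neurons, dim) array, one row per output neuron.
    The distance to each neuron is an independent computation, which
    the patent exploits with one GPU thread per neuron; NumPy's
    vectorized operations stand in for that parallelism here.
    """
    # Squared Euclidean distance from x to every neuron weight vector.
    d2 = np.sum((weights - x) ** 2, axis=1)
    return int(np.argmin(d2))  # index of the minimum-distance neuron

# Tiny example: 4 output neurons in a 2-D feature space.
w = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(best_matching_neuron(np.array([0.9, 0.1]), w))  # -> 1
```

The subsequent neighborhood update then pulls `w[J]` and its grid neighbors toward `x`, as described above.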
A further technical scheme of the present invention is: the word-frequency statistics of each document are mutually independent; the present invention designs one thread per document to count word frequencies, and the statistics are then run in parallel by the multiple threads of the graphics processing unit.
A further technical scheme of the present invention is: the feature vector computation of each document is independent; the present invention designs one thread per document, and the threads are executed concurrently by the graphics processing unit. The feature values are computed with the formula

x_ij = log2(tf_ij + 1.0) * log(m / m_j),

and are normalized as

x_ij = x_ij / sqrt( Σ_{p=1}^{n} x_ip^2 ).

In these formulas, x_ij is the feature value of the j-th feature word in document d_i, tf_ij is the number of occurrences of the j-th feature word in document d_i, m/m_j is the inverse document frequency of the j-th feature word, m is the total number of documents, and m_j is the number of documents containing the j-th feature word.
A further technical scheme of the present invention is: in the parallel feature vector calculation step, the feature vector of each document is computed by multiple graphics-processing threads in parallel.
A further technical scheme of the present invention is: the computation of the distance between the input feature vector and each output neuron's weight vector is independent per neuron; multiple graphics-processing threads compute these distances in parallel, the system opening one thread for each neuron.
A further technical scheme of the present invention is: the computation of each neuron's weight-vector error between two adjacent iterations is independent; multiple graphics-processing threads compute the weight-vector errors in parallel, the system opening one thread for each neuron.
The technical scheme of the present invention also builds a self-organizing map neural network clustering system based on a graphics processing unit, comprising a hardware part and a software part. Hardware part: a CPU/GPU cooperation framework is adopted; serially executed code runs on the CPU, code executed in parallel runs on the GPU, and data are exchanged between video memory and main memory through the data transfer mechanism provided by the GPU. The software part is divided into three modules: a parallelized keyword word-frequency statistics module, a parallelized feature vector computation module, and a parallelized SOM clustering module. The parallelized keyword word-frequency statistics module segments the text content into words to obtain the keyword set, counts the frequency of each keyword in each document in parallel, and obtains the frequency matrix. The parallelized feature vector computation module converts the keyword frequency matrix into the corresponding feature vector matrix, each feature vector representing one document. The parallelized SOM clustering module designs the SOM network structure according to the feature vector matrix, initializes the SOM network, computes in parallel the distances between the input sample and all output-neuron weight vectors, compares these distances to obtain the minimum-distance best-matching neuron J, updates the weight vectors of the best neuron and of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron, and then computes the network error rate E_t in parallel on the graphics processing unit. If the network error rate E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, the SOM network training ends; otherwise a new round of training starts. Each learning step moves the neighborhood of the best-matching neuron closer to the input data vector, so input feature vectors that are close in distance are gathered into the same cluster; the set of clusters formed is the final clustering result.
A further technical scheme of the present invention is: in each of the parallelized keyword word-frequency statistics module, the parallelized feature vector computation module, and the parallelized SOM clustering module, several kernel functions are designed to accelerate the algorithm in parallel.
A further technical scheme of the present invention is: in the parallel keyword word-frequency statistics module, one kernel function is designed for the keyword word-frequency statistics; in the parallel feature vector calculation module, two kernel functions are designed for the feature vector computation and two for the feature vector normalization.
A further technical scheme of the present invention is: in the parallel SOM clustering module, one kernel function is designed for computing the distance between the input feature vector and the output neurons, one for computing each neuron's network weight-vector error between two adjacent iterations, and one for reducing (summing) the weight-vector errors over the network region.
The technical effect of the present invention is: the present invention is a parallel self-organizing map neural network clustering method and system based on a graphics processing unit. By designing a parallelized text clustering algorithm, and by exploiting the complementary computing capabilities of the graphics processing unit (GPU) and the central processing unit (CPU), a parallelized text clustering system based on the CPU/GPU cooperation framework is designed. Specifically, the invention comprises two parts. First, a clustering method of a parallelized self-organizing neural network based on a graphics processing unit is designed, in which three aspects are parallelized: the keyword word-frequency statistics of documents, the feature vector computation of documents, and the SOM clustering algorithm. Second, a parallelized text clustering system based on the CPU/GPU cooperation framework is developed, in which three computing modules are designed: a parallelized keyword word-frequency statistics module, a parallelized feature vector computation module, and a parallelized SOM clustering module; several kernel functions are designed in each module to accelerate the algorithm. Through the parallelization of the algorithm and the parallel acceleration system based on the graphics processing unit, the present invention can cluster large-scale data faster and is well suited to clustering problems such as high-dimensional text data.
The accompanying drawing explanation
Fig. 1 is the flow chart of the present invention.
Fig. 2 is a schematic diagram of the multithreaded word-frequency statistics of the present invention.
Fig. 3 is a schematic diagram of serial statistics of the number of documents containing each keyword.
Fig. 4 is the parallel feature matrix computation process diagram of the present invention.
Fig. 5 is a schematic diagram of the parallel statistics of the number of documents containing each keyword of the present invention.
Fig. 6 is a schematic diagram of the multithreaded feature matrix computation of the present invention.
Fig. 7 is a schematic diagram of the multithreaded vector norm computation of the present invention.
Fig. 8 is a schematic diagram of the multithreaded normalization of the present invention.
Fig. 9 is the SOM network topology of the present invention.
Fig. 10 is the CPU/GPU hardware architecture diagram of the present invention.
Fig. 11 is a schematic diagram of the CPU/GPU cooperation framework of the parallel SOM algorithm of the present invention.
Fig. 12 is the flow chart of the kernel function for parallel statistics of the document frequency matrix of the present invention.
Fig. 13 is the flow chart of the kernel function for parallel statistics of the number of documents containing each keyword of the present invention.
Fig. 14 is a schematic diagram of the distance computation between input feature vectors and neurons of the present invention.
Fig. 15 is the flow chart of the kernel function for computing the distance between input feature vectors and neurons of the present invention.
Fig. 16 is a schematic diagram of the data subtraction operation of the present invention.
Fig. 17 is a schematic diagram of the row-or-column summation of the error matrix of the present invention.
Fig. 18 is the flow chart of the kernel function for the row-or-column summation of the error matrix of the present invention.
Embodiment
The technical solution of the present invention is further described below in conjunction with specific embodiments.
As shown in Fig. 1, a specific embodiment of the present invention provides a parallelized self-organizing map neural network clustering method based on a graphics processing unit, comprising the following steps:
Step 1: parallel keyword word-frequency statistics. Segment the text content into words to obtain the keyword set. For large-scale text data, the large number of computing units on the graphics processing device allows one thread to be provided for each text document so that keyword frequencies are counted in parallel, yielding the frequency matrix.
The specific implementation process is as follows. A computer does not have human intelligence: a person who reads an article forms an understanding of its content according to his or her own comprehension, but a computer cannot easily "understand" an article; fundamentally, it only recognizes 0 and 1, so the text must be converted into a form the computer can process. At present, in information processing, text is mainly represented by the vector space model (Vector Space Model, "VSM" for short). The basic idea of the vector space model is to represent a text as a vector (X_1, X_2, ..., X_n), where X_j is the weight of the j-th feature item. What should be chosen as feature items? Generally, characters or words can be selected; according to experimental results, choosing words is generally considered better than characters or phrases. Therefore, to represent a text as a word-based vector space model, the text must first be segmented into words, and the words after segmentation serve as the dimensions of the vector space. After the document content is segmented, denoising is performed, yielding the keyword set corresponding to the text.
Parallel keyword word-frequency statistics segments and denoises each document, counts the frequency of each keyword occurring in that document, and then forms the frequency matrix of the whole data set. Because the word-frequency statistics of each document are mutually independent, one thread can be opened on the graphics processing unit for each document, achieving a high degree of parallelism in the computation, as shown in Fig. 2. A row of the frequency matrix represents a document, a column represents a keyword, and the entry at the intersection of a row and a column is the frequency of that keyword in that document. If a keyword does not appear in a document, the corresponding entry of the frequency matrix is set to zero.
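The frequency matrix just described can be sketched as follows. This is a hedged serial illustration (the function name, the toy documents, and the use of Python/NumPy are assumptions of this description, not the patent's implementation); on the GPU, each iteration of the outer loop would be handled by its own thread:

```python
import numpy as np

def frequency_matrix(docs, vocab):
    """Build the document-by-keyword frequency matrix.

    Row i holds the counts for document i; column j holds keyword j.
    A keyword absent from a document yields a zero entry, matching the
    description above. Each row is computed independently, which is why
    the patent can assign one GPU thread per document.
    """
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for i, doc in enumerate(docs):
        for word in doc:          # doc is already segmented into keywords
            j = index.get(word)
            if j is not None:
                tf[i, j] += 1
    return tf

docs = [["gpu", "som", "gpu"], ["som", "cluster"]]
vocab = ["gpu", "som", "cluster"]
print(frequency_matrix(docs, vocab))
# [[2 1 0]
#  [0 1 1]]
```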
Step 2: parallel feature vector calculation. The keyword frequency matrix is converted into the corresponding feature vector matrix by parallel computation on the graphics processing unit. Each feature vector represents a text document.
The specific implementation process is as follows: according to the keyword frequency matrix, the feature vector corresponding to each document is computed in parallel to generate the feature vector matrix. A row of the matrix represents a document, a column represents a feature, and the entry at the intersection of a row and a column is the value of that feature for that document.
The present invention adopts the vector space model to describe documents; it considers only the words that occur in a document, not the relations between words or the structure of the document. In the vector space model, the document space is regarded as a vector space formed by feature vectors. A document is a feature vector in this space, and the feature words in the document can be regarded as the dimensions of the vector space model. A feature vector is denoted d_i = (x_i1, x_i2, ..., x_in), where x_ij represents the weight of word j in document d_i. A crude way to describe the weight is to use the Boolean value 0 or 1 to indicate whether a feature word occurs in a document. Tf*idf (term frequency * inverse document frequency) is a commonly used document feature weighting method; it mainly considers the term frequency tf of a feature word, the inverse document frequency idf, and a normalization factor. To ensure clustering quality, the present invention adopts the LTC weight as the weight of a feature word, given by the formula

x_ij = log2(tf_ij + 1.0) * log(m / m_j).

In the formula, x_ij is the feature value of the j-th feature word in document d_i, tf_ij is the number of occurrences of the j-th feature word in document d_i, m/m_j is the inverse document frequency of the j-th feature word, m is the total number of documents, and m_j is the number of documents containing the j-th feature word.
The LTC weight formula takes the logarithm of the term frequency tf on the basis of the tf*idf formula, further reducing the influence of tf on the feature vector; this formula is more reasonable in practical applications. At the same time, because differing document lengths would affect the vector weight values, the weight values computed by the formula must be normalized, that is:

x_ij = x_ij / sqrt( Σ_{p=1}^{n} x_ip^2 ).
For high-dimensional text data, computing every dimension of each text feature vector is very time consuming, so the present invention designs a parallel weight computation method to accelerate the weight computation process. Computing the LTC weight requires the value m_j, the number of documents in the collection in which a given keyword occurs. On a high-dimensional large-scale data set, conventional serial statistics are very time consuming, as shown in Fig. 3. After the values m_j are obtained, a weight is computed for every feature word of every document; because the text data are very high dimensional, a traditional serial algorithm would also have high time cost. After the weight vectors are obtained, they must be normalized, which requires computing the length of each vector and then normalizing each weight; this whole process also consumes considerable time. Therefore, the present invention parallelizes the three time-consuming parts of the above feature-vector weight computation in the graphics processing unit environment; the parallel feature vector computation process is shown in Fig. 4.
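The three-part weight computation described above can be summarized in one hedged sketch (the function name and the use of NumPy are assumptions of this description, not the patent's kernels); it computes m_j, the LTC weights, and the row normalization exactly as the formulas state:

```python
import numpy as np

def ltc_feature_matrix(tf):
    """Compute LTC-weighted, length-normalized feature vectors.

    tf: (m, n) term-frequency matrix, one row per document.
    x_ij = log2(tf_ij + 1) * log(m / m_j), then each row is divided
    by its Euclidean norm, following the formulas above. Each row, and
    each column count m_j, is an independent computation, which is what
    the patent parallelizes across GPU threads.
    """
    m = tf.shape[0]
    m_j = np.count_nonzero(tf, axis=0)       # documents containing word j
    idf = np.log(m / np.maximum(m_j, 1))     # guard against empty columns
    x = np.log2(tf + 1.0) * idf
    norms = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / np.where(norms == 0.0, 1.0, norms)

tf = np.array([[2, 1, 0],
               [0, 1, 1],
               [3, 0, 0]])
x = ltc_feature_matrix(tf)
print(np.linalg.norm(x, axis=1))   # each nonzero row has unit length
```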
(1) Multithreaded parallel statistics of the document frequency m_j of each keyword in the document collection.
For each feature word, the corresponding m_j value must be counted for the subsequent weight computation. We convert this problem into counting the nonzero values in each column of the frequency matrix, computed as

m_j = Σ_{i=1}^{m} x_ij, where x_ij = 0 if tf_ij = 0, and x_ij = 1 if tf_ij > 0,

where m is the number of documents in the data set. In the formula above, when the term frequency is zero, the feature word does not appear in the corresponding document and is not counted; when the term frequency is greater than zero, the word appears in the document and the count m_j is incremented by one. Because the m_j value of each feature word is counted independently, data in matrix form can be processed with a multithreaded implementation that takes full advantage of the parallel processing capability of the graphics processing unit to accelerate this process. Fig. 5 is a schematic diagram of the parallel statistics of the number of documents containing each keyword.
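A minimal sketch of the column-count formula above, under the assumption of a NumPy frequency matrix (the function name is illustrative, not from the patent); each column is an independent reduction, which is the independence the patent exploits with one GPU thread per keyword:

```python
import numpy as np

def document_frequency(tf):
    """Count m_j, the number of documents containing feature word j.

    Implements m_j = sum_i x_ij with x_ij = 1 when tf_ij > 0 and
    x_ij = 0 otherwise, i.e. a nonzero count down each column.
    """
    return np.sum(tf > 0, axis=0)

tf = np.array([[2, 1, 0],
               [0, 1, 1],
               [3, 0, 0]])
print(document_frequency(tf))  # -> [2 2 1]
```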
(2) Multithreaded feature matrix computation.
This stage takes the frequency matrix and the m_j values as input and outputs the feature matrix, as shown in Fig. 6. The feature matrix is computed by

x_ij = log2(tf_ij + 1.0) * log(m / m_j).

Fig. 6 is a schematic diagram of the multithreaded feature matrix computation. If there are m documents and the feature dimension is n, this formula is executed m*n times. For application scenarios with a large document collection and high text dimensionality, the computation of the feature matrix is very large, so a parallel multithreaded execution method is designed to accelerate this computation. Because the computation of each feature vector is independent, each thread is responsible for computing one feature vector, and multiple threads execute concurrently, improving the computation speed of the feature matrix.
(3) Multithreaded feature matrix normalization.
After the feature matrix computation above, the feature value of every keyword of every document is obtained. Because document lengths differ, the feature vector of a longer document may noticeably suppress the feature vectors of other documents, so the feature vectors are normalized to balance them. The normalization formula is

x_ij = x_ij / sqrt( Σ_{p=1}^{n} x_ip^2 ).

This formula divides each term weight in a document by the norm of the document's corresponding feature vector, equalizing the vectors. For the normalization operation, the present invention designs two kernel functions on the graphics processing unit: one kernel is responsible for computing the value of the vector norm, and the other is responsible for the normalization of the weights.
Fig. 7 is a schematic diagram of the multithreaded vector norm computation. In the figure, a dashed line represents a thread, whose function is to sum the squared elements of one row of the feature matrix, obtaining the squared norm of one document's feature vector. After all threads finish, the squared norm of every document feature vector is available for the subsequent normalization operation. Without multithreaded acceleration on the graphics processing unit, the time complexity of the norm computation is O(m*n), where m is the number of documents and n is the dimension of the document feature vectors; after the improvement, the time complexity of the parallel algorithm on the GPU is O(n).
After the norm of every document's feature vector is obtained, the weight matrix is normalized: each element of the matrix is divided by the norm of the corresponding document feature vector. This kind of regular problem is likewise well suited to acceleration by the multithreading of the graphics processing unit. Fig. 8 is a schematic diagram of the multithreaded normalization; a dashed line in Fig. 8 represents a thread. After all threads finish, the final normalized document feature vectors are obtained for the subsequent clustering operation. Without multithreaded acceleration on the graphics processing unit, the time complexity of this process is O(m*n), where m is the number of documents and n is the dimension of the document feature vectors; after the improvement, the time complexity of the parallel algorithm on the graphics processing unit is O(1).
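The two-kernel structure described above, one kernel for the squared norms and one for the division, can be mimicked with two small functions. This is an illustrative serial sketch (function names and NumPy usage are assumptions, not the patent's kernel code):

```python
import numpy as np

def squared_norms(x):
    """First 'kernel': one thread per row sums the squared elements,
    yielding the squared norm of each document's feature vector."""
    return np.sum(x ** 2, axis=1)

def normalize_rows(x, sq):
    """Second 'kernel': divide every element by its row's norm
    (the square root of the squared norm from the first kernel)."""
    norms = np.sqrt(sq)[:, np.newaxis]
    return x / np.where(norms == 0.0, 1.0, norms)

x = np.array([[3.0, 4.0], [0.0, 2.0]])
sq = squared_norms(x)            # -> [25., 4.]
print(normalize_rows(x, sq))     # rows of unit length
```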
Step 3: parallel SOM text clustering. Design the SOM network structure according to the feature vector matrix and initialize the SOM network; compute in parallel on the graphics processing unit the distances between the input sample and all output-neuron weight vectors; compare these distances to obtain the minimum-distance best neuron; update the weight vectors of best neuron J and of the neurons in its neighborhood, the learning rate, and the neighborhood size; then compute the network error rate E_t in parallel on the graphics processing unit. If the network error rate E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, the SOM network training ends; otherwise a new round of training starts. Each learning step moves the neighborhood of the best-matching neuron closer to the input data vector, so input feature vectors that are close in distance are gathered into the same cluster; the set of clusters formed is the final clustering result.
Specific implementation process is as follows:
The SOM network is the neural network model of simulation human brain, and its most important characteristic is self-organization, the meeting of the neuron in outside input network is adjusted to connection weight automatically, and corresponding neuron is assembled formation to outside response.The SOM network is exactly that this Self-organization of simulating brain cell is realized cluster, identification, sequence, topological invariance mapping etc.Neuron node in the SOM network can be accepted other neuronic input data, also can be to other neuron output data.SOM network after training can form to outside input modes the conceptual schema of oneself.SOM is applicable to processing non-linear, probabilistic data.
The SOM network is an unsupervised learning network composed of two layers: an input layer and an output layer. The input layer computes the distance between the input vector and the weight vector; because this distance reflects the degree of matching, the input layer is also known as the matching layer. The output layer, also called the competition layer, is where the neurons compete according to the degree of matching. The neuron with the best match is called the winning neuron, and the weight vectors of the winning neuron and of the neurons in its neighborhood are updated to move closer to the input vector. Through repeated iterations of competition and updating, a stable network eventually forms in which each neuron holds its corresponding weight vector. The trained network can then be used for operations such as clustering and spatial mapping. The training process of the SOM network is a process of self-organized learning, divided into two parts: screening for the best-matching neuron and updating the network weight vectors. In a typical SOM network the input layer is arranged in one dimension and the output layer neurons in two dimensions; the topology is shown in Figure 9.
The SOM network uses the self-organizing map mode of learning, which belongs to unsupervised learning. Each round of learning moves the neighborhood of the best-matching neuron closer to the input data vector, so input vectors that are close in distance are brought together, forming clusters. After large-scale training of the SOM network, the connection weights between neurons represent the features of the input patterns; merging input vectors of similar type into one class completes the automatic clustering process of the SOM network. The core of the SOM algorithm is the screening of the best-matching neuron and the updating of the weights in its neighborhood. The best-matching neuron is selected according to the distances between all neurons and the input vector: the neuron with the minimum distance is the best match. The self-organized adjustment of connection weights then adjusts the weight value of every neuron in the neighborhood of the best-matching neuron. Each time the SOM network learns, it performs one self-organized adaptation to the input vector, strengthening the mapping of the newly matched pattern and weakening the old one. In the traditional serial SOM clustering algorithm, two steps consume 80% of the algorithm's running time: (1) computing the distances between an input sample and all output neuron weight vectors, comparing the distances, and obtaining the best-matching neuron with the minimum distance; (2) computing the network error rate E_t of two adjacent iterations. The present invention therefore targets these two characteristics and designs a parallel SOM clustering algorithm based on the Graphics Processing Unit. The logic flow of the parallel SOM algorithm is described below:
Step1: suppose there are m input samples, each of dimension n. Design the SOM network structure: the number of input layer neurons is n, the number of output layer neurons is k, the maximum number of training iterations is T, and the target error is ε.
Step2: initialize the SOM network, including the initial inter-neuron connection weight vectors W_0 = (W_{10}, W_{20}, …, W_{k0}), the learning rate α_i(0) ∈ (0,1), the neighborhood size N_i(0), i ∈ {1, 2, …, n}, and the iteration counter t = 1.
Step3: compute in parallel the distance d_j between input sample X_i and every output neuron weight vector W_j, using the formula

d_j = ‖X_i − W_{j,t−1}‖ = √( Σ_{p=1}^{n} (x_{ip} − w_{jp,t−1})² ).  (2-1)
Step4: compare the distances; the neuron with the minimum distance is the best-matching neuron J.
Step5: update the weight vector values of the neurons in the neighborhood N_J(t−1) of the best-matching neuron J:

W_{j,t} = W_{j,t−1} + α_i(t−1)(X_i − W_{j,t−1}).  (2-2)
Step6: update the learning rate α_J(t) and the neighborhood size N_J(t) of the best-matching neuron J. The correction formula for the learning rate is

α_i(t) = α_i(0)(1 − t/T).  (2-3)
If the coordinates of competition-layer neuron g in the two-dimensional array are (X_g, Y_g), the neighborhood is a square region whose upper-right vertex is (X_g + N_J(t), Y_g + N_J(t)) and whose lower-left vertex is (X_g − N_J(t), Y_g − N_J(t)). The correction formula for the neighborhood is

N_J(t) = INT[N_J(0)·(1 − t/T)],  (2-4)

where INT[X] denotes rounding X to an integer.
Step7: compute in parallel the neuron error between two adjacent iterations using the formula

E_t = ‖W_t − W_{t−1}‖ = Σ_{i=1}^{k} ‖W_{i,t} − W_{i,t−1}‖.  (2-5)
Step8: if E_t <= ε or t >= T, the SOM network has converged to the expected error rate or reached the maximum number of iterations, and SOM network training ends; otherwise return to Step3 for a new round of training. Each round of learning moves the neuron weights in the neighborhood of the best-matching neuron closer to the input data vector, so feature vectors that are close in distance are brought together, forming text clusters.
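The Step1–Step8 loop can be sketched serially in Python. This is a simplified illustration under our own assumptions, not the patent's CUDA code: the neighborhood update is reduced to the winning neuron alone, the helper names (`train_som`, `neighborhood_size`) are ours, and the distance and error computations that the patent runs in parallel appear here as plain loops:

```python
import math
import random

def neighborhood_size(n0, t, T):
    # N_J(t) = INT[N_J(0) * (1 - t/T)], formula (2-4)
    return int(n0 * (1.0 - t / T))

def train_som(samples, k, T, eps, alpha0=0.5, seed=0):
    """Serial sketch of the Step1-Step8 training loop.

    samples: m input vectors of dimension n; k: output neurons;
    T: maximum iterations; eps: target error. For simplicity the
    neighborhood is shrunk to the winning neuron itself.
    """
    rng = random.Random(seed)
    n = len(samples[0])
    # Step2: initialize the weight vectors W_0
    w = [[rng.random() for _ in range(n)] for _ in range(k)]
    for t in range(1, T + 1):
        w_prev = [row[:] for row in w]
        alpha = alpha0 * (1.0 - (t - 1) / T)          # formula (2-3)
        for x in samples:
            # Step3/Step4: squared distances and best neuron J
            d = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in w]
            j = d.index(min(d))
            # Step5: pull the winner toward the input, formula (2-2)
            w[j] = [b + alpha * (a - b) for a, b in zip(x, w[j])]
        # Step7: error E_t between adjacent iterations, formula (2-5)
        e = sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
                for r1, r2 in zip(w, w_prev))
        if e <= eps:                                   # Step8: converged
            break
    return w
```

After training, each weight vector sits near the centroid of the inputs it wins, which is the clustering behavior the patent exploits.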
As shown in Figure 10 and Figure 11, the specific embodiment of the present invention builds a self-organizing map neural network clustering system based on the Graphics Processing Unit, comprising a hardware part and a software part. Hardware part: a CPU/GPU cooperative framework is adopted, in which serially executed code runs on the CPU, code executed in parallel runs on the GPU, and data are exchanged between video memory and main memory through the data transfer mechanism provided by the GPU. The software part is divided into three modules: a parallelized keyword word frequency statistics module, a parallelized feature vector computation module, and a parallelized SOM clustering module. The parallelized keyword word frequency statistics module segments the text content into words to obtain the set of keywords, counts the frequency of each keyword in the documents in parallel, and obtains a frequency matrix. The parallelized feature vector computation module converts the keyword frequency matrix into the corresponding feature vector matrix, in which each feature vector represents one document. The parallelized SOM clustering module designs the SOM network structure according to the feature vector matrix, initializes the SOM network, computes in parallel the distances between an input sample and all output neuron weight vectors, compares the distances, and obtains the best-matching neuron J with the minimum distance; it updates the weight vector values of the neurons in the neighborhood of J, the learning rate, and the neighborhood size of J, and then computes the network error rate E_t in parallel on the Graphics Processing Unit. If E_t <= the target error ε or the iteration count t >= the maximum number of training iterations T, SOM network training ends; otherwise a new round of training is started. Each round of learning moves the neighborhood of the best-matching neuron closer to the input data vector, so feature vectors that are close in distance are gathered into the same cluster, and the resulting set of clusters is the final clustering result.
The concrete clustering process is as follows. The parallel self-organizing map neural network clustering system based on the Graphics Processing Unit of the present invention adopts a CPU/GPU framework design; Figure 10 shows the hardware framework of the system. The CPU controls scheduling, assigns tasks to the Graphics Processing Unit, and prepares the running environment for it; in the environment readied by the CPU, the Graphics Processing Unit executes the computing tasks in parallel. Figure 11 shows the software cooperation framework of the SOM clustering. The system uses the Compute Unified Device Architecture ("CUDA") programming platform to apply the SOM algorithm, with acceleration, to the text data clustering process.
In the design based on the CPU/GPU cooperative framework, the cooperative tasks of the CPU and the GPU are distributed reasonably and the framework is designed accordingly, taking full advantage of the respective strengths of the CPU and the GPU to accelerate the algorithm. The system divides its tasks into two parts for distribution: tasks that clearly run better on the CPU, and tasks that clearly run better on the Graphics Processing Unit. Tasks suited to running on the CPU mainly include: initialization of the SOM network, data I/O operations, control of the algorithm's logic flow, and invocation of the kernel functions. Tasks suited to running on the Graphics Processing Unit are mainly data computation tasks, including: parallel word frequency statistics, parallel feature vector computation, the distance computation between input samples and neurons, and the computation of the network weight error.
On the software side, the algorithm is accelerated mainly by designing kernel functions for each module. In the parallel word frequency statistics module, the system designs one kernel function that allocates one thread per document on the Graphics Processing Unit, opening m threads in total, where m is the number of documents, and counts the frequency with which each keyword appears in the document. The computation flow of this kernel function is shown in Figure 12.
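A minimal sketch of the per-thread counting logic, written as our own Python stand-in (the patent's version is a CUDA kernel with one thread per document, and we assume each document has already been segmented into keyword tokens):

```python
def word_frequencies(documents, keywords):
    """Build the m x |keywords| frequency matrix.

    documents: m token lists produced by word segmentation.
    Each outer-loop iteration corresponds to one GPU thread,
    so all m rows could be computed concurrently.
    """
    return [[doc.count(kw) for kw in keywords] for doc in documents]
```

Each row is independent of the others, which is why the patent can assign one thread per document without synchronization between threads.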
In the parallel feature vector computation module, the system designs three kernel functions for the three most time-consuming parts of the algorithm: (1) a kernel function that computes the number of documents containing each keyword, as in Fig. 4; n threads are opened for this kernel function, where n is the number of keywords, and its computation flow is shown in Figure 13; (2) a kernel function that computes the document feature vectors, as in Fig. 6; m*n threads are opened for this kernel function, and its computation formula is

x_ij = log2(tf_ij + 1.0) * log(m/m_j).
(3) In practice the lengths of individual documents may differ greatly; to overcome this problem, two kernel functions are designed in this module to normalize the document feature vectors: a kernel function that computes the norm of each feature vector, as in Fig. 7, for which m threads are opened; and a kernel function that normalizes the feature vectors, as in Fig. 8, for which m*n threads are opened.
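The weighting formula in part (2) above can be checked with a small Python helper (`feature_weight` is our own name; on the GPU, one of the m*n threads would compute one such matrix entry):

```python
import math

def feature_weight(tf_ij, m, m_j):
    """x_ij = log2(tf_ij + 1.0) * log(m / m_j).

    tf_ij: frequency of keyword j in document i;
    m: total number of documents;
    m_j: number of documents containing keyword j.
    """
    return math.log2(tf_ij + 1.0) * math.log(m / m_j)
```

Note that a keyword occurring in every document (m_j = m) gets weight zero, the usual TF-IDF behavior: such a keyword carries no discriminative information.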
The concrete flow of the parallel SOM clustering module is shown in Figure 11. Analysis of the execution logic of the serial SOM algorithm shows that it has two large-scale computation parts: one part is the module in which, after the SOM network receives a new input vector, it computes the distance from the input vector to each output neuron and selects the neuron with the minimum distance as the best-matching neuron; the other part is the module that computes the network error between two adjacent iterations of the neurons and compares it with the acceptable error rate preset by the algorithm. Both parts are large-scale floating-point computations at which the CPU is not adept; if they are executed serially, the time they consume accounts for more than 80% of the algorithm's running time. Therefore, in module three, three kernel functions are designed for these two parts to accelerate the algorithm's execution and improve its efficiency. For the two parts above, the system designs two submodules that run on the Graphics Processing Unit.
(1) The submodule that finds the best-matching neuron by parallel computation. The best-matching neuron is the neuron closest to the input pattern vector, so when the algorithm receives an input pattern vector it must compute the distance between this vector and every neuron in the network. The computation formula for the distance between a neuron and the input vector is

d_j = ‖X_i − W_{j,t−1}‖ = √( Σ_{p=1}^{n} (x_{ip} − w_{jp,t−1})² ).
Since every distance would require a square-root operation during the computation, and the square root does not change the ordering of the distances, the formula can be simplified to

d_j = Σ_{p=1}^{n} (x_{ip} − w_{jp,t−1})².
The distance formula itself is simple to compute, but when the SOM network contains many neurons, the input samples have high dimension, and the number of input samples is large, the amount of computation is enormous. When the sample dimension is n, the number of neurons is k, and the algorithm runs for T iterations, the difference operation in this formula is performed k*n*T times, so converting this formula to parallel execution is very effective for improving program performance. In this submodule the system designs one kernel function that computes the distance between the input feature vector and every neuron in the network, making full use of the parallel computing capability of the Graphics Processing Unit and accelerating the execution of the algorithm.
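An illustrative Python stand-in for this kernel (the function names are ours; each element of the returned list corresponds to the work of one of the k GPU threads):

```python
def neuron_distances(x, weights):
    """Squared distance from input vector x to every neuron weight row.

    The square root is omitted, as in the simplified formula above,
    because it does not change which distance is smallest.
    """
    return [sum((xi - wi) ** 2 for xi, wi in zip(x, row)) for row in weights]

def best_neuron(x, weights):
    # Step4: index J of the neuron with the minimum distance
    d = neuron_distances(x, weights)
    return d.index(min(d))
```

The k distance computations share no state, so on the GPU they can all run at once; only the final minimum search needs a comparison across threads (or a small serial pass on the CPU).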
As shown in Figure 14, to facilitate the parallelized redesign of the serial program, the distance computation is described below in matrix terms. Define the distance operation between the neuron weight matrix and the input pattern vector, where the neuron weight matrix W is a k*n matrix, k being the number of output neurons and n the dimension of the input pattern vector; the input pattern vector x_k is a vector of dimension n; and the result of the distance operation, d_j, is a vector of dimension k. To accelerate the algorithm, the system can start k threads computing in parallel at the same time; the computation flow of this kernel function is shown in Figure 15.
(2) The submodule that computes in parallel the error of the network weights between two adjacent rounds. The SOM network index E_t represents the difference between the network's current weight values and the weight values at the end of the previous round of training. If the value of E_t is less than the error rate defined before algorithm initialization, then the weight value of each neuron no longer changes substantially during iterative training, the network is considered to have converged to a certain state, and network training is complete. The formula for judging whether the current SOM network has been trained to within the allowed error is

E_t = ‖W_t − W_{t−1}‖ = Σ_{i=1}^{k} ‖W_{i,t} − W_{i,t−1}‖.
Observation and analysis of the above computation formula show that it is a variant of the difference operation between two matrices. This large-scale matrix operation executes far more slowly serially on the CPU than in parallel on the Graphics Processing Unit, so in this submodule the computation of the network error is designed as a kernel function running on the Graphics Processing Unit to accelerate the algorithm. At the same time, to speed up the computation of the network error, the system computes it in parallel by rows or by columns; therefore a further kernel function is designed in this submodule to compute the network error in parallel.
As shown in Figure 16, the submodule that computes the SOM network error rate is divided into two steps of parallel computation on the Graphics Processing Unit: the first step is the difference operation between the two matrices, i.e. computing the difference matrix C = W_t − W_{t−1}; the second step is taking the absolute value of each element of the matrix and performing the reduction. Both steps are accelerated on the Graphics Processing Unit. A kernel function can be designed to compute this matrix difference; the system opens one thread per matrix element for this kernel function.
The difference operation above is usually not the final result we want; we must further take the absolute values of the elements of each row of the matrix C and sum them, obtaining a vector of dimension n. Finally, the values in this vector are accumulated by a serial operation on the CPU, and the actual amount of change of the final SOM network is compared with the error rate: if it is greater than the error rate, the next round of training continues; if it is less than the error rate, training stops, proving that the network has converged to a certain degree. Summing the matrix requires two steps: first sum by rows or columns, then sum the result of the first step again; the computation process is shown in Figure 17. The system can design a kernel function to compute the first step in parallel, opening n threads for this kernel function at the same time; the computation flow of this kernel function is shown in Figure 18.
Note that the summation of matrix element values in Figure 17, which produces an n*1-dimensional vector, need not be reduced by rows; it can also be reduced by columns, depending on the relative magnitudes of the row and column counts in the practical application. If the number of rows of the matrix is far greater than the number of columns, the reduction should be done by rows, because row-wise reduction can open more parallel threads at once. Conversely, if the number of columns is far greater than the number of rows, the reduction should be done by columns.
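The two-stage reduction can be sketched in Python (the helper name is our own; stage 1 corresponds to the per-row GPU threads, stage 2 to the final serial accumulation on the CPU). For simplicity this sketch reduces by rows and uses the per-neuron Euclidean norm of formula (2-5):

```python
import math

def network_error(w_t, w_prev):
    """E_t = sum over neurons i of ||W_{i,t} - W_{i,t-1}||."""
    # Stage 1: one partial result per row (one GPU thread per neuron)
    partial = [math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
               for r1, r2 in zip(w_t, w_prev)]
    # Stage 2: serial accumulation of the k partial results
    return sum(partial)
```

Splitting the reduction this way keeps the parallel stage free of inter-thread dependencies and leaves only a short k-element sum for the CPU.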
Several corresponding data structures must be maintained for the above computation process: the two-dimensional matrix W storing the SOM network neuron weight values, the training sample matrix X, and the distance vector D. Because the value of E_t must be computed, the two-dimensional matrix of the previous round's SOM network neuron weight values must also be preserved.
The parallel self-organizing map neural network clustering method and system based on the Graphics Processing Unit of the present invention design a parallelized SOM text clustering algorithm. At the same time, exploiting the complementary computing capabilities of the Graphics Processing Unit (GPU) and the central processing unit (CPU), the present invention designs a parallelized text clustering system based on the CPU/GPU cooperative framework. The hardware is designed as a CPU/GPU cooperative framework, and the software is divided into three modules: a parallel word frequency statistics module, a parallel feature vector computation module, and a parallel SOM algorithm clustering module. The self-organizing map neural network text clustering based on the Graphics Processing Unit of the present invention makes full use of the high concurrency of the graphics processing device, effectively improves the clustering speed of the algorithm, and is well suited to clustering problems on high-dimensional text data.
The above content is a further detailed description of the present invention in combination with concrete preferred embodiments, and it cannot be concluded that the concrete implementation of the present invention is limited to these descriptions. For those of ordinary skill in the technical field of the present invention, several simple deductions or substitutions can also be made without departing from the concept of the present invention, and all of these should be considered as falling within the protection scope of the present invention.

Claims (9)

1. A parallelized self-organizing map neural network clustering method based on a Graphics Processing Unit, comprising the steps of:
parallel keyword word frequency statistics: segmenting text content into words to obtain a set of keywords, counting the frequency of the keywords in the documents in parallel, and obtaining a frequency matrix;
parallel feature vector computation: converting the keyword frequency matrix into a corresponding feature vector matrix, each feature vector representing one document;
parallel SOM clustering: designing an SOM network structure according to the feature vector matrix, initializing the SOM network, computing in parallel the distances between an input sample and all output neuron weight vectors, comparing the distances, and obtaining the best-matching neuron J with the minimum distance; updating the weight vector values of the neurons in the neighborhood of J, the learning rate, and the neighborhood size of J; then computing the network error rate E_t in parallel on the Graphics Processing Unit; if E_t <= the target error ε or the iteration count t >= the maximum number of training iterations T, ending the SOM network training, and otherwise starting a new round of training; each round of learning moving the neighborhood of the best-matching neuron closer to the input data vector, so that feature vectors close in distance are gathered into the same cluster, the resulting set of clusters being the final clustering result.
2. The self-organizing map neural network clustering method based on a Graphics Processing Unit according to claim 1, characterized in that, in the step of obtaining the keyword word frequencies of the documents, multithreaded parallel word frequency statistics based on the Graphics Processing Unit are adopted.
3. The self-organizing map neural network clustering method based on a Graphics Processing Unit according to claim 1, characterized in that, in the parallel feature vector computation step, the feature vector of each document is computed by multiple threads in parallel based on the Graphics Processing Unit.
4. The self-organizing map neural network clustering method based on a Graphics Processing Unit according to claim 1, characterized in that the computations of the distances between the input feature vector and the individual output neuron weight vectors are mutually independent; multiple threads based on the Graphics Processing Unit are adopted to compute these distances in parallel, the system opening one thread for each neuron and computing with multiple threads in parallel.
5. The self-organizing map neural network clustering method based on a Graphics Processing Unit according to claim 1, characterized in that the computations of the weight vector errors of the individual neurons between two adjacent iterations are mutually independent; multiple threads based on the Graphics Processing Unit are adopted to compute each neuron's weight vector error in parallel, the system opening one thread for each neuron and computing with multiple threads in parallel.
6. A self-organizing map neural network clustering system based on a Graphics Processing Unit, characterized in that it comprises a hardware part and a software part. Hardware part: a CPU/GPU cooperative framework is adopted, in which serially executed code runs on the CPU, code executed in parallel runs on the GPU, and data are exchanged between video memory and main memory through the data transfer mechanism provided by the GPU. The software part is divided into three modules: a parallelized keyword word frequency statistics module, a parallelized feature vector computation module, and a parallelized SOM clustering module. The parallelized keyword word frequency statistics module segments the text content into words to obtain a set of keywords, counts the frequency of the keywords in the documents in parallel, and obtains a frequency matrix. The parallelized feature vector computation module converts the keyword frequency matrix into a corresponding feature vector matrix, each feature vector representing one document. The parallelized SOM clustering module designs the SOM network structure according to the feature vector matrix, initializes the SOM network, computes in parallel the distances between an input sample and all output neuron weight vectors, compares the distances, and obtains the best-matching neuron J with the minimum distance; it updates the weight vector values of the neurons in the neighborhood of J, the learning rate, and the neighborhood size of J, and then computes the network error rate E_t in parallel on the Graphics Processing Unit; if E_t <= the target error ε or the iteration count t >= the maximum number of training iterations T, SOM network training ends, and otherwise a new round of training is started; each round of learning moves the neighborhood of the best-matching neuron closer to the input data vector, so that feature vectors close in distance are gathered into the same cluster, and the resulting set of clusters is the final clustering result.
7. The clustering system of the parallelized self-organizing map neural network based on a Graphics Processing Unit according to claim 6, characterized in that several kernel functions are designed in each of the parallelized keyword word frequency statistics module, the parallelized feature vector computation module, and the parallelized SOM clustering module for the purpose of accelerating the algorithm's operation in parallel.
8. The clustering system of the parallelized self-organizing map neural network based on a Graphics Processing Unit according to claim 6, characterized in that one kernel function for keyword word frequency statistics is designed in the parallel keyword word frequency statistics module; and two kernel functions for feature vector computation and two kernel functions for feature vector normalization are designed in the parallel feature vector computation module.
9. The clustering system of the parallelized self-organizing map neural network based on a Graphics Processing Unit according to claim 6, characterized in that, in the parallel SOM clustering module, a kernel function for computing the distances between the input feature vector and the output neurons, a kernel function for computing the error of the network weight vectors between two adjacent iterations of each neuron, and a kernel function for the reduction of the network weight vector error are designed.
CN201310112420.4A 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit Pending CN103488662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310112420.4A CN103488662A (en) 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit


Publications (1)

Publication Number Publication Date
CN103488662A true CN103488662A (en) 2014-01-01




CN108830378A (en) * 2018-06-11 2018-11-16 东北师范大学 SOM neural network configurable module hardware implementation method based on FPGA
CN109102157A (en) * 2018-07-11 2018-12-28 交通银行股份有限公司 A kind of bank's work order worksheet processing method and system based on deep learning
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN109886407B (en) * 2019-02-27 2021-10-22 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN110196911B (en) * 2019-06-06 2022-04-22 申林森 Automatic classification management system for civil data
CN110196911A (en) * 2019-06-06 2019-09-03 申林森 A kind of people's livelihood data automatic classification management system
CN110390100B (en) * 2019-07-16 2023-10-31 广州小鹏汽车科技有限公司 Processing method, first electronic terminal, second electronic terminal and processing system
CN110390100A (en) * 2019-07-16 2019-10-29 广州小鹏汽车科技有限公司 Processing method, the first electric terminal, the second electric terminal and processing system
CN110795619B (en) * 2019-09-18 2022-02-18 贵州开放大学(贵州职业技术学院) Multi-target-fused educational resource personalized recommendation system and method
CN110795619A (en) * 2019-09-18 2020-02-14 贵州广播电视大学(贵州职业技术学院) Multi-target-fused educational resource personalized recommendation system and method
CN111241289A (en) * 2020-01-17 2020-06-05 北京工业大学 SOM algorithm based on graph theory
CN111241289B (en) * 2020-01-17 2022-05-03 北京工业大学 Text clustering method based on graph theory and SOM network
CN113378870A (en) * 2020-03-10 2021-09-10 南京邮电大学 Method and device for predicting radiation source distribution of printed circuit board based on neural network
CN113378870B (en) * 2020-03-10 2022-08-12 南京邮电大学 Method and device for predicting radiation source distribution of printed circuit board based on neural network
CN111405605A (en) * 2020-03-24 2020-07-10 东南大学 Wireless network interruption detection method based on self-organizing mapping
CN111552563A (en) * 2020-04-20 2020-08-18 南昌嘉研科技有限公司 Multithreading data architecture, multithreading message transmission method and system
CN111638707A (en) * 2020-06-07 2020-09-08 南京理工大学 Intermittent process fault monitoring method based on SOM clustering and MPCA
CN111638707B (en) * 2020-06-07 2022-05-20 南京理工大学 Intermittent process fault monitoring method based on SOM clustering and MPCA
CN111949779A (en) * 2020-07-29 2020-11-17 交控科技股份有限公司 Intelligent rail transit response method and system based on knowledge graph
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112347246B (en) * 2020-10-15 2024-04-02 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectrum decomposition
CN112488228A (en) * 2020-12-07 2021-03-12 京科互联科技(山东)有限公司 Bidirectional clustering method for wind control system data completion
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN113420623B (en) * 2021-06-09 2022-07-12 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN116738354A (en) * 2023-08-15 2023-09-12 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal
CN116738354B (en) * 2023-08-15 2023-12-08 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal

Similar Documents

Publication Publication Date Title
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
Zhang et al. TW-Co-k-means: Two-level weighted collaborative k-means for multi-view clustering
Lobato et al. Multi-objective genetic algorithm for missing data imputation
Yin et al. Incomplete multi-view clustering with cosine similarity
CN107480694B (en) Weighting selection integration three-branch clustering method adopting two-time evaluation based on Spark platform
Zhang et al. Non-negative multi-label feature selection with dynamic graph constraints
CN109871860A (en) A kind of daily load curve dimensionality reduction clustering method based on kernel principal component analysis
Mei et al. A fuzzy approach for multitype relational data clustering
Li et al. An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs
CN106874478A (en) Parallelization random tags subset multi-tag file classification method based on Spark
Fan et al. Structured self-attention architecture for graph-level representation learning
CN111091247A (en) Power load prediction method and device based on deep neural network model fusion
Chang et al. Automatic channel pruning via clustering and swarm intelligence optimization for CNN
CN114647465B (en) Single program splitting method and system for multi-channel attention map neural network clustering
Pu et al. A memorizing and generalizing framework for lifelong person re-identification
Wang et al. FP-DARTS: Fast parallel differentiable neural architecture search for image classification
CN115099461A (en) Solar radiation prediction method and system based on double-branch feature extraction
Mu et al. Auto-CASH: A meta-learning embedding approach for autonomous classification algorithm selection
Iranfar et al. Multiagent reinforcement learning for hyperparameter optimization of convolutional neural networks
Shi et al. EBNAS: Efficient binary network design for image classification via neural architecture search
Liu et al. Incomplete multi-view clustering via virtual-label guided matrix factorization
Chen et al. Differentiated graph regularized non-negative matrix factorization for semi-supervised community detection
Yi et al. New feature analysis-based elastic net algorithm with clustering objective function
Mi et al. Fast multi-view subspace clustering with balance anchors guidance
Xu et al. Dilated convolution capsule network for apple leaf disease identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140101