CN110188320A - Second-order blind source separation parallel optimization method and system based on a multi-core platform - Google Patents


Info

Publication number
CN110188320A
Authority
CN
China
Prior art keywords
matrix
blind source
core platform
order blind
source separating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910329707.XA
Other languages
Chinese (zh)
Inventor
刘卫国 (Liu Weiguo)
刘美洋 (Liu Meiyang)
殷泽坤 (Yin Zekun)
徐晓明 (Xu Xiaoming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201910329707.XA priority Critical patent/CN110188320A/en
Publication of CN110188320A publication Critical patent/CN110188320A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 — Complex mathematical operations
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The invention discloses a second-order blind source separation parallel optimization method and system based on a multi-core platform, comprising the following steps: receiving environment-variable parameters and setting CPU thread-core affinity; receiving the signal to be processed and performing multi-threaded parallel preprocessing on it; merging multiple parallelizable computation regions and performing joint approximate diagonalization; and outputting the separation matrix and the source matrix. By exploiting the characteristics of a multi-core platform, the invention greatly accelerates the processing speed of second-order blind source separation.

Description

Second-order blind source separation parallel optimization method and system based on a multi-core platform
Technical field
The invention belongs to the field of signal processing technology, and in particular relates to a second-order blind source separation parallel optimization method and system based on a multi-core platform.
Background art
In fields such as neural signal processing and statistical analysis, the collected observation data are often mixed with errors; many of these errors are machine errors and are difficult to avoid. A fundamental problem is therefore how to find, by appropriate methods, a suitable representation of the source data underlying the observations. Blind source separation (BSS) is the process of recovering the original signals, which cannot be observed directly, from several observed mixed signals. The second-order blind identification algorithm (SOBI) is based on the principle of time-delayed cross-correlation matrices: it performs joint approximate diagonalization on a batch of covariance matrices to achieve blind source separation, and is a robust blind source separation method. SOBI uses simple second-order statistics and can estimate the source signal components from a relatively small number of data points. It does not need to assume that the source signals follow a Gaussian distribution, thereby avoiding any judgment of the Gaussian characteristics of the source probability density functions, and it can separate multiple Gaussian noise sources. It is currently one of the mainstream blind source separation algorithms.
The second-order blind source separation algorithm (SOBI) has a simple computation process and good separation performance, and is widely used in fields such as biomedical signal processing, array signal processing, speech signal recognition, image processing, and mobile communication. However, the inventors found in practice that existing SOBI implementations run rather slowly; their speed must be improved before the algorithm can be widely applied offline and support real-time neural signal processing and feedback.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a second-order blind source separation parallel optimization method and system based on a multi-core platform, which accelerates the execution of second-order blind source separation by exploiting the parallel processing capability, math kernel libraries, and instruction sets of a multi-core platform.
To achieve the above object, one or more embodiments of the invention provide the following technical solution:
A second-order blind source separation parallel optimization method based on a multi-core platform, comprising the following steps:
receiving environment-variable parameters and setting CPU thread-core affinity;
receiving the signal to be processed and performing multi-threaded parallel preprocessing on it;
merging multiple parallelizable computation regions and performing joint approximate diagonalization;
outputting the separation matrix and the source matrix.
One or more embodiments provide a second-order blind source separation parallel optimization system based on a multi-core platform, comprising:
a CPU affinity configuration module, for receiving environment-variable parameters and setting CPU thread-core affinity;
a data reception module, for receiving the signal to be processed;
a data preprocessing module, for performing multi-threaded parallel preprocessing on the signal to be processed;
a diagonalization module, for merging multiple parallelizable computation regions and performing joint approximate diagonalization;
a processing result output module, for outputting the separation matrix and the source matrix.
One or more embodiments provide a computing device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the above second-order blind source separation parallel optimization method based on a multi-core platform.
One or more embodiments provide a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above second-order blind source separation parallel optimization method based on a multi-core platform.
The above one or more technical solutions have the following beneficial effects:
Based on a multi-core platform, the present invention first sets thread-core affinity through environment variables so that, at run time, each CPU accesses only its directly attached memory; memory access time is thus greatly reduced, ensuring a performance gain for the entire second-order blind source separation process. The data preprocessing stage is accelerated through the parallel processing features of the multi-core platform. In the joint approximate diagonalization stage, because the regions that can be accelerated with multiple threads are extremely scattered, multiple parallel regions are merged to achieve acceleration, thereby greatly improving the operational efficiency of second-order blind source separation.
Brief description of the drawings
The accompanying drawings, which constitute a part of the invention, are provided for further understanding of the invention; the exemplary embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
Fig. 1 is a flow chart of a conventional second-order blind source separation method;
Fig. 2 is an overall flow chart of the second-order blind source separation parallel optimization method based on a multi-core platform in one or more embodiments of the invention;
Fig. 3 and Fig. 4 are, respectively, a run-time diagram and a speedup diagram of each stage when the optimization method of one or more embodiments of the invention is applied.
Detailed description of the embodiments
It should be pointed out that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the art to which the invention belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the exemplary embodiments of the invention. As used herein, unless the context clearly indicates otherwise, singular forms are also intended to include plural forms. Furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Provided there is no conflict, the embodiments of the invention and the features of the embodiments may be combined with each other.
Intel multi-core processors have multiple cores, providing higher computing power and multithreading support; combined with their high memory bandwidth, they provide higher data transfer capability. Through instructions such as fused multiply-add (FMA) in the AVX instruction set supported by Intel, single instruction, multiple data (SIMD) execution can be realized, exploiting the peak hardware performance of Intel CPUs. These characteristics can accelerate the SOBI algorithm well, resolve the run-time bottleneck of second-order blind source separation, and make it better suited to real-time signal processing systems.
Embodiment one
This embodiment discloses a second-order blind source separation parallel optimization method based on a multi-core platform, comprising the following steps:
Step 1: receive environment-variable parameters and set CPU thread-core affinity.
Intel multi-core processors use a NUMA architecture: although memory is attached directly to the CPUs, it is divided evenly among the CPU dies. Only when a CPU accesses physical addresses in its own directly attached memory does it get the shortest response time; when it must access memory attached to another CPU, the access goes through an interconnect channel and the response time becomes noticeably slower. To avoid this situation, we set thread-core affinity through environment variables at program start, so that each CPU accesses only its directly attached memory at run time; the memory access time is thus greatly reduced and program performance improves considerably.
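In practice the affinity described above is usually set from outside the program via runtime environment variables such as OMP_PROC_BIND/OMP_PLACES (or KMP_AFFINITY for Intel's OpenMP runtime). As a minimal illustrative sketch — not the patent's implementation — the same pinning can be done programmatically on Linux; the helper name `pin_to_first_core` is invented here:

```python
import os

def pin_to_first_core():
    """Pin the current process to its lowest-numbered allowed CPU.

    Returns the resulting affinity set, or None on platforms (e.g. macOS)
    where the Linux sched_*affinity calls are unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs the process may currently run on
    target = {min(allowed)}             # keep only the first one
    os.sched_setaffinity(0, target)     # 0 = the calling process
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print(pin_to_first_core())
```

For an OpenMP program such as the one in this embodiment, the environment-variable route is preferable, since the runtime then places every worker thread consistently.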
Step 2: receive the signal to be processed and, with the help of a math kernel library, perform multi-threaded parallel preprocessing on it.
The data preprocessing part performs a large number of matrix operations when computing the delay matrices, whitening the data, and computing the sample covariance matrices, including matrix multiplication, matrix transposition, and computing matrix eigenvalues and eigenvectors; in practical applications the matrices are large. Large-scale matrix operations involve problems such as discontiguous memory access and have always been a classic hot spot in high-performance computing. We accelerate them with OpenMP multithreading and the Intel Math Kernel Library (Intel MKL). The MKL is well adapted to Intel's computing units and achieves the best acceleration; it is an excellent set of high-performance mathematical libraries.
The preprocessing in step 2 specifically includes: computing the delay matrices, whitening the data, and computing the sample covariance matrices.
The specific method of data whitening is as follows:
The observation data X(t) are whitened by formula (1) so that the covariance matrix of Y(t) is the identity matrix, removing the second-order correlation between components; W is the m × n whitening matrix:
Y(t) = W X(t)   (1)
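The patent does not spell out how W is obtained; a common construction (assumed here) takes W = Λ^(−1/2) Eᵀ from the eigendecomposition of the sample covariance of X, which makes the covariance of Y the identity. A small numpy sketch with a square (m × m) W and a made-up mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10000))          # m = 3 channels, T = 10000 samples
X = np.array([[1.0, 0.5, 0.0],
              [0.2, 1.0, 0.3],
              [0.0, 0.4, 1.0]]) @ X          # mix the channels

# Whitening matrix from the eigendecomposition of the sample covariance of X
C = np.cov(X)                                # m x m sample covariance
eigvals, eigvecs = np.linalg.eigh(C)
W = np.diag(eigvals ** -0.5) @ eigvecs.T     # W = Lambda^{-1/2} E^T
Y = W @ X                                    # whitened data, cov(Y) ~= I
print(np.allclose(np.cov(Y), np.eye(3), atol=1e-8))
```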
The specific method of computing the sample covariance matrices is as follows:
For a set of fixed delays τ ∈ {τ_j | j = 1, 2, …, k}, compute the sample covariance matrices of the whitened data:
R(τ) = E[Y(t + τ) Y^T(t)] = A R_Y(τ) A^T   (2)
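A sketch of estimating the delayed sample covariance matrices of formula (2) for a batch of lags, assuming zero-mean whitened data; the helper name and lag values are illustrative (implementations often also symmetrize each estimate as 0.5·(R + Rᵀ) before joint diagonalization):

```python
import numpy as np

def delayed_covariance(Y, tau):
    """Sample estimate of R(tau) = E[Y(t+tau) Y(t)^T] for zero-mean data Y (m x T)."""
    T = Y.shape[1]
    return Y[:, tau:] @ Y[:, :T - tau].T / (T - tau)

rng = np.random.default_rng(1)
Y = rng.standard_normal((3, 5000))              # stand-in whitened data
taus = [1, 2, 5, 10]                            # illustrative lag set
Rs = [delayed_covariance(Y, t) for t in taus]   # the batch fed to joint diagonalization
print(Rs[0].shape)
```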
Step 3: merge multiple parallelizable computation regions and perform joint approximate diagonalization.
During joint approximate diagonalization, thread-level parallelism and instruction-set parallelism are used together, and memory access is optimized, to raise the algorithm's execution speed. Joint approximate diagonalization is the most time-consuming part of the SOBI algorithm; it involves Givens rotations, computing eigenvalues and eigenvectors of small matrices, and small matrix-matrix multiplications. More importantly, it is an iterative process and takes very long, accounting for more than 90% of the total run time. For joint approximate diagonalization we mainly take the following measures:
Thread-level parallel acceleration is performed with OpenMP, and multiple parallel regions are merged to reduce multithreading overhead. As is well known, a single thread can only process computing tasks sequentially, finishing one before starting the next, whereas multiple threads can process tasks concurrently — in simple terms, completing several computing tasks at the same time. Nevertheless, a multi-threaded parallel programming model does not necessarily save time, because starting threads, synchronizing data, and tearing threads down all incur extra time overhead; parallel computation only yields acceleration when the computation time it saves exceeds this intrinsic overhead. During joint approximate diagonalization, the regions that can be accelerated with multiple threads are extremely scattered; naively applying multithreading would require frequent synchronization, so the gains of parallel computation would largely be cancelled by the overhead and the acceleration would be insignificant. Therefore, by adjusting the algorithm code, we merge the original 3 scattered parallelizable regions into 1 parallel region, reducing the multithreading overhead to 1/3 of the original and reducing the number of data synchronizations between threads, so that multi-threaded computation brings a clear speedup. For the multithreaded programming model we use OpenMP.
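OpenMP region merging cannot be shown directly in Python, but the cost structure is analogous to thread-pool churn: below, three scattered "parallel regions" each spin up their own pool, versus one merged region that reuses a single pool, paying the startup/teardown overhead once. The phase names and workload are placeholders, not the patent's code:

```python
from concurrent.futures import ThreadPoolExecutor

items = list(range(8))

# Naive: one thread pool per scattered parallel region (3x setup/teardown cost).
naive = []
for name in ("rotate", "update_M", "update_U"):
    with ThreadPoolExecutor(max_workers=4) as pool:
        naive.append(list(pool.map(lambda x: x * x, items)))

# Merged: a single pool reused across all three phases, paying the overhead once.
with ThreadPoolExecutor(max_workers=4) as pool:
    merged = [list(pool.map(lambda x: x * x, items)) for _ in range(3)]

print(naive == merged)   # same results, less pool churn
```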
The joint approximate diagonalization in step 3 specifically includes two important links: performing Givens rotations and iteratively updating the M and U matrices.
The specific method of the Givens rotation is: compute the Givens rotation matrix using trigonometric functions.
During joint approximate diagonalization, Givens rotations must be performed. The Givens rotation used by the traditional algorithm is realized by computing matrix eigenvalues and eigenvectors: writing the Givens rotation matrix as

G = [ c  −s* ]
    [ s   c  ]

c and s are computed from the principal eigenvector v of a small matrix, with c obtained from its first component and s = 0.5 × (v₂ − j × v₃)/c, where j is the imaginary unit. However, the time cost of computing matrix eigenvalues and eigenvectors is considerable.
Instead, we realize a new Givens rotation using sine and cosine: c = cos α, s = sin α. In actual tests, using the sine/cosine Givens rotation does not affect the overall result of the SOBI algorithm, and the program run time drops significantly, because evaluating a trigonometric function requires only a single instruction and no additional memory access, whereas computing matrix eigenvalues and eigenvectors requires not only many instructions but also additional memory reads and writes.
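A sketch of the trigonometric construction: the rotation matrix comes straight from cos/sin, and choosing α = ½·arctan2(2·m₀₁, m₀₀ − m₁₁) — the standard Jacobi angle, assumed here rather than taken from the patent — annihilates the off-diagonal of a symmetric 2 × 2 block:

```python
import numpy as np

def givens_trig(alpha):
    """Givens rotation matrix built directly from trigonometric functions."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s],
                     [s,  c]])

G = givens_trig(0.3)                  # hypothetical rotation angle
# A Givens rotation is orthogonal: G^T G = I, det G = 1.
print(np.allclose(G.T @ G, np.eye(2)), np.isclose(np.linalg.det(G), 1.0))

# Applied as G^T M G it zeros the off-diagonal of a symmetric 2x2 block
# when alpha = 0.5 * arctan2(2*m01, m00 - m11):
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
alpha = 0.5 * np.arctan2(2 * M[0, 1], M[0, 0] - M[1, 1])
G = givens_trig(alpha)
D = G.T @ M @ G
print(abs(D[0, 1]) < 1e-12)
```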
Iteratively updating the M and U matrices specifically includes: computing the value of s in the above Givens rotation matrix and comparing it with the threshold set by the algorithm; if it is not below the threshold, update the M and U matrices and perform the Givens rotation again; if it is, compute the separation matrix and the source matrix.
The orthogonal matrix U is computed as follows: for all R(τ_j), use the joint approximate diagonalization algorithm to obtain an orthogonal matrix U satisfying formula (3), where {D_j} is a set of diagonal matrices:
U^T R(τ_j) U = D_j   (3)
The separation matrix W is computed as follows: from the above steps we obtain Y(t) = U^T W X(t) and the mixing matrix A = W⁺U. After obtaining the decorrelated source signals Y(t), the unwanted independent source components are removed and the signal is reconstructed as follows:
X_r(t) = W⁺ Y_r(t)   (4)
where X_r(t) is the reconstructed observation signal vector; Y_r(t) is the new independent source matrix obtained after setting the unwanted source components in Y(t) to zero; and W⁺ is the pseudo-inverse of the separation matrix W.
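A minimal sketch of formula (4): one estimated component is zeroed and the observations are rebuilt through the pseudo-inverse W⁺. The separation matrix here is a random square stand-in, not one produced by SOBI:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3))        # stand-in separation matrix (square here)
X = rng.standard_normal((3, 1000))     # observations
Y = W @ X                              # decorrelated source estimates

Yr = Y.copy()
Yr[1, :] = 0.0                         # zero one unwanted component (e.g. an artifact)

W_pinv = np.linalg.pinv(W)             # W+, equals inv(W) for a square full-rank W
Xr = W_pinv @ Yr                       # reconstructed observations, formula (4)
print(Xr.shape)
```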
The mixing source signal matrix A can then be computed from the separation matrix W.
Besides using OpenMP for thread-level parallel acceleration, we also make full use of the vector registers and the AVX instruction set on Intel multi-core processors to realize instruction-level and data-level parallel computation. In the traditional computing mode, a single instruction fetches one number at a time and operates on it; this mode is called single instruction, single data (SISD). However, the vast majority of current processors support single instruction, multiple data (SIMD), in which a single instruction fetches several numbers at a time and operates on them simultaneously. Intel multi-core platforms currently support vector registers of up to 512 bits, and Intel provides the AVX instruction set to make full use of them. Through the vector registers and the corresponding AVX instructions, SIMD can be realized: with 512-bit vector registers we can operate on 16 single-precision or 8 double-precision numbers at once, achieving instruction-level and data-level parallelism. For example, the addition, subtraction, multiplication, and division involved in the second-order blind source separation algorithm (SOBI) are all accelerated in parallel using equivalent AVX instructions. In addition, we use the fused multiply-add instructions (FMA) in the AVX instruction set, which compute ±(a × b) ± c in a single instruction. Traditionally, ±(a × b) ± c requires at least two computing instructions, a multiplication and an addition (or subtraction); with FMA, the two operations are combined into one, so the run time can be reduced by as much as half and the floating-point operations per second (FLOPS) doubled. Moreover, since an Intel multi-core processor is equipped with 2 FMA units, the peak FLOPS doubles again. Furthermore, because the intermediate result a × b is not rounded, FMA is more accurate than separate multiply (MUL) and add (ADD) instructions. FMA can improve the performance and accuracy of many floating-point computations, such as matrix multiplication.
In C/C++, two-dimensional arrays are stored in memory in row-major order, so accessing a two-dimensional array row by row yields contiguous memory accesses. Contiguous memory access makes full use of the memory hierarchy of modern computers, exploiting the cache to improve program performance: the time a processor spends accessing the cache is only about one hundredth of a memory access. Conversely, when a two-dimensional array is accessed column by column, almost nothing in the cache is hit, severe cache misses occur, and the program run time increases greatly. In the original algorithm, updating M and U during joint approximate diagonalization accesses two-dimensional arrays by column, causing a large number of discontiguous memory accesses and poor program locality. Therefore, in this embodiment the matrices to be accessed are transposed at the beginning of joint approximate diagonalization, realizing contiguous memory access, creating good program locality, and giving full play to the performance of the modern CPU-cache-memory architecture.
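The row-major argument can be made concrete with numpy, whose default array order matches C: the strides show that a column step jumps a whole row, and that transposing (then compacting) turns the former column walk into a contiguous one:

```python
import numpy as np

M = np.arange(12, dtype=np.float64).reshape(3, 4)   # C (row-major) order by default

# In row-major layout, stepping along a row moves 8 bytes (one float64),
# while stepping down a column jumps a whole row (4 * 8 = 32 bytes).
print(M.strides)            # (32, 8)

# Transposing and compacting makes the former column walk contiguous --
# the trick the text applies before updating M and U.
Mt = np.ascontiguousarray(M.T)
print(Mt.strides)           # (24, 8): rows of the transpose are now contiguous
```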
In C/C++, if an array variable is updated infrequently, every use of an array element still causes the processor to access memory directly, whereas a floating-point scalar (float/double) can be accessed through a register, avoiding memory access; the time a processor spends accessing a register is only about one thousandth of a memory access. Replacing floating-point arrays with floating-point scalar variables can therefore significantly reduce processor memory accesses. During joint approximate diagonalization, when M and U are updated, array variables are often used as temporaries to store intermediate results, and these array temporaries are updated infrequently. Operations on such array variables actually read and write memory addresses directly, bringing a huge memory read/write overhead, which is very time-consuming compared with register access. We therefore replace a series of array temporaries with floating-point scalar temporaries that can be kept in registers; during updates no extra memory reads or writes are needed, only register accesses, which greatly improves program performance.
Updating M and U during joint approximate diagonalization involves matrix multiplication, but the matrices at this point are relatively small. If the Intel Math Kernel Library (MKL) were used, as in the whitening-matrix computation of step 1, the actual acceleration would be insignificant. Our tests show that Intel MKL performs poorly on small matrix multiplications, because calling MKL incurs function-call overhead; a hand-written small-scale matrix multiplication is faster.
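The patent's hand-written kernel is C code; the triple loop below is a direct Python transcription of such a kernel, checked against numpy. In C the scalar accumulator would live in a register and the call would avoid MKL's dispatch overhead — here it only illustrates the structure:

```python
import numpy as np

def small_matmul(A, B):
    """Naive triple-loop matrix multiply for small operands.

    Sketches the hand-written kernel preferred over an MKL call for tiny
    matrices; in C this shape avoids library-call overhead.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0                    # scalar accumulator (register-friendly in C)
            for p in range(k):
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
print(np.allclose(small_matmul(A, B), A @ B))   # → True
```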
Step 4: output the separation matrix and the source matrix.
Those skilled in the art will understand that pthreads could be used to manage threads instead of the OpenMP used here, achieving the same effect.
This embodiment uses the Intel® Xeon™ processor E5 product family (codename "Haswell EP"), a dual-socket platform based on Intel's latest microarchitecture at the time. The product has 18 cores, a 3.6 GHz clock frequency, a 55 MB cache, and 76.8 GB/s of memory bandwidth, and supports the AVX instruction set and vector registers in hardware, which markedly improves application performance. Starting from the most popular Matlab implementation of SOBI, each optimization measure was applied in turn to the optimized code, and the run time and speedup were measured, as shown in Figs. 3-4. As the figures show, with all of the above optimization measures applied, SOBI speeds up dramatically: the program goes from an initial 180 s to a final 4.5 s, a 39× speedup.
The blind source separation parallel acceleration method described in this embodiment can be applied in fields such as biomedical signal processing, array signal processing, speech signal recognition, image processing, and mobile communication.
In biomedical signal processing, the second-order blind source separation algorithm (SOBI) can quickly and effectively remove artifact signals, laying a foundation for the online processing of electroencephalography (EEG) signals in brain-computer interfaces (BCI). The algorithm is also used in the extraction and independent component analysis of body-surface mapping signals for atrial fibrillation arrhythmia.
In power systems, nonlinear equipment connected to the grid degrades power quality indices, affecting the use of high-precision electronic equipment. A primary operation in power quality governance is assigning responsibility for harmonic pollution. The second-order blind source separation algorithm (SOBI) separates the source signals through second-order statistics, exploiting the independence between sources.
In image analysis, fault-signal extraction, condition monitoring, and similar research, the second-order blind source separation algorithm is often adopted; under different system dampings and different signal-to-noise ratios it exhibits high robustness and high identification accuracy.
The second-order blind source separation algorithm is also a core algorithm of sound source identification in noise diagnosis, and can be used in research on sound source identification methods in the noise diagnosis of mechanical equipment.
Embodiment two
Based on the method of embodiment one, this embodiment aims to provide a second-order blind source separation parallel optimization system based on a multi-core platform, comprising:
a CPU affinity configuration module, for receiving environment-variable parameters and setting CPU thread-core affinity;
a data reception module, for receiving the signal to be processed;
a data preprocessing module, for performing multi-threaded parallel preprocessing on the signal to be processed;
a diagonalization module, for merging multiple parallelizable computation regions and performing joint approximate diagonalization;
a processing result output module, for outputting the separation matrix and the source matrix.
Embodiment three
This embodiment aims to provide a computing device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements:
receiving environment-variable parameters and setting CPU thread-core affinity;
receiving the signal to be processed and performing multi-threaded parallel preprocessing on it;
merging multiple parallelizable computation regions and performing joint approximate diagonalization;
outputting the separation matrix and the source matrix.
Example IV
This embodiment aims to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements:
receiving environment-variable parameters and setting CPU thread-core affinity;
receiving the signal to be processed and performing multi-threaded parallel preprocessing on it;
merging multiple parallelizable computation regions and performing joint approximate diagonalization;
outputting the separation matrix and the source matrix.
The steps involved in the devices of embodiments two, three, and four correspond to those of method embodiment one; for specific implementations, refer to the relevant description of embodiment one. The term "computer-readable storage medium" should be understood to include a single medium or multiple media storing one or more instruction sets; it should also be understood to include any medium capable of storing, encoding, or carrying an instruction set for execution by a processor so as to cause the processor to execute any of the methods of the present invention.
The above one or more embodiments have the following technical effects:
Run-time parameters are set first to control the number of threads and thread-core affinity, improving the overall execution efficiency of second-order blind source separation.
In the data preprocessing stage, thread-level parallel acceleration is performed with OpenMP multithreading and the Intel Math Kernel Library.
During joint approximate diagonalization, thread-level parallelism and instruction-set parallelism are used together and memory access is optimized to raise the algorithm's execution speed: 1) thread-level parallel acceleration is performed with OpenMP; since the parallelizable regions in joint approximate diagonalization are extremely scattered, multiple parallel regions are merged to reduce multithreading overhead; 2) a new Givens rotation based on sine and cosine reduces the number of computing instructions and memory reads/writes; 3) vector registers and the AVX instruction set realize instruction-level and data-level parallel acceleration, optimizing pipelined execution through single instruction, multiple data; 4) in the iterative update of the orthogonal matrix, matrix transposition reduces discontiguous memory accesses and improves the cache hit rate; floating-point scalar temporaries replace array temporaries, reducing memory reads/writes; and an efficient small-scale matrix multiplication avoids the call overhead of the Intel Math Kernel Library.
Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device; alternatively, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in its protection scope.
Although the specific embodiments of the present invention are described above with reference to the accompanying drawings, they do not limit the protection scope of the present invention; those skilled in the art should understand that, on the basis of the technical solutions of the present invention, various modifications or variations that can be made without creative labor still fall within the protection scope of the present invention.

Claims (10)

1. A second-order blind source separation parallel optimization method based on a multi-core platform, characterized by comprising the following steps:
receiving environment variable parameters and setting CPU thread-core affinity;
receiving a signal to be processed and performing multi-threaded parallel preprocessing on the signal to be processed;
merging multiple regions amenable to parallel computation and performing joint approximate diagonalization;
outputting a separation matrix and a source matrix.
2. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 1, characterized in that preprocessing the signal to be processed comprises: computing a time-delay matrix, performing data whitening, and computing sample covariance matrices.
3. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 2, characterized in that the preprocessing process is accelerated by a math kernel function library.
4. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 1, characterized in that the joint approximate diagonalization comprises: Givens rotation and iterative updating of an orthogonal matrix.
5. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 4, characterized in that the Givens rotation matrix is solved through trigonometric functions.
6. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 4, characterized in that, during the iterative updating of the orthogonal matrix, the matrix to be accessed is transposed, and floating-point scalar temporary variables are used in place of a series of array-typed temporary variables.
7. The second-order blind source separation parallel optimization method based on a multi-core platform according to claim 4, characterized in that, for the mathematical computations involving addition, subtraction, multiplication, and division in the joint approximate diagonalization, equivalent AVX instructions are used for parallel acceleration.
8. A second-order blind source separation parallel optimization system based on a multi-core platform, characterized by comprising:
a CPU affinity configuration module, configured to receive environment variable parameters and set CPU thread-core affinity;
a data receiving module, configured to receive a signal to be processed;
a data preprocessing module, configured to perform multi-threaded parallel preprocessing on the signal to be processed;
a diagonalization module, configured to merge multiple regions amenable to parallel computation and perform joint approximate diagonalization;
a processing result output module, configured to output the separation matrix and the source matrix.
9. A computing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the second-order blind source separation parallel optimization method based on a multi-core platform according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the second-order blind source separation parallel optimization method based on a multi-core platform according to any one of claims 1-7.
CN201910329707.XA 2019-04-23 2019-04-23 Second order blind source separating parallel optimization method and system based on multi-core platform Pending CN110188320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910329707.XA CN110188320A (en) 2019-04-23 2019-04-23 Second order blind source separating parallel optimization method and system based on multi-core platform


Publications (1)

Publication Number Publication Date
CN110188320A true CN110188320A (en) 2019-08-30

Family

ID=67714992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910329707.XA Pending CN110188320A (en) 2019-04-23 2019-04-23 Second order blind source separating parallel optimization method and system based on multi-core platform

Country Status (1)

Country Link
CN (1) CN110188320A (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103426436A (en) * 2012-05-04 2013-12-04 索尼电脑娱乐公司 Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ADEL BELOUCHRANI et al.: "A Blind Source Separation Technique Using Second-Order Statistics", IEEE Transactions on Signal Processing *
LEI SHAN et al.: "Accelerating Nyström Kernel Independent Component Analysis with Many Integrated Core Architecture", Communications in Computer and Information Science *
LIU Xin et al.: "Large-scale application characteristics analysis and exascale scalability research on the Sunway TaihuLight computer system", Chinese Journal of Computers *
WU Wenwei et al.: "Controller Design and Implementation in Active Control", Harbin Engineering University Press, 30 April 2017 *
CHEN Yongjian: "Research on OpenMP Compilation and Optimization Techniques", China Master's and Doctoral Dissertations Full-text Database, Information Science and Technology *
LEI Hong: "Multi-core Heterogeneous Parallel Computing with OpenMP 4.5: C/C++", Metallurgical Industry Press, 30 April 2018 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705602A (en) * 2019-09-06 2020-01-17 平安科技(深圳)有限公司 Large-scale data clustering method and device and computer readable storage medium
CN113094646A (en) * 2021-03-25 2021-07-09 电子科技大学 Matrix data processing system and method based on matrix joint approximate diagonalization
CN113094646B (en) * 2021-03-25 2023-04-28 电子科技大学 Matrix data processing system and method based on matrix joint approximate diagonalization
CN113704691A (en) * 2021-08-26 2021-11-26 中国科学院软件研究所 Small-scale symmetric matrix parallel three-diagonalization method of Shenwei many-core processor
CN113704691B (en) * 2021-08-26 2023-04-25 中国科学院软件研究所 Small-scale symmetric matrix parallel tri-diagonalization method of Shenwei many-core processor
CN114080953A (en) * 2021-11-05 2022-02-25 山东省农业机械科学研究院 Illumination management method and system for mushroom house

Similar Documents

Publication Publication Date Title
CN110188320A (en) Second order blind source separating parallel optimization method and system based on multi-core platform
Li et al. Quantum supremacy circuit simulation on Sunway TaihuLight
Dong et al. Dnnmark: A deep neural network benchmark suite for gpus
Betkaoui et al. Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing
Yang et al. An efficient parallel algorithm for longest common subsequence problem on gpus
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
Schmidt et al. Parallel programming: concepts and practice
Chen et al. Brain big data processing with massively parallel computing technology: challenges and opportunities
Dong et al. Characterizing the microarchitectural implications of a convolutional neural network (cnn) execution on gpus
Li et al. Automatic generation of high-performance fft kernels on arm and x86 cpus
Zhao et al. Combined kernel for fast GPU computation of Zernike moments
Pratas et al. Fine-grain parallelism using multi-core, Cell/BE, and GPU systems
Zhang et al. Performance analysis and optimization for SpMV based on aligned storage formats on an ARM processor
CN102902657A (en) Method for accelerating FFT (Fast Fourier Transform) by using GPU (Graphic Processing Unit)
Xu et al. Optimizing finite volume method solvers on Nvidia GPUs
Zhang et al. NUMA-Aware DGEMM based on 64-bit ARMv8 multicore processors architecture
Lan et al. Accelerating large-scale biological database search on Xeon Phi-based neo-heterogeneous architectures
Gan et al. Scaling and analyzing the stencil performance on multi-core and many-core architectures
Chen et al. Performance evaluation of convolutional neural network on Tianhe-3 prototype
Wu et al. A vectorized k-means algorithm for intel many integrated core architecture
Haghi et al. WFA-FPGA: An efficient accelerator of the wavefront algorithm for short and long read genomics alignment
Kaliszan et al. HPC processors benchmarking assessment for global system science applications
Wang et al. Observer-controller stabilization of a class of manipulators with a single flexible link
Cao et al. Critique of “A parallel framework for constraint-based Bayesian network learning via Markov blanket discovery” by SCC team from Tsinghua University
Gugnani et al. MPI-LiFE: Designing high-performance linear fascicle evaluation of brain connectome with MPI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
