WO2023050490A1 - Data association feature analysis method and apparatus, and device and medium

Info

Publication number: WO2023050490A1
Application number: PCT/CN2021/124577
Authority: WO
Inventors: 陈东来
Original assignee: 深圳前海环融联易信息科技服务有限公司
Priority date: 2021-09-30
Filing date: 2021-10-19
Publication date: 2023-04-06
Also published as: CN113609204B; CN113609204A

Abstract

Disclosed in the present application are a data association feature analysis method and apparatus, and a device and a medium. The method comprises: converting initial sample data according to a data conversion rule to obtain a sample feature matrix and a sample detection result matrix; performing feature analysis on each column of sample data in the sample feature matrix according to a sample feature analysis rule and the sample detection result matrix to obtain a corresponding feature distribution value; performing distribution statistics on the feature distribution value corresponding to each column of sample data to obtain a corresponding composite test value; and screening out, from the sample feature matrix according to the composite test value, the associated column information that corresponds to an association screening coefficient.

Description

Data association feature analysis method, device, equipment and medium

This application claims the priority of the Chinese patent application with the application number 202111164594.6 and the invention title "data association feature analysis method, device, equipment and medium" submitted to the China Patent Office on September 30, 2021, the entire contents of which are incorporated by reference in this application.

Technical field

The present application relates to the technical field of big data analysis, and in particular to a data association feature analysis method, device, equipment and medium.

Background

Big data analysis can reveal relationships between causes and results. For example, association analysis between the genome and a disease can determine which genes are related to the disease. However, the inventors found that genes contain a huge amount of information, so the gene sequences to be analyzed also contain a very large amount of data. As the number of samples increases, existing association feature analysis methods become inefficient when analyzing massive genetic data and cannot accurately obtain the positions of the genes associated with a disease. The prior art therefore cannot quickly analyze massive data to accurately obtain associated features.

Summary of the invention

The embodiment of the present application provides a data association feature analysis method, device, equipment and medium, aiming to solve the problem that prior art methods cannot quickly analyze massive data to accurately obtain associated features.

In the first aspect, the embodiment of the present application provides a data association feature analysis method, which includes:

if input initial sample data is received, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;

performing feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution value corresponding to each column of sample data;

performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data; and

screening out, from the sample feature matrix according to the composite test value, the associated column information corresponding to a preset association screening coefficient.

In the second aspect, the embodiment of the present application provides a data association feature analysis device, which includes:

a data conversion unit, configured to, if input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;

a feature distribution value acquisition unit, configured to perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution value corresponding to each column of sample data;

a composite test value acquisition unit, configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data; and

an associated column information acquisition unit, configured to screen out, from the sample feature matrix according to the composite test value, the associated column information corresponding to a preset association screening coefficient.

In a third aspect, the embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data association feature analysis method described in the first aspect above.

In a fourth aspect, the embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the data association feature analysis method described in the first aspect above.

Embodiments of the present application provide a data association feature analysis method, device, computer equipment and readable storage medium. The initial sample data is converted according to the data conversion rule to obtain the sample feature matrix and the sample detection result matrix. Feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rules and the sample detection result matrix to obtain the corresponding feature distribution value. Distribution statistics are performed on the feature distribution value corresponding to each column of sample data to obtain the corresponding composite test value, and the associated column information corresponding to the association screening coefficient is screened out from the sample feature matrix according to the composite test value. In this way, feature distribution values are obtained according to the sample feature analysis rules, distribution statistics are performed on them, and the associated column information is screened out of the sample feature matrix according to the resulting composite test values, so that massive data can be analyzed quickly to obtain accurate associated features.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and persons of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow diagram of a data association feature analysis method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of the sub-flow of the data association feature analysis method provided by the embodiment of the present application;

FIG. 3 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application;

FIG. 4 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application;

FIG. 5 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application;

FIG. 6 is another schematic flowchart of the data association feature analysis method provided by the embodiment of the present application;

FIG. 7 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application;

FIG. 8 is a schematic block diagram of a data association feature analysis device provided in an embodiment of the present application;

FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

It should be understood that when used in this specification and the appended claims, the terms "comprising" and "comprises" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

It should also be understood that the terminology used in the specification of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include plural referents unless the context clearly dictates otherwise.

It should be further understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1, which is a schematic flow diagram of the data association feature analysis method provided by the embodiment of the present application. The method is applied to a user terminal or a management server and is executed by application software installed on the user terminal or the management server. The management server is a server that can execute the data association feature analysis method to perform association feature analysis on the initial sample data, and may be a server built inside an enterprise or a government department. The user terminal is a terminal device, such as a desktop computer, a notebook computer, a tablet computer or a mobile phone, that can execute the data association feature analysis method to perform association feature analysis on the initial sample data. As shown in FIG. 1, the method includes steps S110-S140.

S110. If the input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix.

The user can input the initial sample data to the user terminal or the management server. The initial sample data may be the genetic data and detection results of the samples. The genetic data may be all or part of the gene sequence contained in a pair of chromosomes, and the detection result is detection information indicating whether the sample has a disease. Through data association analysis, the technical solution can screen out from the genetic data the gene points that are strongly associated with the detection results. The initial sample data can be converted according to the data conversion rule, which includes sample data mapping information and detection result mapping information.

In one embodiment, as shown in FIG. 2, step S110 includes sub-steps S111 and S112.

S111. Perform a mapping process on sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain a corresponding sample feature matrix.

Specifically, the sample feature data of each sample in the initial sample data can be mapped according to the sample data mapping information. The sample feature data is the genetic data of each sample in the initial sample data. The various types of genetic data are mapped to obtain the sample feature matrix, which contains the sample data corresponding to the genetic data of each sample. Each gene point in a chromosome contains two bases, and a gene point can take multiple genotypes, such as A=T-A=T, A=T-G≡C and G≡C-G≡C, where A or G is an allele and the base that appears less frequently is determined to be the minor allele. For example, if G appears less often than A, G is called the minor allele. The sample data mapping information then includes the mappings AA to 0, AG to 1 and GG to 2.

For example, the initial sample data contains 1963 samples, and the genetic data of each sample contains 317503 gene points, so a sample feature matrix with 1963 rows and 317503 columns can be obtained correspondingly.

S112. Perform a mapping process on the detection results of each sample in the initial sample data according to the detection result mapping information to obtain a corresponding sample detection result matrix.

The detection results in the initial sample data may be mapped according to the detection result mapping information. Specifically, the detection results may include the detection results of one or more diseases. If the detection results cover only one disease, a result of "diseased" is mapped to "1" and a result of "not diseased" is mapped to "0". If the detection results cover multiple diseases, a result of suffering from multiple diseases is mapped to "1" and all other results are mapped to "0".

For example, the detection results of 1963 samples in the initial sample data are mapped to obtain a sample detection result matrix with 1963 rows and 1 column.
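The mapping in sub-steps S111 and S112 can be illustrated with a minimal Python sketch. The names `GENOTYPE_CODES`, `RESULT_CODES` and `to_matrices`, the input layout, and the treatment of "GA" as equivalent to "AG" are assumptions made for illustration only; they are not part of the application.

```python
import numpy as np

# Hypothetical mapping tables mirroring the examples in the text:
# the code counts copies of the minor allele G at a gene point.
GENOTYPE_CODES = {"AA": 0, "AG": 1, "GA": 1, "GG": 2}  # "GA" is an added assumption
RESULT_CODES = {"diseased": 1, "not diseased": 0}


def to_matrices(samples):
    """Convert (genotypes, result) pairs into X (n x m) and Y (n x 1)."""
    X = np.array([[GENOTYPE_CODES[g] for g in genotypes] for genotypes, _ in samples])
    Y = np.array([[RESULT_CODES[result]] for _, result in samples])
    return X, Y


# Two samples with three gene points each (made-up values).
samples = [
    (["AA", "AG", "GG"], "diseased"),
    (["AG", "AA", "GG"], "not diseased"),
]
X, Y = to_matrices(samples)
print(X.shape, Y.shape)
```

With 1963 samples and 317503 gene points, the same function would return matrices of shape 1963 × 317503 and 1963 × 1.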

S120. Perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain feature distribution values corresponding to each column of sample data.

Each column of sample data is the feature data of all samples at the same gene point. The sample feature analysis rules are the specific rules for analyzing the sample feature matrix. Based on the sample feature analysis rules and the sample detection result matrix, feature analysis can be performed on each column of sample data in the sample feature matrix to obtain the feature distribution value corresponding to each column of sample data. The feature distribution value is the distribution value, under a specific distribution, of the feature of each gene point across the samples. The sample feature analysis rules include a latent variable calculation formula and a feature calculation formula.

In one embodiment, as shown in FIG. 3, step S120 includes sub-steps S121 and S122.

S121. Calculate the sample feature matrix according to the latent variable calculation formula to obtain a corresponding latent variable matrix.

First, the sample feature matrix can be calculated according to the latent variable calculation formula to obtain the corresponding latent variable matrix, which captures the hidden association between each column of sample data and the corresponding detection results.

Specifically, the sample feature matrix can be decomposed according to the latent variable calculation formula, so that the sample feature matrix X can be expressed by formula (1):

$$X = UDV^{T} = U_{1}D_{1}V_{1}^{T} + U_{2}D_{2}V_{2}^{T} \qquad (1)$$

Here, T is the matrix transpose symbol, and the columns of matrix U and matrix V are orthonormal, that is, $V^{T}V = I$ and $U^{T}U = I$, where I is the identity matrix with ones on the diagonal. $U_{1}$ and $U_{2}$ are the sub-matrices obtained by partitioning U by columns, that is, $U = (U_{1}, U_{2})$; $V_{1}$ and $V_{2}$ are the sub-matrices obtained by partitioning V by columns, that is, $V = (V_{1}, V_{2})$. $D = \mathrm{diag}(D_{1}, D_{2})$ is a diagonal matrix, called the singular value matrix, whose values are arranged in descending order. The decomposed matrix $U_{1}$ can be used as the latent variable matrix G.
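As an illustration of formula (1), the following sketch uses NumPy's singular value decomposition and keeps the first d left singular vectors as G. The function name `latent_variable_matrix` and the parameter `d` are hypothetical; the text does not state how the number of retained columns is chosen.

```python
import numpy as np


def latent_variable_matrix(X, d):
    """Return G = U_1 (first d left singular vectors of X) and the singular values."""
    # full_matrices=False gives the reduced decomposition X = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :d], s


# Small made-up feature matrix with genotype codes 0, 1, 2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(8, 5)).astype(float)
G, s = latent_variable_matrix(X, d=2)
print(G.shape)
```

NumPy returns the singular values in descending order, which matches the ordering described for D. For a matrix as large as the 1963 × 317503 example, a truncated decomposition would likely be preferable to a full one.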

S122. Calculate the latent variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the feature calculation formula, so as to obtain the feature distribution value corresponding to each column of sample data.

After the latent variable matrix is obtained, the feature distribution value of each column of sample data can be calculated separately according to the feature calculation formula. The feature calculation formula includes a degree of freedom value calculation formula, a block matrix formula and a distribution value calculation formula.

In one embodiment, as shown in FIG. 4, step S122 includes sub-steps S1221, S1222 and S1223.

S1221. Calculate the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to the degree of freedom value calculation formula in the feature calculation formula to obtain a corresponding degree of freedom value.

First, the degree of freedom value is calculated from the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to the degree of freedom value calculation formula. This degree of freedom value is shared by all columns of sample data. The degree of freedom value calculation formula can be expressed by formula (2):

Here, n is the number of rows of the sample feature matrix X, and d is the number of columns of the latent variable matrix G.

S1222. Perform an inverse operation on the latent variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of sample data.

An inverse operation can be performed on the latent variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula to obtain the estimated value corresponding to each column of sample data. For each column of sample data, the following relationship holds: $X_{i} = YB_{i} + G\Gamma_{i} + E_{i}$, where $X_{i}$ is column i of the sample feature matrix, Y is the sample detection result matrix, $B_{i}$ is the coefficient corresponding to the sample detection result matrix Y, G is the latent variable matrix, $\Gamma_{i}$ is the coefficient corresponding to the latent variable matrix, and $E_{i}$ is the residual. The residuals corresponding to different columns of sample data are independent of each other.

The estimated value corresponding to $B_{i}$, that is, the estimated value corresponding to each column of sample data, can be expressed by formula (3):

Among them, T is the matrix transpose symbol.
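One way to estimate $B_{i}$ in the relationship $X_{i} = YB_{i} + G\Gamma_{i} + E_{i}$ is ordinary least squares on the combined design matrix (Y, G). The sketch below assumes that approach; it is not necessarily the block matrix formula of the application, and the name `estimate_B` is hypothetical.

```python
import numpy as np


def estimate_B(X, Y, G):
    """Least-squares estimate of B_i for every column X_i (assumed approach)."""
    W = np.hstack([Y, G])                    # n x (1 + d) design matrix (Y, G)
    coef, *_ = np.linalg.lstsq(W, X, rcond=None)
    return coef[0]                           # first row holds the coefficients on Y


# Noise-free synthetic data, so the estimate should recover b_true.
rng = np.random.default_rng(0)
n = 20
Y = (np.arange(n) % 2).reshape(-1, 1).astype(float)
G = rng.normal(size=(n, 2))
b_true = np.array([0.5, -1.0, 2.0])
gamma = rng.normal(size=(2, 3))
X = Y @ b_true[None, :] + G @ gamma
b_hat = estimate_B(X, Y, G)
print(np.round(b_hat, 6))
```

Solving all columns in one `lstsq` call avoids forming an explicit matrix inverse for each gene point.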

S1223. According to the distribution value calculation formula in the feature calculation formula, perform a calculation on the degree of freedom value, the estimated value corresponding to each column of sample data, the latent variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix, to obtain the feature distribution value corresponding to each column of sample data.

Based on the calculated degree of freedom value, the estimated value of each column of sample data, the latent variable matrix and the sample detection result matrix, the feature distribution value of each column of sample data can be further calculated using the distribution value calculation formula. Because each column of sample data contains the feature data of every sample at the same gene point, the feature distribution value of a column contains the distribution values of one gene point for all samples; that is, the number of distribution values contained in the feature distribution value of each column of sample data equals the number of samples.

The distribution value calculation formula can be expressed by formula (4):

Here, z is the degree of freedom value, and $t_{i}$ is the calculated feature distribution value.
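Formulas (2) and (4) are not reproduced in this text, so the following is only a sketch under explicit assumptions: the degree of freedom value is taken to be z = n - d - 1, and $t_{i}$ is taken to be the usual least-squares t statistic for the coefficient on Y. It yields one statistic per column, which may differ from the per-sample values described above. The name `t_statistics` is hypothetical.

```python
import numpy as np


def t_statistics(X, Y, G):
    """Per-column t statistic for the coefficient on Y (assumed form of formula (4))."""
    n, d = X.shape[0], G.shape[1]
    z = n - d - 1                             # assumed form of formula (2)
    W = np.hstack([Y, G])
    coef, *_ = np.linalg.lstsq(W, X, rcond=None)
    resid = X - W @ coef
    sigma2 = (resid ** 2).sum(axis=0) / z     # residual variance of each column
    var_b = sigma2 * np.linalg.inv(W.T @ W)[0, 0]
    return coef[0] / np.sqrt(var_b), z


# Synthetic data: column 0 depends strongly on Y, column 1 does not.
rng = np.random.default_rng(1)
n = 40
Y = (np.arange(n) % 2).reshape(-1, 1).astype(float)
G = rng.normal(size=(n, 2))
X = Y @ np.array([[3.0, 0.0]]) + G @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(n, 2))
t, z = t_statistics(X, Y, G)
print(z, t.shape)
```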

S130. Perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data.

Distribution statistics can be performed on the feature distribution values to obtain the corresponding composite test values, with one composite test value obtained for each column of sample data.

In one embodiment, as shown in FIG. 5, step S130 includes sub-steps S131 and S132.

S131. Perform extreme value distribution statistics on the feature distribution value corresponding to each column of sample data to obtain feature distribution value statistical information for each column of sample data.

Specifically, extreme value distribution statistics can be performed on the feature distribution value of each column of sample data. When the sample size tends to infinity, the feature distribution value t of any column of sample data approximately follows a normal distribution. Using the extreme value distribution theorem, the distribution form with the largest absolute value can be determined as the target distribution form corresponding to the feature distribution value t of the current column of sample data, and the distribution parameters of the target distribution form are then obtained as the corresponding feature distribution value statistical information.

For example, a normal distribution can be represented by expression (5):

$$R \sim N(\mu, \sigma^{2}) \qquad (5)$$

Here, $\mu$ and $\sigma$ are the corresponding distribution parameters.

S132. Obtain, according to a preset test value data table, a composite test value corresponding to the statistical form of each item of feature distribution value statistical information.

The user terminal or management server also pre-stores a test value data table, which contains the test value corresponding to each statistical form. After the feature distribution value statistical information is obtained, a corresponding test value can be looked up in the test value data table according to the statistical form of that information and used as the composite test value.
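The contents of the test value data table are not given in the text. As a hedged illustration only, the sketch below assumes that the composite test value is a two-sided tail probability of the statistic under a standard normal distribution, and it replaces the table lookup with a closed-form calculation. The name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(t):
    """P(|R| >= |t|) for R ~ N(0, 1); an assumed stand-in for the table lookup."""
    # For a standard normal variable, 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
    return math.erfc(abs(t) / math.sqrt(2.0))


for t in (0.0, 1.96, 5.0):
    print(t, two_sided_p_value(t))
```

Under this assumption, larger absolute statistics give smaller test values, which is consistent with the later step that keeps the columns whose test value falls below the screening coefficient.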

S140. Screen out, from the sample feature matrix according to the composite test value, the associated column information corresponding to the preset association screening coefficient.

After the composite test values are obtained, the sample feature matrix can be screened according to the composite test values and the association screening coefficient to obtain the corresponding associated column information. The associated column information contains at least one column code value, and the column code values it contains indicate the gene points in the gene sequence that are strongly associated with the disease.

In one embodiment, as shown in FIG. 6, step S1401 is further included before step S140.

S1401. Calculate the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula to obtain the association screening coefficient.

Before the sample feature matrix is screened according to the association screening coefficient, the association screening coefficient can be calculated from the number of columns of the sample feature matrix according to the screening coefficient calculation formula. Specifically, the screening coefficient calculation formula can be expressed by formula (6):

S=e/m (6);

Here, e is a preset parameter value in the formula, m is the number of columns of the sample feature matrix, and S is the calculated association screening coefficient. For example, if e is set to 0.05 and m is 317503, the calculation gives $S = 1.57 \times 10^{-7}$.
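The arithmetic of formula (6) can be checked with a few lines of Python. The function name `screening_coefficient` is hypothetical.

```python
def screening_coefficient(e, m):
    """Association screening coefficient S = e / m from formula (6)."""
    if m <= 0:
        raise ValueError("m must be a positive column count")
    return e / m


S = screening_coefficient(0.05, 317503)
print(f"{S:.2e}")
```

Dividing a fixed level e by the number of columns m makes the threshold stricter as more gene points are tested.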

In one embodiment, as shown in FIG. 7, step S140 includes sub-steps S141 and S142.

S141. Judge whether the composite test value of each column of sample data is smaller than the association screening coefficient, and determine, according to the judgment result, the composite test values smaller than the association screening coefficient as target test values.

It can be judged whether the composite test value of each column of sample data is smaller than the association screening coefficient. If it is smaller, the gene point corresponding to that composite test value has a significant association; if it is not smaller, the gene point corresponding to that composite test value has no significant association. According to the judgment result, the composite test values smaller than the association screening coefficient are taken as the target test values.

S142. Obtain and combine the column code values corresponding to the target test values in the sample feature matrix as the associated column information corresponding to the association screening coefficient.

The sample feature matrix is screened according to the target test values. Each target test value corresponds to one column of sample data in the sample feature matrix, so the column code value corresponding to each target test value can be obtained from the sample feature matrix and combined to give the associated column information. The gene points corresponding to the column code values in the associated column information are strongly associated with the disease.
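Sub-steps S141 and S142 amount to a strict threshold comparison followed by collecting column indices. The sketch below assumes that zero-based column indices serve as the column code values; the name `associated_columns` and the sample test values are hypothetical.

```python
import numpy as np


def associated_columns(test_values, S):
    """Return the indices of columns whose composite test value is strictly below S."""
    test_values = np.asarray(test_values, dtype=float)
    return np.flatnonzero(test_values < S).tolist()


# Four made-up composite test values, screened with S = e / m for e = 0.05, m = 4.
cols = associated_columns([0.2, 1e-9, 3e-8, 0.5], S=0.05 / 4)
print(cols)
```

A value exactly equal to S is not selected, matching the "smaller than" condition in S141.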

In the data association feature analysis method provided in the embodiment of the present application, the initial sample data is converted according to the data conversion rule to obtain the sample feature matrix and the sample detection result matrix; feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rules and the sample detection result matrix to obtain the corresponding feature distribution value; distribution statistics are performed on the feature distribution value corresponding to each column of sample data to obtain the corresponding composite test value; and the associated column information corresponding to the association screening coefficient is screened out from the sample feature matrix according to the composite test value. In this way, feature distribution values are obtained according to the sample feature analysis rules, distribution statistics are performed on them, and the associated column information is screened out of the sample feature matrix according to the resulting composite test values, so that massive data can be analyzed quickly to obtain accurate associated features.

The embodiment of the present application also provides a data association feature analysis device, which can be configured in a user terminal or a management server, and the data association feature analysis device is used to implement any implementation of the aforementioned data association feature analysis method example. Specifically, please refer to FIG. 8 , which is a schematic block diagram of an apparatus for analyzing data association features provided by an embodiment of the present application.

As shown in FIG. 8 , the data correlation feature analysis device 100 includes a data conversion unit 110 , a feature distribution value acquisition unit 120 , a composite test value acquisition unit 130 and an association column information acquisition unit 140 .

The data conversion unit 110 is configured to convert the initial sample data according to preset data conversion rules to obtain a corresponding sample feature matrix and sample detection result matrix if the input initial sample data is received.

In a specific embodiment, the data conversion unit 110 includes a subunit: a sample feature matrix acquisition unit, configured to map the sample feature data of each sample in the initial sample data according to the sample data mapping information, Obtaining a corresponding sample feature matrix; a sample detection result matrix acquisition unit configured to perform mapping processing on the detection result of each sample in the initial sample data according to the detection result mapping information to obtain a corresponding sample detection result matrix.

The feature distribution value acquisition unit 120 is configured to perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution corresponding to each column of sample data value.

In a specific embodiment, the characteristic distribution value acquisition unit 120 includes a subunit: a hidden variable matrix acquisition unit, which is used to calculate the sample feature matrix according to the hidden variable calculation formula to obtain a corresponding hidden variable matrix; A calculation unit, configured to calculate each column of sample data in the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the feature calculation formula, so as to obtain the feature corresponding to each column of the sample data distribution value.

In a specific embodiment, the feature calculation unit includes a subunit: a degree of freedom value calculation unit, which is used to calculate the number of rows of the sample feature matrix and the implicit The number of columns of the variable matrix is calculated to obtain the corresponding degree of freedom; the estimated value calculation unit is used to calculate the hidden variable matrix, the sample detection result matrix and the sample according to the block matrix formula in the feature calculation formula Performing an inverse operation on the characteristic matrix to obtain the estimated value corresponding to the sample data in each column; the distribution value calculation unit is used to correspond to the degree of freedom value and the sample data in each column according to the distribution value calculation formula in the feature calculation formula Calculate the estimated value of the hidden variable matrix, the sample detection result matrix and the sample data in each column of the sample feature matrix to obtain the characteristic distribution value corresponding to each column of the sample data.

The composite inspection value acquisition unit 130 is configured to perform distribution statistics on the characteristic distribution values to obtain a composite inspection value corresponding to each column of the sample data.

In a specific embodiment, the composite test value acquisition unit 130 includes the following subunits: a feature distribution value statistics unit, configured to perform extreme value distribution statistics on the feature distribution value corresponding to each column of sample data, to obtain feature distribution value statistical information for each column of sample data; and a test value acquisition unit, configured to obtain, according to a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
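As a rough illustration of a table lookup of this kind, the following sketch maps the magnitude of a feature distribution value to an upper bound on its two-sided tail probability. The table contents are an assumption made for illustration: they are the standard normal two-sided critical values, used here as a large-sample stand-in, and are not the preset test value data table of this application. The names `CRITICAL_VALUES`, `UPPER_BOUNDS` and `composite_test_value` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical lookup table: two-sided critical values of the standard normal
# distribution and the tail probability each one corresponds to.
CRITICAL_VALUES = [1.645, 1.960, 2.576, 3.291]
UPPER_BOUNDS = [1.0, 0.10, 0.05, 0.01, 0.001]

def composite_test_value(distribution_value):
    """Return a conservative upper bound on the two-sided tail probability."""
    # Count how many critical values the magnitude reaches or exceeds.
    idx = bisect_right(CRITICAL_VALUES, abs(distribution_value))
    return UPPER_BOUNDS[idx]
```

A finer table, or an exact tail probability from a statistics library, would give more resolution when very small screening coefficients are used.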

The association column information acquisition unit 140 is configured to filter out the association column information corresponding to the preset association screening coefficients from the sample feature matrix according to the composite test value.

In a specific embodiment, the data association feature analysis device 100 further includes a subunit: an association screening coefficient calculation unit, configured to calculate the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula, to obtain the association screening coefficient.
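One plausible form of a screening coefficient that depends only on the number of columns is a Bonferroni-style threshold, sketched below. This is an assumption for illustration; the passage above states only that the coefficient is calculated from the number of columns. The function name and the default `significance_level` of 0.05 are hypothetical.

```python
def association_screening_coefficient(num_columns, significance_level=0.05):
    """Divide an overall significance level by the number of tested columns."""
    if num_columns <= 0:
        raise ValueError("num_columns must be positive")
    return significance_level / num_columns
```

With 1000 feature columns, for example, this sketch yields a threshold of about 0.00005, so the threshold tightens as more columns are tested.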

In a specific embodiment, the association column information acquisition unit 140 includes the following subunits: a target test value determination unit, configured to judge whether the composite test value of each column of sample data is smaller than the association screening coefficient and, according to the judgment result, to determine each composite test value smaller than the association screening coefficient as a target test value; and a column code value combination unit, configured to obtain the column code values corresponding to the target test values in the sample feature matrix and combine them as the association column information corresponding to the association screening coefficient.
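The comparison and combination performed by these two subunits can be sketched as follows. The function name `select_associated_columns` and the representation of column code values as a list of labels are assumptions made for illustration.

```python
def select_associated_columns(composite_values, column_codes, coefficient):
    """Return the codes of columns whose composite test value is below the coefficient."""
    if len(composite_values) != len(column_codes):
        raise ValueError("one composite test value is required per column")
    # Strict comparison: a value equal to the coefficient is not selected.
    return [code for code, value in zip(column_codes, composite_values)
            if value < coefficient]
```

For example, given composite test values `[0.2, 0.00001, 0.03, 0.00004]` for columns `["c0", "c1", "c2", "c3"]` and a coefficient of `0.00005`, the sketch returns `["c1", "c3"]`.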

The data association feature analysis device provided in the embodiment of the present application applies the above-mentioned data association feature analysis method. It converts the initial sample data according to the data conversion rule to obtain a sample feature matrix and a sample detection result matrix; performs feature analysis on each column of sample data in the sample feature matrix according to the sample feature analysis rules and the sample detection result matrix to obtain the corresponding feature distribution value; performs distribution statistics on the feature distribution value corresponding to each column of sample data to obtain the corresponding composite test value; and screens out, from the sample feature matrix, the association column information corresponding to the association screening coefficient according to the composite test value. In this way, feature distribution values are obtained according to the sample feature analysis rules and subjected to distribution statistics, and the association column information is screened out from the sample feature matrix according to the resulting composite test values, so that massive data can be analyzed quickly to obtain accurate association features.

The above-mentioned device for analyzing data association features can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9 .

Please refer to FIG. 9 , which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device may be a user terminal or a management server for performing the data correlation feature analysis method to perform correlation feature analysis on the initial sample data.

Referring to FIG. 9, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.

The storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, the processor 502 may execute the data association feature analysis method, wherein the storage medium 503 may be a volatile storage medium or a non-volatile storage medium.

The processor 502 is used to provide calculation and control capabilities and support the operation of the entire computer device 500 .

The internal memory 504 provides an environment for the operation of the computer program 5032 in the storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the data association feature analysis method.

The network interface 505 is used for network communication, such as providing data transmission and the like. Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer device 500 on which the solution of this application is applied. The specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.

Wherein, the processor 502 is configured to run the computer program 5032 stored in the memory, so as to realize the corresponding functions in the above-mentioned data association feature analysis method.

Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 9 does not constitute a limitation on the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in FIG. 9, and will not be repeated here.

It should be understood that in the embodiment of the present application, the processor 502 may be a central processing unit (CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor or the like.

In another embodiment of the present application a computer readable storage medium is provided. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the steps included in the above-mentioned data association feature analysis method are implemented.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described equipment, devices and units can refer to the corresponding process in the foregoing method embodiments, and details are not repeated here. Those of ordinary skill in the art can realize that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are implemented by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

In the several embodiments provided in this application, it should be understood that the disclosed equipment, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. The division of the units is only a logical function division; in actual implementation there may be other division methods. Units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present application.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions to enable a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned computer-readable storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk and other media that can store program code.

The above is only a specific embodiment of the application, but the scope of protection of the application is not limited thereto. Any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the application, and these modifications or replacements shall be covered within the scope of protection of this application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

A data association feature analysis method, the method comprising:

If the input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and a sample detection result matrix;

performing feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain feature distribution values corresponding to each column of sample data;

Performing distribution statistics on the characteristic distribution value to obtain a composite test value corresponding to each column of the sample data;

The associated column information corresponding to the preset associated screening coefficient is screened out from the sample feature matrix according to the composite test value.
The data association feature analysis method according to claim 1, wherein the data conversion rules include sample data mapping information and detection result mapping information, and the converting of the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:

Mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain a corresponding sample feature matrix;

The detection result of each sample in the initial sample data is mapped according to the detection result mapping information to obtain a corresponding sample detection result matrix.
The data association feature analysis method according to claim 1, wherein the sample feature analysis rules include a hidden variable calculation formula and a feature calculation formula, and the performing of feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rules and the sample detection result matrix to obtain the feature distribution value corresponding to each column of sample data comprises:

calculating the sample feature matrix according to the hidden variable calculation formula to obtain a corresponding hidden variable matrix;

Calculate each column of sample data in the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the feature calculation formula to obtain a feature distribution value corresponding to each column of the sample data.
The data association feature analysis method according to claim 3, wherein the calculating of the hidden variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of sample data comprises:

Calculate the number of rows of the sample feature matrix and the number of columns of the hidden variable matrix according to the degree of freedom value calculation formula in the feature calculation formula to obtain the corresponding degree of freedom value;

performing an inverse operation on the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of the sample data;

calculating the degree of freedom value, the estimated value corresponding to each column of the sample data, the hidden variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
The data association feature analysis method according to claim 1, wherein said performing distribution statistics on said feature distribution value to obtain a composite test value corresponding to each column of said sample data comprises:

Perform extreme value distribution statistics on the characteristic distribution values corresponding to the sample data in each column to obtain statistical information on the characteristic distribution values of the sample data in each column;

According to the preset inspection value data table, the composite inspection value corresponding to the statistical form of each characteristic distribution value statistical information is obtained.
The data association feature analysis method according to claim 1, wherein, before the association column information corresponding to the preset association screening coefficient is screened out from the sample feature matrix according to the composite test value, the method further comprises:

The number of columns of the sample feature matrix is calculated according to a preset calculation formula of the screening coefficient to obtain the correlation screening coefficient.
The data association characteristic analysis method according to claim 1, wherein said association column information corresponding to a preset association screening coefficient is screened out from said sample characteristic matrix according to said composite inspection value, comprising:

Judging whether the composite inspection value of the sample data in each column is smaller than the correlation screening coefficient, so as to determine the composite inspection value smaller than the correlation screening coefficient as the target inspection value according to the judgment result;

Obtaining and combining column coded values corresponding to the target test values in the sample feature matrix as associated column information corresponding to the associated screening coefficients.
A data association feature analysis device, said device comprising:

A data conversion unit, configured to convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and a sample detection result matrix if the input initial sample data is received;

A feature distribution value acquisition unit, configured to perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain feature distribution values corresponding to each column of sample data ;

A composite inspection value acquisition unit, configured to perform distribution statistics on the characteristic distribution value to obtain a composite inspection value corresponding to the sample data in each column;

The association column information acquisition unit is configured to filter out the association column information corresponding to the preset association screening coefficients from the sample feature matrix according to the composite test value.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the following steps when executing the computer program:

If the input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and a sample detection result matrix;

performing feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain feature distribution values corresponding to each column of sample data;

Performing distribution statistics on the characteristic distribution value to obtain a composite test value corresponding to each column of the sample data;

The associated column information corresponding to the preset associated screening coefficient is screened out from the sample feature matrix according to the composite test value.
The computer device according to claim 9, wherein the data conversion rules include sample data mapping information and detection result mapping information, and the converting of the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:

Mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain a corresponding sample feature matrix;

The detection result of each sample in the initial sample data is mapped according to the detection result mapping information to obtain a corresponding sample detection result matrix.
The computer device according to claim 9, wherein the sample feature analysis rules include a hidden variable calculation formula and a feature calculation formula, and the performing of feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rules and the sample detection result matrix to obtain the feature distribution value corresponding to each column of sample data comprises:

calculating the sample feature matrix according to the hidden variable calculation formula to obtain a corresponding hidden variable matrix;

Calculate each column of sample data in the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the feature calculation formula to obtain a feature distribution value corresponding to each column of the sample data.
The computer device according to claim 11, wherein the calculating of the hidden variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of sample data comprises:

Calculate the number of rows of the sample feature matrix and the number of columns of the hidden variable matrix according to the degree of freedom value calculation formula in the feature calculation formula to obtain the corresponding degree of freedom value;

performing an inverse operation on the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of the sample data;

calculating the degree of freedom value, the estimated value corresponding to each column of the sample data, the hidden variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
The computer device according to claim 9, wherein performing distribution statistics on the characteristic distribution value to obtain a composite inspection value corresponding to each column of the sample data includes:

Perform extreme value distribution statistics on the characteristic distribution values corresponding to the sample data in each column to obtain statistical information on the characteristic distribution values of the sample data in each column;

According to the preset inspection value data table, the composite inspection value corresponding to the statistical form of each characteristic distribution value statistical information is obtained.
The computer device according to claim 9, wherein, before the association column information corresponding to the preset association screening coefficient is screened out from the sample feature matrix according to the composite test value, the steps further comprise:

The number of columns of the sample feature matrix is calculated according to a preset calculation formula of the screening coefficient to obtain the correlation screening coefficient.
The computer device according to claim 9, wherein said filtering out the correlation column information corresponding to the preset correlation screening coefficient from the sample feature matrix according to the composite test value comprises:

Judging whether the composite inspection value of the sample data in each column is smaller than the correlation screening coefficient, so as to determine the composite inspection value smaller than the correlation screening coefficient as the target inspection value according to the judgment result;

Obtaining and combining column coded values corresponding to the target test values in the sample feature matrix as associated column information corresponding to the associated screening coefficients.
A computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the following operations are performed:

If the input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and a sample detection result matrix;

performing feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain feature distribution values corresponding to each column of sample data;

Performing distribution statistics on the characteristic distribution value to obtain a composite test value corresponding to each column of the sample data;

The associated column information corresponding to the preset associated screening coefficient is screened out from the sample feature matrix according to the composite test value.
The computer-readable storage medium according to claim 16, wherein the data conversion rules include sample data mapping information and detection result mapping information, and the converting of the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:

Mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain a corresponding sample feature matrix;

The detection result of each sample in the initial sample data is mapped according to the detection result mapping information to obtain a corresponding sample detection result matrix.
The computer-readable storage medium according to claim 16, wherein the sample feature analysis rules include a hidden variable calculation formula and a feature calculation formula, and the performing of feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rules and the sample detection result matrix to obtain the feature distribution value corresponding to each column of sample data comprises:

calculating the sample feature matrix according to the hidden variable calculation formula to obtain a corresponding hidden variable matrix;

Calculate each column of sample data in the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the feature calculation formula to obtain a feature distribution value corresponding to each column of the sample data.
The computer-readable storage medium according to claim 18, wherein the calculating of the hidden variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of sample data comprises:

Calculate the number of rows of the sample feature matrix and the number of columns of the hidden variable matrix according to the degree of freedom value calculation formula in the feature calculation formula to obtain the corresponding degree of freedom value;

performing an inverse operation on the hidden variable matrix, the sample detection result matrix, and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of the sample data;

calculating the degree of freedom value, the estimated value corresponding to each column of the sample data, the hidden variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
The computer-readable storage medium according to claim 16, wherein performing distribution statistics on the characteristic distribution value to obtain a composite inspection value corresponding to each column of the sample data includes:

Perform extreme value distribution statistics on the characteristic distribution values corresponding to the sample data in each column to obtain statistical information on the characteristic distribution values of the sample data in each column;

According to the preset inspection value data table, the composite inspection value corresponding to the statistical form of each characteristic distribution value statistical information is obtained.