WO2023050490A1 - Data association feature analysis method and apparatus, and device and medium - Google Patents

Data association feature analysis method and apparatus, and device and medium

Info

Publication number
WO2023050490A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
matrix
column
feature
data
Prior art date
Application number
PCT/CN2021/124577
Other languages
French (fr)
Chinese (zh)
Inventor
陈东来
Original Assignee
深圳前海环融联易信息科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023050490A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the present application relates to the technical field of big data analysis, and in particular to a data association feature analysis method, apparatus, device and medium.
  • the relationship between cause and result can be obtained through big data analysis; for example, association analysis between the genome and a disease can determine which genes are related to the disease.
  • the inventors found that, because genes contain a huge amount of information, the gene sequences to be analyzed also contain a very large amount of data. As the number of samples increases, existing association feature analysis methods become inefficient when analyzing massive genetic data and cannot accurately obtain the positions of the genes associated with a disease. The prior art therefore cannot quickly analyze massive data to accurately obtain associated features.
  • the embodiments of the present application provide a data association feature analysis method, apparatus, device and medium, aiming to solve the problem in the prior art that massive data cannot be analyzed quickly to accurately obtain associated features.
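  • in a first aspect, the embodiment of the present application provides a data association feature analysis method, which includes:
  • if input initial sample data is received, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
  • performing feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution value corresponding to each column of sample data;
  • performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data;
  • screening out, from the sample feature matrix, the associated column information corresponding to the preset associated screening coefficient according to the composite test value.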
  • in a second aspect, the embodiment of the present application provides a data association feature analysis device, which includes:
  • a data conversion unit configured to, if input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
  • a feature distribution value acquisition unit configured to perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution value corresponding to each column of sample data;
  • a composite test value acquisition unit configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data;
  • an association column information acquisition unit configured to screen out, from the sample feature matrix, the associated column information corresponding to the preset associated screening coefficient according to the composite test value.
  • in a third aspect, the embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and operable on the processor; when the processor executes the computer program, it implements the data association feature analysis method described in the first aspect above.
  • in a fourth aspect, the embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the data association feature analysis method described in the first aspect above.
  • Embodiments of the present application provide a data association feature analysis method, device, computer equipment, and readable storage medium.
  • the initial sample data is converted according to the data conversion rules to obtain the sample feature matrix and the sample detection result matrix.
  • according to the sample feature analysis rules and the sample detection result matrix, feature analysis is performed on each column of sample data in the sample feature matrix to obtain the corresponding feature distribution value. Distribution statistics are performed on the feature distribution values corresponding to each column of sample data to obtain the corresponding composite test value, and the associated column information corresponding to the associated screening coefficient is screened out from the sample feature matrix according to the composite test value.
  • with the above method, the feature distribution values obtained according to the sample feature analysis rules are used for distribution statistics, and the associated column information is screened out from the sample feature matrix according to the composite test values obtained from the distribution statistics, so that massive data can be analyzed quickly to accurately obtain associated features.
  • FIG. 1 is a schematic flow diagram of a data association feature analysis method provided in an embodiment of the present application
  • FIG. 2 is a schematic diagram of the sub-flow of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 6 is another schematic flowchart of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of another sub-flow of the data association feature analysis method provided by the embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a data association feature analysis device provided in an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flow diagram of the data association feature analysis method provided by the embodiment of the present application; the method is applied to a user terminal or a management server and is executed by application software installed on the user terminal or the management server.
  • the management server is a server that can execute the data association feature analysis method to perform association feature analysis on the initial sample data.
  • the management server can be a server built inside an enterprise or a government department. The user terminal is a terminal device, such as a desktop computer, a notebook computer, a tablet computer or a mobile phone, that can execute the data association feature analysis method to perform association feature analysis on the initial sample data. As shown in FIG. 1, the method includes steps S110-S140.
  • the initial sample data is converted according to a preset data conversion rule to obtain a corresponding sample feature matrix and a sample detection result matrix.
  • the user can input the initial sample data to the user terminal or the management server.
  • the initial sample data can be the genetic data and test results of the sample.
  • the genetic data can be all or part of the gene sequence contained in a pair of chromosomes.
  • the test result is detection information on whether the patient has the disease.
  • through data association analysis, this technical solution can screen out, from the genetic data, the gene points that are strongly correlated with the detection results.
  • the initial sample data can be converted according to data conversion rules, wherein the data conversion rules include sample data mapping information and detection result mapping information.
  • step S110 includes sub-steps S111 and S112.
  • the sample feature data of each sample in the initial sample data can be mapped according to the sample data mapping information. The sample feature data is the genetic data of each sample in the initial sample data; the various types of genetic data are converted by the mapping process to obtain the sample feature matrix, which includes the sample data corresponding to the genetic data of each sample.
  • for example, the sample data mapping information may include the mappings AA to 0, AG to 1, and GG to 2.
  • for example, if the initial sample data contains 1963 samples and the genetic data of each sample contains 317503 gene points, a sample feature matrix with 1963 rows and 317503 columns is obtained.
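
The sample data mapping described above can be illustrated with a minimal sketch. It assumes that genotypes arrive as two-letter strings and reuses the example codes from the text (AA to 0, AG to 1, GG to 2); the names GENOTYPE_CODES and build_sample_feature_matrix are hypothetical and are not taken from the application.

```python
# Hypothetical mapping table, using the example codes given in the text.
GENOTYPE_CODES = {"AA": 0, "AG": 1, "GG": 2}


def build_sample_feature_matrix(genotypes):
    """Map each sample's genotype strings to integer codes.

    genotypes: list of rows, one row per sample, one entry per gene point.
    Returns a list of rows with the same shape, holding integer codes.
    """
    return [[GENOTYPE_CODES[g] for g in row] for row in genotypes]


# Two samples with three gene points each (illustrative data only).
samples = [
    ["AA", "AG", "GG"],
    ["GG", "GG", "AA"],
]
X = build_sample_feature_matrix(samples)
```

With real data of the size quoted in the text, the same function would yield 1963 rows of 317503 codes each; an unknown genotype string raises a KeyError in this sketch rather than being silently coded.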
  • the detection results in the initial sample data may be mapped according to the detection result mapping information.
  • the detection results may include the detection results of one or more diseases. If the detection result covers only one disease, a detection result of "diseased" is mapped to "1" and a detection result of "not diseased" is mapped to "0"; if the detection result covers multiple diseases, a detection result of suffering from the multiple diseases is mapped to "1" and all other detection results are mapped to "0".
  • the detection results of 1963 samples in the initial sample data are mapped to obtain a sample detection result matrix with 1963 rows and 1 column.
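
The detection result mapping for the single-disease case can be sketched in the same way. The label strings and the function name build_detection_result_matrix are assumptions made for illustration; the application does not specify how the raw detection results are encoded.

```python
def build_detection_result_matrix(results, positive_label="diseased"):
    """Map raw detection results to a one-column 0/1 matrix.

    results: list with one label per sample.
    positive_label: hypothetical label marking a diseased sample.
    Returns a list of one-element rows, so n samples give an n x 1 matrix.
    """
    return [[1 if r == positive_label else 0] for r in results]


# Three samples (illustrative data only).
Y = build_detection_result_matrix(["diseased", "not diseased", "diseased"])
```

For the 1963 samples mentioned in the text, this produces the 1963-row, 1-column matrix described there.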
  • the sample feature analysis rules are the specific rules for analyzing the sample feature matrix. Based on the sample feature analysis rules and the sample detection result matrix, the feature analysis of each column of sample data in the sample feature matrix can be performed to obtain the feature distribution value corresponding to each column of sample data.
  • the feature distribution value is the distribution value of the feature of each gene point in each sample in a specific distribution state.
  • the sample feature analysis rules include hidden variable calculation formulas and feature calculation formulas.
  • step S120 includes sub-steps S121 and S122.
  • the sample feature matrix can be calculated according to the hidden variable calculation formula to obtain the corresponding hidden variable matrix, which includes the hidden correlation between each column of sample data and the corresponding detection results.
  • the sample feature matrix X can be decomposed according to the hidden variable calculation formula, so that X can be expressed by formula (1):
  • the characteristic distribution value of each column of sample data can be calculated separately according to the characteristic calculation formula.
  • the feature calculation formula includes a degree of freedom value calculation formula, a block matrix formula and a distribution value calculation formula.
  • step S122 includes sub-steps S1221, S1222 and S1223.
  • n is the number of rows of the sample feature matrix X
  • d is the number of columns of the hidden variable matrix G.
  • an inverse operation can be performed on the hidden variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula to obtain an estimated value corresponding to each column of sample data.
  • $X_i = Y B_i + G \Gamma_i + E_i$
  • $X_i$ is the sample data of column i in the sample feature matrix
  • $Y$ is the sample detection result matrix
  • $B_i$ is the coefficient corresponding to the sample detection result matrix
  • $G$ is the hidden variable matrix
  • $\Gamma_i$ is the coefficient corresponding to the hidden variable matrix
  • $E_i$ is the residual
  • the residuals corresponding to any column of sample data are independent of each other.
  • T is the matrix transpose symbol.
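
One conventional way to estimate the coefficients of a column model like the one above, in which a column X_i is explained by Y, by G and by a residual, is ordinary least squares on the stacked design matrix [Y | G]. The sketch below illustrates that idea only; it is an assumption, not the block matrix formula of the application. The function names are hypothetical, Y is passed as a flat list of 0/1 values, and G as a list of rows.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def estimate_column(x_col, y, g):
    """Least-squares estimate of the coefficient on Y and the coefficients on G."""
    # Design matrix W = [Y | G], one row per sample.
    w = [[y[r]] + list(g[r]) for r in range(len(x_col))]
    p = len(w[0])
    # Normal equations: (W^T W) theta = W^T x.
    wtw = [[sum(row[a] * row[b] for row in w) for b in range(p)] for a in range(p)]
    wtx = [sum(row[a] * x for row, x in zip(w, x_col)) for a in range(p)]
    theta = solve_linear_system(wtw, wtx)
    return theta[0], theta[1:]


# Illustrative data built as x = 2.0 * y + 0.5 * g with no residual.
y = [1, 0, 1, 0]
g = [[1.0], [1.0], [-1.0], [-1.0]]
x_col = [2.5, 0.5, 1.5, -0.5]
b_hat, gamma_hat = estimate_column(x_col, y, g)
```

Because the example column is constructed without noise, the estimates recover the generating coefficients up to floating-point error. For hundreds of thousands of columns, a vectorized linear algebra library would normally replace the pure-Python loops.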
  • the feature distribution value of each column of sample data can be further calculated; specifically, it is calculated by the distribution value calculation formula. Since each column of sample data contains the feature data of every sample at the same gene point, the feature distribution value of each column covers one gene point across all samples; that is, the number of distribution values contained in the feature distribution value of each column of sample data is equal to the number of samples.
  • the distribution value calculation formula can be expressed by formula (4):
  • z is the degree of freedom value
  • $t_i$ is the calculated characteristic distribution value
  • distribution statistics can be performed on the feature distribution values to obtain the corresponding composite test values; each column of sample data yields one composite test value.
  • step S130 includes sub-steps S131 and S132.
  • extreme value distribution statistics can be performed on the feature distribution value of each column of sample data. Specifically, as the sample size tends to infinity, the feature distribution value t of any column of sample data approximately follows a normal distribution. Using the extreme value distribution theorem, the distribution form of the value with the largest absolute value can be determined as the target distribution form corresponding to the feature distribution value t of the current column of sample data, and the distribution parameters of the target distribution form are then obtained as the corresponding feature distribution value statistical information.
  • a normal distribution can be represented by expression (5):
  • the μ and σ in the above expression are the corresponding distribution parameters.
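
For reference, the textbook density of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is shown below. It is given as the conventional form, on the assumption that expression (5) denotes the usual normal density.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```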
  • the user terminal or management server also pre-stores a test value data table, which contains the test value corresponding to each statistical form. After the statistical information of the feature distribution value is obtained, the test value corresponding to the statistical form of that information is looked up in the test value data table and used as the composite test value.
  • the associated column information corresponding to the preset associated screening coefficient is screened out from the sample feature matrix according to the composite test value.
  • the sample feature matrix can be screened according to the composite test value and the associated screening coefficient to obtain the corresponding associated column information.
  • the associated column information contains at least one column code value, and the column code values it contains indicate the gene points in the gene sequence that are strongly correlated with the disease.
  • step S1401 is further included before step S140.
  • the corresponding associated screening coefficient can also be calculated according to the calculation formula of the screening coefficient and the column number of the sample feature matrix.
  • the calculation formula of the screening coefficient can be expressed by formula (6):
  • e is the preset parameter value in the formula
  • m is the column number of the sample feature matrix
  • step S140 includes sub-steps S141 and S142.
  • it is judged whether the composite test value of each column of sample data is less than the associated screening coefficient. If it is, the gene point corresponding to the composite test value is significantly correlated; if it is not, the gene point corresponding to the composite test value has no significant correlation. According to the judgment result, the composite test values smaller than the associated screening coefficient are taken as the target test values.
  • each target test value corresponds to a column of sample data in the sample feature matrix, so the column code value corresponding to each target test value can be obtained from the sample feature matrix, and the column code values are combined to obtain the corresponding associated column information.
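
The screening in sub-steps S141 and S142 amounts to a threshold comparison followed by collecting column indices, which can be sketched as follows. The function name select_associated_columns, the use of 0-based column indices as column code values, and the numeric threshold are all assumptions made for illustration; the sketch takes the associated screening coefficient as a given input and does not reproduce formula (6).

```python
def select_associated_columns(composite_values, screening_coefficient):
    """Return the indices of columns whose composite test value is below the threshold.

    composite_values: one composite test value per column of the sample feature matrix.
    screening_coefficient: the associated screening coefficient (given, not computed here).
    """
    return [
        col
        for col, value in enumerate(composite_values)
        if value < screening_coefficient
    ]


# Four columns with illustrative composite test values and a hypothetical threshold.
composite_values = [0.2, 1e-9, 0.03, 4e-8]
associated_columns = select_associated_columns(composite_values, 1e-7)
```

Here columns 1 and 3 fall below the threshold and would be reported as the associated column information; a value exactly equal to the threshold is not selected, matching the "less than" wording of the text.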
  • in the data association feature analysis method provided by the embodiments of the present application, the initial sample data is converted according to the data conversion rules to obtain the sample feature matrix and the sample detection result matrix; feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rules and the sample detection result matrix to obtain the corresponding feature distribution value; distribution statistics are performed on the feature distribution values to obtain the corresponding composite test values; and the associated column information corresponding to the associated screening coefficient is screened out from the sample feature matrix according to the composite test values.
  • with the above method, the feature distribution values obtained according to the sample feature analysis rules are used for distribution statistics, and the associated column information is screened out from the sample feature matrix according to the composite test values obtained from the distribution statistics, so that massive data can be analyzed quickly to accurately obtain associated features.
  • the embodiment of the present application also provides a data association feature analysis device, which can be configured in a user terminal or a management server and is used to implement any embodiment of the aforementioned data association feature analysis method.
  • FIG. 8 is a schematic block diagram of an apparatus for analyzing data association features provided by an embodiment of the present application.
  • the data association feature analysis device 100 includes a data conversion unit 110, a feature distribution value acquisition unit 120, a composite test value acquisition unit 130 and an association column information acquisition unit 140.
  • the data conversion unit 110 is configured to convert the initial sample data according to preset data conversion rules to obtain a corresponding sample feature matrix and sample detection result matrix if the input initial sample data is received.
  • the data conversion unit 110 includes subunits: a sample feature matrix acquisition unit, configured to map the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain a corresponding sample feature matrix; and a sample detection result matrix acquisition unit, configured to map the detection result of each sample in the initial sample data according to the detection result mapping information to obtain a corresponding sample detection result matrix.
  • the feature distribution value acquisition unit 120 is configured to perform feature analysis on each column of sample data in the sample feature matrix according to preset sample feature analysis rules and the sample detection result matrix to obtain a feature distribution value corresponding to each column of sample data.
  • the feature distribution value acquisition unit 120 includes subunits: a hidden variable matrix acquisition unit, used to calculate the sample feature matrix according to the hidden variable calculation formula to obtain a corresponding hidden variable matrix; and a feature calculation unit, configured to calculate the hidden variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of sample data.
  • the feature calculation unit includes subunits: a degree of freedom value calculation unit, used to calculate the number of rows of the sample feature matrix and the number of columns of the hidden variable matrix according to the degree of freedom value calculation formula in the feature calculation formula to obtain the corresponding degree of freedom value; an estimated value calculation unit, used to perform an inverse operation on the hidden variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain the estimated value corresponding to each column of sample data; and a distribution value calculation unit, used to calculate, according to the distribution value calculation formula in the feature calculation formula, the degree of freedom value, the estimated value corresponding to each column of sample data, the hidden variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix, to obtain the feature distribution value corresponding to each column of sample data.
  • the composite test value acquisition unit 130 is configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data.
  • the composite test value acquisition unit 130 includes subunits: a feature distribution value statistics unit, used to perform extreme value distribution statistics on the feature distribution values corresponding to each column of sample data to obtain the feature distribution value statistical information of each column of sample data; and a test value acquisition unit, configured to obtain, according to a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
  • the association column information acquisition unit 140 is configured to filter out the association column information corresponding to the preset association screening coefficients from the sample feature matrix according to the composite test value.
  • the data association feature analysis device 100 further includes a subunit: an associated screening coefficient calculation unit, used to calculate the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula to obtain the associated screening coefficient.
  • the association column information acquisition unit 140 includes subunits: a target test value determination unit, used to judge whether the composite test value of each column of sample data is smaller than the associated screening coefficient and, according to the judgment result, determine the composite test values smaller than the associated screening coefficient as target test values; and a column code value combination unit, used to obtain the column code values corresponding to the target test values in the sample feature matrix and combine them as the associated column information corresponding to the associated screening coefficient.
  • the data association feature analysis device applies the above-mentioned data association feature analysis method: it converts the initial sample data according to the data conversion rules to obtain a sample feature matrix and a sample detection result matrix; performs feature analysis on each column of sample data in the sample feature matrix according to the sample feature analysis rules and the sample detection result matrix to obtain the corresponding feature distribution value; performs distribution statistics on the feature distribution values corresponding to each column of sample data to obtain the corresponding composite test value; and screens out, from the sample feature matrix, the associated column information corresponding to the associated screening coefficient according to the composite test value.
  • with the above method, the feature distribution values obtained according to the sample feature analysis rules are used for distribution statistics, and the associated column information is screened out from the sample feature matrix according to the composite test values obtained from the distribution statistics, so that massive data can be analyzed quickly to accurately obtain associated features.
  • the above-mentioned device for analyzing data association features can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9 .
  • FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device may be a user terminal or a management server for performing the data correlation feature analysis method to perform correlation feature analysis on the initial sample data.
  • the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.
  • the storage medium 503 can store an operating system 5031 and a computer program 5032.
  • when the computer program 5032 is executed, it can cause the processor 502 to execute the data association feature analysis method, wherein the storage medium 503 may be a volatile storage medium or a non-volatile storage medium.
  • the processor 502 is used to provide calculation and control capabilities and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can execute the data association feature analysis method.
  • the network interface 505 is used for network communication, such as providing data transmission and the like.
  • FIG. 9 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer device 500 on which the solution of this application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory, so as to realize the corresponding functions in the above-mentioned data association feature analysis method.
  • the embodiment of the computer device shown in FIG. 9 does not constitute a limitation on the specific composition of the computer device.
  • the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in FIG. 9 , and will not be repeated here.
  • the processor 502 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the steps included in the above-mentioned data association feature analysis method are implemented.
  • the disclosed devices, apparatuses and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division by logical function.
  • in actual implementation there may be other division methods: units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the readable storage medium includes several instructions to enable a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned computer-readable storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

Disclosed in the present application are a data association feature analysis method and apparatus, and a device and a medium. The method comprises: performing conversion processing on initial sample data according to a data conversion rule, so as to obtain a sample feature matrix and a sample testing result matrix; performing feature analysis on each column of sample data in the sample feature matrix according to a sample feature analysis rule and the sample testing result matrix, so as to obtain a corresponding feature distribution value; compiling distribution statistics on the feature distribution value corresponding to each column of the sample data, so as to obtain a corresponding composite test value; and according to the composite test value, screening, from the sample feature matrix, out associated column information that corresponds to an associated screening coefficient.

Description

Data association feature analysis method, apparatus, device and medium
This application claims priority to Chinese patent application No. 202111164594.6, entitled "Data association feature analysis method, apparatus, device and medium", filed with the China Patent Office on September 30, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of big data analysis, and in particular to a data association feature analysis method, apparatus, device and medium.
Background
Big data analysis can reveal associations between causes and results, for example by performing association analysis between a genome and a disease to determine which specific genes are associated with that disease. However, the inventor found that, because genes contain a huge amount of information, the gene sequences to be analyzed also contain a very large amount of data. As the number of samples increases, existing association feature analysis methods become inefficient when analyzing massive genetic data and cannot accurately locate the genes associated with the disease. The prior-art methods therefore cannot analyze massive data quickly enough to obtain association features accurately.
Summary of the invention
The embodiments of the present application provide a data association feature analysis method, apparatus, device and medium, aiming to solve the problem in the prior art that massive data cannot be analyzed quickly to obtain association features accurately.
In a first aspect, an embodiment of the present application provides a data association feature analysis method, which includes:
if input initial sample data is received, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
performing feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix to obtain a feature distribution value corresponding to each column of the sample data;
performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of the sample data;
screening out, from the sample feature matrix according to the composite test values, associated column information corresponding to a preset association screening coefficient.
In a second aspect, an embodiment of the present application provides a data association feature analysis apparatus, which includes:
a data conversion unit, configured to, if input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
a feature distribution value acquisition unit, configured to perform feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix to obtain a feature distribution value corresponding to each column of the sample data;
a composite test value acquisition unit, configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of the sample data;
an associated column information acquisition unit, configured to screen out, from the sample feature matrix according to the composite test values, associated column information corresponding to a preset association screening coefficient.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data association feature analysis method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the data association feature analysis method of the first aspect.
The embodiments of the present application provide a data association feature analysis method, apparatus, computer device and readable storage medium. The initial sample data is converted according to the data conversion rule to obtain a sample feature matrix and a sample detection result matrix; feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rule and the sample detection result matrix to obtain the corresponding feature distribution value; distribution statistics are performed on the feature distribution value of each column of sample data to obtain the corresponding composite test value; and associated column information corresponding to the association screening coefficient is screened out from the sample feature matrix according to the composite test values. In this way, feature distribution values are obtained according to the sample feature analysis rule and subjected to distribution statistics, and the resulting composite test values are used to screen associated column information out of the sample feature matrix, so that massive data can be analyzed quickly and accurate association features can be obtained.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the data association feature analysis method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a sub-flow of the data association feature analysis method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another sub-flow of the data association feature analysis method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of another sub-flow of the data association feature analysis method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of another sub-flow of the data association feature analysis method provided by an embodiment of the present application;
FIG. 6 is another schematic flowchart of the data association feature analysis method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of another sub-flow of the data association feature analysis method provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of the data association feature analysis apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1, which is a schematic flowchart of the data association feature analysis method provided by an embodiment of the present application. The method is applied in a user terminal or a management server and is executed by application software installed there. The management server is a server that can execute the method to perform association feature analysis on initial sample data, and may be a server built within an enterprise or a government department. The user terminal is a terminal device that can execute the method to perform association feature analysis on initial sample data, such as a desktop computer, notebook computer, tablet computer or mobile phone. As shown in FIG. 1, the method includes steps S110 to S140.
S110. If input initial sample data is received, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix.
The user can input initial sample data to the user terminal or the management server. The initial sample data may be the genetic data and detection results of the samples: the genetic data may be all or part of the gene sequence contained in a pair of chromosomes, and the detection result indicates whether the sample has the disease. Through data association analysis, this technical solution screens out from the genetic data the gene loci that are strongly associated with the detection results. The initial sample data is converted according to the data conversion rule, which includes sample data mapping information and detection result mapping information.
In one embodiment, as shown in FIG. 2, step S110 includes sub-steps S111 and S112.
S111. Map the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain the corresponding sample feature matrix.
Specifically, the sample feature data of each sample in the initial sample data is mapped according to the sample data mapping information. The sample feature data is the genetic data of each sample; genetic data of multiple types can be mapped to obtain the sample feature matrix, which then contains the sample data corresponding to the genetic data of each sample. Each gene locus on a chromosome corresponds to two bases, so a single locus can take several genotypes, for example A=T-A=T, A=T-G≡C and G≡C-G≡C, where A and G are the alleles. The base that occurs less often is the minor allele; for example, if G occurs less often than A, G is called the minor allele, and the sample data mapping information includes the mappings AA → 0, AG → 1 and GG → 2.
For example, if the initial sample data contains 1963 samples and the genetic data of each sample contains 317503 gene loci, a sample feature matrix with 1963 rows and 317503 columns is obtained.
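The genotype mapping of step S111 can be sketched as follows. This is a minimal illustration, assuming genotypes arrive as two-letter strings; the names `GENOTYPE_CODES` and `encode_genotypes` are hypothetical and do not come from the application.

```python
import numpy as np

# Hypothetical mapping table following the AA -> 0, AG -> 1, GG -> 2 example:
# each code is the number of copies of the minor allele (here G).
GENOTYPE_CODES = {"AA": 0, "AG": 1, "GG": 2}

def encode_genotypes(samples):
    """Turn per-sample genotype lists into an integer matrix.

    Rows are samples, columns are gene loci.
    """
    return np.array(
        [[GENOTYPE_CODES[genotype] for genotype in row] for row in samples],
        dtype=np.int8,
    )

# Two samples, three loci.
samples = [
    ["AA", "AG", "GG"],
    ["AG", "AA", "AA"],
]
X = encode_genotypes(samples)
print(X)
```

With 1963 samples and 317503 loci the same function would return a 1963 × 317503 matrix; a small integer dtype such as `int8` keeps that matrix compact.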
S112. Map the detection result of each sample in the initial sample data according to the detection result mapping information to obtain the corresponding sample detection result matrix.
The detection results in the initial sample data are mapped according to the detection result mapping information. A detection result may cover one or more diseases. If it covers only one disease, the result "diseased" is mapped to "1" and the result "not diseased" is mapped to "0". If it covers multiple diseases, a result in which the sample has all of the diseases at the same time is mapped to "1" and every other result is mapped to "0".
For example, mapping the detection results of the 1963 samples in the initial sample data yields a sample detection result matrix with 1963 rows and 1 column.
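The result mapping of step S112 can be sketched in the same way. The input layout (one list of booleans per sample, one entry per disease tested) and the name `encode_results` are assumptions made for illustration.

```python
import numpy as np

def encode_results(results):
    """Map per-sample detection results to an (n_samples, 1) column of 0/1.

    A sample is coded 1 only if it is positive for every disease tested,
    which covers both the single-disease and the multi-disease rule.
    """
    return np.array([[1 if all(r) else 0] for r in results], dtype=np.int8)

# Single-disease results for two samples.
Y_single = encode_results([[True], [False]])

# Two-disease results: only the first sample has both diseases.
Y_multi = encode_results([[True, True], [True, False], [False, False]])
print(Y_single.ravel(), Y_multi.ravel())
```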
S120. Perform feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix to obtain a feature distribution value corresponding to each column of the sample data.
Each column of sample data is the feature data of all samples at the same gene locus. The sample feature analysis rule is the specific rule for analyzing the sample feature matrix. Based on this rule and the sample detection result matrix, feature analysis is performed on each column of sample data in the sample feature matrix to obtain the feature distribution value of that column; the feature distribution value is the distribution value, within a specific distribution, of the feature of each gene locus in each sample. The sample feature analysis rule includes a latent variable calculation formula and a feature calculation formula.
In one embodiment, as shown in FIG. 3, step S120 includes sub-steps S121 and S122.
S121. Calculate the sample feature matrix according to the latent variable calculation formula to obtain the corresponding latent variable matrix.
First, the sample feature matrix is calculated according to the latent variable calculation formula to obtain the corresponding latent variable matrix, which captures the hidden correlation between each column of sample data and the corresponding detection results.
Specifically, the sample feature matrix is decomposed according to the latent variable calculation formula, so that the sample feature matrix X can be expressed by formula (1):
$X = U D V^{T} = U_1 D_1 V_1^{T} + U_2 D_2 V_2^{T}$ (1);
where $T$ denotes matrix transposition; the columns of $U$ and of $V$ are orthogonal, i.e. $V^{T}V = I$ and $U^{T}U = I$, with $I$ the identity matrix (ones on the diagonal); $U_1$ and $U_2$ are the sub-matrices obtained by partitioning $U$ by columns, i.e. $U = (U_1, U_2)$; $V_1$ and $V_2$ are the sub-matrices obtained by partitioning $V$ by columns, i.e. $V = (V_1, V_2)$; and $D = \mathrm{diag}(D_1, D_2)$ is a diagonal matrix, called the singular value matrix, whose values are arranged in descending order. The matrix $U_1$ obtained from the decomposition is used as the latent variable matrix $G$.
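The decomposition in formula (1) is a singular value decomposition, so $G = U_1$ can be sketched with NumPy as below. The number of retained columns `d` is a hypothetical parameter, since the text does not say how the column partition of $U$ is chosen, and the random matrix merely stands in for real genotype data.

```python
import numpy as np

def latent_variable_matrix(X, d):
    """Return G = U_1, the first d left singular vectors of X.

    np.linalg.svd returns singular values in descending order, so the
    first d columns of U correspond to the d largest singular values.
    """
    U, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :d]

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 50)).astype(float)  # stand-in for 0/1/2 genotype codes
G = latent_variable_matrix(X, d=3)
print(G.shape)
```

For a matrix as large as 1963 × 317503, a truncated or randomized SVD would normally be preferable to a full decomposition; that is a practical remark rather than something stated in the application.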
S122. Calculate the latent variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of the sample data.
Once the latent variable matrix has been obtained, the feature distribution value of each column of sample data is calculated according to the feature calculation formula. The feature calculation formula includes a degree-of-freedom calculation formula, a block matrix formula and a distribution value calculation formula.
In one embodiment, as shown in FIG. 4, step S122 includes sub-steps S1221, S1222 and S1223.
S1221. Calculate the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to the degree-of-freedom calculation formula in the feature calculation formula to obtain the corresponding degree-of-freedom value.
First, the degree-of-freedom value is calculated from the number of rows of the sample feature matrix and the number of columns of the latent variable matrix; the same degree-of-freedom value applies to every column of sample data. The degree-of-freedom calculation formula can be expressed by formula (2):
[Formula (2): Figure PCTCN2021124577-appb-000001]
where $n$ is the number of rows of the sample feature matrix $X$ and $d$ is the number of columns of the latent variable matrix $G$.
S1222. Perform an inverse operation on the latent variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula in the feature calculation formula to obtain the estimated value corresponding to each column of the sample data.
The inverse operation is performed on the latent variable matrix, the sample detection result matrix and the sample feature matrix according to the block matrix formula to obtain the estimated value corresponding to each column of sample data. Every column of sample data satisfies the relationship $X_i = Y B_i + G \Gamma_i + E_i$, where $X_i$ is the $i$-th column of sample data in the sample feature matrix, $Y$ is the sample detection result matrix, $B_i$ is the coefficient of $Y$, $G$ is the latent variable matrix, $\Gamma_i$ is the coefficient of $G$, and $E_i$ is the residual; the residuals of different columns of sample data are mutually independent.
The estimate of $B_i$ (inline symbol, Figure PCTCN2021124577-appb-000002) is the estimated value corresponding to each column of sample data. This estimate (Figure PCTCN2021124577-appb-000003) can be expressed by formula (3):
[Formula (3): Figure PCTCN2021124577-appb-000004]
where $T$ denotes matrix transposition.
S1223. Calculate the degree-of-freedom value, the estimated value of each column of the sample data, the latent variable matrix, the sample detection result matrix and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula to obtain the feature distribution value corresponding to each column of the sample data.
The feature distribution value of each column of sample data is calculated from the degree-of-freedom value, the estimated value of that column, the latent variable matrix and the sample detection result matrix, using the distribution value calculation formula. Because each column of sample data contains the feature data of every sample at the same gene locus, the feature distribution value calculated for a column contains the feature distribution values of one gene locus for all samples; that is, the number of distribution values in the feature distribution value of each column equals the number of samples.
The distribution value calculation formula can be expressed by formula (4):
[Formula (4): Figure PCTCN2021124577-appb-000005]
where $z$ is the degree-of-freedom value and $t_i$ is the calculated feature distribution value.
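Formulas (2) to (4) are only available as images, so they cannot be reproduced here. As a rough illustration of how a per-column estimate and test statistic could be obtained under the model $X_i = Y B_i + G \Gamma_i + E_i$, the sketch below uses textbook ordinary least squares with the design matrix $[Y, G]$ and an assumed residual degree of freedom of $n - d - 1$. Both choices are assumptions, not the application's formulas; in particular, this sketch yields one statistic per column, which may differ from what formula (4) computes. The function name and the simulated data are hypothetical.

```python
import numpy as np

def column_t_statistics(X, Y, G):
    """Textbook OLS sketch (an assumption, not formulas (2)-(4)).

    Regresses every column X_i on the design matrix W = [Y, G] and returns
    the estimated coefficient of Y, its t statistic, and the degrees of freedom.
    """
    n, d = G.shape
    W = np.hstack([Y, G])                    # design matrix, shape (n, 1 + d)
    dof = n - d - 1                          # assumed residual degrees of freedom
    WtW_inv = np.linalg.inv(W.T @ W)
    coef = WtW_inv @ W.T @ X                 # shape (1 + d, m); row 0 estimates B_i
    residuals = X - W @ coef
    sigma2 = (residuals ** 2).sum(axis=0) / dof
    std_err = np.sqrt(sigma2 * WtW_inv[0, 0])
    return coef[0], coef[0] / std_err, dof

# Simulated data: 30 samples, 5 columns, 2 latent variables.
rng = np.random.default_rng(1)
Y = rng.integers(0, 2, size=(30, 1)).astype(float)
G = rng.normal(size=(30, 2))
X = rng.normal(size=(30, 5))
X[:, 0] += 3.0 * Y[:, 0]                     # make column 0 depend on Y
b_hat, t, dof = column_t_statistics(X, Y, G)
print(b_hat, t, dof)
```

Solving all columns in one matrix product avoids a Python loop over columns, which matters when the matrix has hundreds of thousands of columns.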
S130. Perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of the sample data.
Distribution statistics are performed on the feature distribution values to obtain the corresponding composite test values, so that one composite test value is obtained for each column of sample data.
In one embodiment, as shown in FIG. 5, step S130 includes sub-steps S131 and S132.
S131. Perform extreme value distribution statistics on the feature distribution value of each column of the sample data to obtain the feature distribution value statistical information of each column of the sample data.
Specifically, extreme value distribution statistics are performed on the feature distribution value of each column of sample data. When the sample size tends to infinity, the distribution of the feature distribution value t of any column of sample data is approximately normal. Using the extreme value distribution theorem, the distribution form with the largest absolute value is determined as the target distribution form of the feature distribution value t of the current column, and the distribution parameters of the target distribution form are taken as the corresponding feature distribution value statistical information.
For example, a normal distribution can be represented by expression (5):
$R \sim N(\mu, \sigma^{2})$ (5);
where $\mu$ and $\sigma$ are the corresponding distribution parameters.
S132、根据预置的检验值数据表获取与每一所述特征分布值统计信息的统计形态对应的复合检验值。S132. Obtain a composite inspection value corresponding to the statistical form of each characteristic distribution value statistical information according to the preset inspection value data table.
用户终端或管理服务器中还预先存储有检验值数据表,检验值数据表中包含与每一统计形态对应的检验值,获取到特征分布值统计信息后,即可根据该统计信息对应的统计心态,通过查表方式从检验值数据表获取对应的一个检验值作为复合检验值。The user terminal or management server also pre-stores a test value data table, which contains the test value corresponding to each statistical form. After obtaining the statistical information of the characteristic distribution value, it can be based on the statistical mentality corresponding to the statistical information. , obtain a corresponding test value from the test value data table as a composite test value by means of table lookup.
S140、根据所述复合检验值从所述样本特征矩阵中筛选出与预置的关联筛选系数对应的关联列信息。S140. Filter out the correlation column information corresponding to the preset correlation screening coefficients from the sample feature matrix according to the composite test value.
根据所述复合检验值从所述样本特征矩阵中筛选出与预置的关联筛选系数对应的关联列信息。获取到复合检验值后,即可根据复合检验值及关联筛选系数对样本特征矩阵进行筛选,以从中获取对应的关联列信息,关联列信息中可包含至少一个列编码值,则关联列信息中包含的列编码值即可用于指示基因序列中与所患疾病之间存在较强相关性的基因点位。The associated column information corresponding to the preset associated screening coefficient is screened out from the sample feature matrix according to the composite test value. After the compound test value is obtained, the sample feature matrix can be screened according to the compound test value and the correlation screening coefficient to obtain the corresponding correlation column information. The correlation column information can contain at least one column code value, and the correlation column information The included column encoding values can be used to indicate the gene points in the gene sequence that have a strong correlation with the disease.
In one embodiment, as shown in FIG. 6, step S1401 is further included before step S140.
S1401. Calculate the association screening coefficient from the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula.
Before the sample feature matrix is screened according to the association screening coefficient, the coefficient may be calculated from the screening coefficient calculation formula and the number of columns of the sample feature matrix. Specifically, the screening coefficient calculation formula may be expressed as formula (6):
$$S = \frac{e}{m} \qquad (6)$$
where e is a preset parameter value in the formula, m is the number of columns of the sample feature matrix, and S is the calculated association screening coefficient. For example, with e = 0.05 and m = 317503, the calculation gives S ≈ 1.57 × 10⁻⁷.
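The worked example for formula (6) can be checked with a few lines of Python. This is only a minimal sketch: the function name `association_screening_coefficient` is an illustrative assumption and does not appear in the application.

```python
def association_screening_coefficient(e: float, m: int) -> float:
    """Return S = e / m as in formula (6).

    e: preset parameter value; m: number of columns of the sample feature matrix.
    """
    if m <= 0:
        raise ValueError("m must be a positive column count")
    return e / m


# Values taken from the example in the text.
S = association_screening_coefficient(0.05, 317503)
print(f"{S:.2e}")  # 1.57e-07
```

Because S shrinks as the number of columns m grows, a wider sample feature matrix leads to a stricter screening threshold.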
In one embodiment, as shown in FIG. 7, step S140 includes sub-steps S141 and S142.
S141. Determine, for each column of sample data, whether its composite test value is smaller than the association screening coefficient, and designate each composite test value that is smaller than the association screening coefficient as a target test value.
For each column of sample data, it is determined whether the composite test value is smaller than the association screening coefficient. If it is smaller, the gene locus corresponding to that composite test value is significantly correlated; if it is not smaller, the gene locus is not significantly correlated. The composite test values that are smaller than the association screening coefficient are then taken as the target test values.
S142. Obtain the column code values in the sample feature matrix that correspond to the target test values and combine them as the associated column information corresponding to the association screening coefficient.
The sample feature matrix is screened according to the target test values. Each target test value corresponds to one column of sample data in the sample feature matrix, so the column code value corresponding to each target test value can be obtained from the matrix and combined to form the associated column information. The gene loci corresponding to the column code values in the associated column information are those strongly correlated with the disease in question.
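Sub-steps S141 and S142 amount to a strict threshold comparison followed by collecting the matching column code values. The sketch below illustrates this under stated assumptions: the function name `select_associated_columns`, the column codes, and the composite test values are all made up for illustration and are not data from the application.

```python
from typing import List, Sequence


def select_associated_columns(
    column_codes: Sequence[str],
    composite_test_values: Sequence[float],
    screening_coefficient: float,
) -> List[str]:
    """Return the column codes whose composite test value is strictly
    smaller than the association screening coefficient (S141 + S142)."""
    if len(column_codes) != len(composite_test_values):
        raise ValueError("one composite test value is required per column")
    return [
        code
        for code, value in zip(column_codes, composite_test_values)
        if value < screening_coefficient  # "not smaller" means not significant
    ]


# Hypothetical column codes and composite test values.
codes = ["col_0", "col_1", "col_2", "col_3"]
values = [3.0e-9, 0.2, 1.57e-7, 9.9e-8]
associated = select_associated_columns(codes, values, 1.57e-7)
print(associated)  # ['col_0', 'col_3']
```

Note that `col_2`, whose value equals the coefficient, is excluded, since the text requires the composite test value to be smaller than the coefficient.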
In the data association feature analysis method provided by the embodiments of the present application, the initial sample data is converted according to the data conversion rule to obtain a sample feature matrix and a sample detection result matrix; feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rule and the sample detection result matrix to obtain the corresponding feature distribution values; distribution statistics are performed on the feature distribution value of each column of sample data to obtain the corresponding composite test value; and the associated column information corresponding to the association screening coefficient is screened out from the sample feature matrix according to the composite test values. With this method, feature distribution values are obtained according to the sample feature analysis rule and subjected to distribution statistics, and the associated column information is screened out of the sample feature matrix according to the resulting composite test values, so that massive data can be analyzed quickly to obtain accurate association features.
An embodiment of the present application further provides a data association feature analysis apparatus. The apparatus may be configured in a user terminal or a management server and is used to perform any embodiment of the foregoing data association feature analysis method. Specifically, refer to FIG. 8, which is a schematic block diagram of the data association feature analysis apparatus provided by an embodiment of the present application.
As shown in FIG. 8, the data association feature analysis apparatus 100 includes a data conversion unit 110, a feature distribution value acquisition unit 120, a composite test value acquisition unit 130, and an associated column information acquisition unit 140.
The data conversion unit 110 is configured to, upon receiving input initial sample data, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix.
In a specific embodiment, the data conversion unit 110 includes the following subunits: a sample feature matrix acquisition unit, configured to map the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain the corresponding sample feature matrix; and a sample detection result matrix acquisition unit, configured to map the detection result of each sample in the initial sample data according to the detection result mapping information to obtain the corresponding sample detection result matrix.
The feature distribution value acquisition unit 120 is configured to perform feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix, to obtain a feature distribution value corresponding to each column of sample data.
In a specific embodiment, the feature distribution value acquisition unit 120 includes the following subunits: a latent variable matrix acquisition unit, configured to calculate the corresponding latent variable matrix from the sample feature matrix according to the latent variable calculation formula; and a feature calculation unit, configured to perform calculation on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
In a specific embodiment, the feature calculation unit includes the following subunits: a degree-of-freedom value calculation unit, configured to calculate the corresponding degree-of-freedom value from the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to the degree-of-freedom value calculation formula in the feature calculation formula; an estimated value calculation unit, configured to perform an inverse operation on the latent variable matrix, the sample detection result matrix, and the sample feature matrix according to the block matrix formula in the feature calculation formula, to obtain the estimated value corresponding to each column of sample data; and a distribution value calculation unit, configured to perform calculation on the degree-of-freedom value, the estimated value corresponding to each column of sample data, the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
The composite test value acquisition unit 130 is configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data.
In a specific embodiment, the composite test value acquisition unit 130 includes the following subunits: a feature distribution value statistics unit, configured to perform extreme value distribution statistics on the feature distribution value corresponding to each column of sample data, to obtain the feature distribution value statistical information of each column of sample data; and a test value acquisition unit, configured to obtain, from a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
The associated column information acquisition unit 140 is configured to screen out, from the sample feature matrix and according to the composite test values, the associated column information corresponding to a preset association screening coefficient.
In a specific embodiment, the data association feature analysis apparatus 100 further includes the following subunit: an association screening coefficient calculation unit, configured to calculate the association screening coefficient from the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula.
In a specific embodiment, the associated column information acquisition unit 140 includes the following subunits: a target test value determination unit, configured to determine, for each column of sample data, whether its composite test value is smaller than the association screening coefficient, and to designate each composite test value that is smaller than the association screening coefficient as a target test value; and a column code value combination unit, configured to obtain the column code values in the sample feature matrix that correspond to the target test values and combine them as the associated column information corresponding to the association screening coefficient.
The data association feature analysis apparatus provided by the embodiments of the present application applies the data association feature analysis method described above. The initial sample data is converted according to the data conversion rule to obtain a sample feature matrix and a sample detection result matrix; feature analysis is performed on each column of sample data in the sample feature matrix according to the sample feature analysis rule and the sample detection result matrix to obtain the corresponding feature distribution values; distribution statistics are performed on the feature distribution value of each column of sample data to obtain the corresponding composite test value; and the associated column information corresponding to the association screening coefficient is screened out from the sample feature matrix according to the composite test values. With this method, feature distribution values are obtained according to the sample feature analysis rule and subjected to distribution statistics, and the associated column information is screened out of the sample feature matrix according to the resulting composite test values, so that massive data can be analyzed quickly to obtain accurate association features.
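The division into units 110 to 140 can be pictured as four stages chained in order. The following is a rough sketch of that wiring only, under stated assumptions: the class name `DataAssociationFeatureAnalyzer`, its field names, and the lambda bodies are hypothetical placeholders, and the placeholder arithmetic is not the formulas of the application.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class DataAssociationFeatureAnalyzer:
    """Hypothetical composition of the four units of apparatus 100."""

    convert: Callable[[Any], Tuple[Any, Any]]                     # unit 110
    feature_distribution: Callable[[Any, Any], Sequence[float]]   # unit 120
    composite_test: Callable[[Sequence[float]], Sequence[float]]  # unit 130
    select_columns: Callable[[Any, Sequence[float]], List[Any]]   # unit 140

    def run(self, initial_sample_data: Any) -> List[Any]:
        # Each unit consumes the output of the previous one.
        features, results = self.convert(initial_sample_data)
        distribution = self.feature_distribution(features, results)
        tests = self.composite_test(distribution)
        return self.select_columns(features, tests)


# Placeholder stages that only demonstrate the data flow.
analyzer = DataAssociationFeatureAnalyzer(
    convert=lambda data: (data["features"], data["results"]),
    feature_distribution=lambda X, y: [float(sum(col)) for col in zip(*X)],
    composite_test=lambda dist: [1.0 / (1.0 + abs(d)) for d in dist],
    select_columns=lambda X, tests: [j for j, t in enumerate(tests) if t < 0.3],
)
result = analyzer.run({"features": [[0, 2], [1, 2]], "results": [0, 1]})
print(result)  # [1]
```

Passing the stages in as callables mirrors the statement that the unit division is a logical one: any stage could be swapped for a different implementation without changing the order of the flow.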
The above data association feature analysis apparatus may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 9.
Refer to FIG. 9, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device may be a user terminal or a management server that performs the data association feature analysis method to carry out association feature analysis on the initial sample data.
Referring to FIG. 9, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the data association feature analysis method. The storage medium 503 may be a volatile or a non-volatile storage medium.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the data association feature analysis method.
The network interface 505 is used for network communication, such as transmitting data. Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the corresponding functions of the data association feature analysis method described above.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 9 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are the same as in the embodiment shown in FIG. 9 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. It stores a computer program which, when executed by a processor, implements the steps of the data association feature analysis method described above.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: units with the same function may be merged into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims (20)

  1. A data association feature analysis method, the method comprising:
    upon receiving input initial sample data, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
    performing feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix, to obtain a feature distribution value corresponding to each column of sample data;
    performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data; and
    screening out, from the sample feature matrix and according to the composite test values, associated column information corresponding to a preset association screening coefficient.
  2. The data association feature analysis method according to claim 1, wherein the data conversion rule includes sample data mapping information and detection result mapping information, and converting the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:
    mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain the corresponding sample feature matrix; and
    mapping the detection result of each sample in the initial sample data according to the detection result mapping information to obtain the corresponding sample detection result matrix.
  3. The data association feature analysis method according to claim 1, wherein the sample feature analysis rule includes a latent variable calculation formula and a feature calculation formula, and performing feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rule and the sample detection result matrix to obtain the feature distribution value corresponding to each column of sample data comprises:
    calculating a corresponding latent variable matrix from the sample feature matrix according to the latent variable calculation formula; and
    performing calculation on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
  4. The data association feature analysis method according to claim 3, wherein performing calculation on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula to obtain the feature distribution value corresponding to each column of sample data comprises:
    calculating a corresponding degree-of-freedom value from the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to the degree-of-freedom value calculation formula in the feature calculation formula;
    performing an inverse operation on the latent variable matrix, the sample detection result matrix, and the sample feature matrix according to the block matrix formula in the feature calculation formula, to obtain an estimated value corresponding to each column of sample data; and
    performing calculation on the degree-of-freedom value, the estimated value corresponding to each column of sample data, the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of sample data.
  5. The data association feature analysis method according to claim 1, wherein performing distribution statistics on the feature distribution values to obtain the composite test value corresponding to each column of sample data comprises:
    performing extreme value distribution statistics on the feature distribution value corresponding to each column of sample data, to obtain feature distribution value statistical information of each column of sample data; and
    obtaining, from a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
  6. The data association feature analysis method according to claim 1, wherein, before screening out, from the sample feature matrix and according to the composite test values, the associated column information corresponding to the preset association screening coefficient, the method comprises:
    calculating the association screening coefficient from the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula.
  7. The data association feature analysis method according to claim 1, wherein screening out, from the sample feature matrix and according to the composite test values, the associated column information corresponding to the preset association screening coefficient comprises:
    determining, for each column of sample data, whether its composite test value is smaller than the association screening coefficient, and designating each composite test value that is smaller than the association screening coefficient as a target test value; and
    obtaining the column code values in the sample feature matrix that correspond to the target test values and combining them as the associated column information corresponding to the association screening coefficient.
  8. A data association feature analysis apparatus, the apparatus comprising:
    a data conversion unit, configured to, upon receiving input initial sample data, convert the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
    a feature distribution value acquisition unit, configured to perform feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix, to obtain a feature distribution value corresponding to each column of sample data;
    a composite test value acquisition unit, configured to perform distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of sample data; and
    an associated column information acquisition unit, configured to screen out, from the sample feature matrix and according to the composite test values, associated column information corresponding to a preset association screening coefficient.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    if input initial sample data is received, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
    performing feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix, to obtain a feature distribution value corresponding to each column of the sample data;
    performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of the sample data;
    screening out from the sample feature matrix, according to the composite test values, association column information corresponding to a preset association screening coefficient.
  10. The computer device according to claim 9, wherein the data conversion rule comprises sample data mapping information and detection result mapping information, and the converting of the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:
    mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain the corresponding sample feature matrix;
    mapping the detection result of each sample in the initial sample data according to the detection result mapping information to obtain the corresponding sample detection result matrix.
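The two mapping steps in this claim can be sketched as follows. The mapping tables, the record layout, and the function name `convert` are hypothetical; the claim only states that sample data mapping information and detection result mapping information exist, not what they contain.

```python
import numpy as np

# Hypothetical mapping information; the actual tables are not given in the claim.
SAMPLE_DATA_MAPPING = {"AA": 0, "AG": 1, "GG": 2}
DETECTION_RESULT_MAPPING = {"negative": 0, "positive": 1}


def convert(initial_samples):
    """Map raw samples to (sample feature matrix, sample detection result matrix)."""
    features = np.array(
        [[SAMPLE_DATA_MAPPING[v] for v in s["features"]] for s in initial_samples],
        dtype=float,
    )
    results = np.array(
        [[DETECTION_RESULT_MAPPING[s["result"]]] for s in initial_samples],
        dtype=float,
    )
    return features, results


# Three hypothetical samples, each with two features and one detection result.
samples = [
    {"features": ["AA", "AG"], "result": "negative"},
    {"features": ["GG", "AG"], "result": "positive"},
    {"features": ["AG", "AA"], "result": "positive"},
]
X, y = convert(samples)
print(X.shape, y.shape)
```

Each row of `X` is one sample and each column one feature, so the later per-column analysis operates on `X[:, j]`.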
  11. The computer device according to claim 9, wherein the sample feature analysis rule comprises a latent variable calculation formula and a feature calculation formula, and the performing of feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rule and the sample detection result matrix, to obtain the feature distribution value corresponding to each column of the sample data, comprises:
    calculating on the sample feature matrix according to the latent variable calculation formula to obtain a corresponding latent variable matrix;
    calculating on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data.
  12. The computer device according to claim 11, wherein the calculating on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data, comprises:
    calculating on the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to a degrees-of-freedom calculation formula in the feature calculation formula to obtain a corresponding degrees-of-freedom value;
    performing an inverse operation on the latent variable matrix, the sample detection result matrix, and the sample feature matrix according to a block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of the sample data;
    calculating on the degrees-of-freedom value, the estimated value corresponding to each column of the sample data, the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to a distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data.
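One way to read the three steps of this claim is as a per-column linear regression with the latent variables as covariates. The sketch below follows that reading and is not the formula of the application, which this claim does not state. Assumptions: the degrees of freedom are taken as `n - k - 1` (rows of the feature matrix minus latent columns minus the tested column), the estimate comes from inverting the block Gram matrix of `[x_j | Z]`, and the feature distribution value is the resulting t-statistic. The function name is hypothetical.

```python
import numpy as np


def feature_distribution_values(X, y, Z):
    """Per-column t-statistics of y regressed on [x_j | Z] (assumed model)."""
    n, m = X.shape
    k = Z.shape[1]
    dof = n - k - 1  # assumed degrees-of-freedom formula
    y = np.asarray(y, dtype=float).reshape(-1)
    t_values = np.empty(m)
    for j in range(m):
        D = np.column_stack([X[:, j], Z])      # design matrix [x_j | Z]
        gram_inv = np.linalg.inv(D.T @ D)      # inverse of the 2x2 block Gram matrix
        beta = gram_inv @ D.T @ y              # least-squares estimates
        resid = y - D @ beta
        sigma2 = resid @ resid / dof           # residual variance
        t_values[j] = beta[0] / np.sqrt(sigma2 * gram_inv[0, 0])
    return dof, t_values


# Synthetic data: only column 0 drives the detection result.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = np.ones((50, 1))  # a single constant latent column, for illustration
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)
dof, t = feature_distribution_values(X, y, Z)
print(dof, np.round(t, 2))
```

In this reading, a column with a large absolute t-statistic is one whose values track the detection result after the latent variables are accounted for.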
  13. The computer device according to claim 9, wherein the performing of distribution statistics on the feature distribution values to obtain the composite test value corresponding to each column of the sample data comprises:
    performing extreme value distribution statistics on the feature distribution values corresponding to each column of the sample data to obtain feature distribution value statistical information for each column of the sample data;
    obtaining, according to a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
  14. The computer device according to claim 9, wherein, before the screening out from the sample feature matrix, according to the composite test values, of the association column information corresponding to the preset association screening coefficient, the steps further comprise:
    calculating on the number of columns of the sample feature matrix according to a preset screening coefficient calculation formula to obtain the association screening coefficient.
  15. The computer device according to claim 9, wherein the screening out from the sample feature matrix, according to the composite test values, of the association column information corresponding to the preset association screening coefficient comprises:
    determining whether the composite test value of each column of the sample data is smaller than the association screening coefficient, and, according to the result, determining each composite test value smaller than the association screening coefficient as a target test value;
    obtaining the column code values in the sample feature matrix that correspond to the target test values and combining them as the association column information corresponding to the association screening coefficient.
  16. A computer-readable storage medium storing a computer program, wherein the following operations are performed when the computer program is executed by a processor:
    if input initial sample data is received, converting the initial sample data according to a preset data conversion rule to obtain a corresponding sample feature matrix and sample detection result matrix;
    performing feature analysis on each column of sample data in the sample feature matrix according to a preset sample feature analysis rule and the sample detection result matrix, to obtain a feature distribution value corresponding to each column of the sample data;
    performing distribution statistics on the feature distribution values to obtain a composite test value corresponding to each column of the sample data;
    screening out from the sample feature matrix, according to the composite test values, association column information corresponding to a preset association screening coefficient.
  17. The computer-readable storage medium according to claim 16, wherein the data conversion rule comprises sample data mapping information and detection result mapping information, and the converting of the initial sample data according to the preset data conversion rule to obtain the corresponding sample feature matrix and sample detection result matrix comprises:
    mapping the sample feature data of each sample in the initial sample data according to the sample data mapping information to obtain the corresponding sample feature matrix;
    mapping the detection result of each sample in the initial sample data according to the detection result mapping information to obtain the corresponding sample detection result matrix.
  18. The computer-readable storage medium according to claim 16, wherein the sample feature analysis rule comprises a latent variable calculation formula and a feature calculation formula, and the performing of feature analysis on each column of sample data in the sample feature matrix according to the preset sample feature analysis rule and the sample detection result matrix, to obtain the feature distribution value corresponding to each column of the sample data, comprises:
    calculating on the sample feature matrix according to the latent variable calculation formula to obtain a corresponding latent variable matrix;
    calculating on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data.
  19. The computer-readable storage medium according to claim 18, wherein the calculating on the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data, comprises:
    calculating on the number of rows of the sample feature matrix and the number of columns of the latent variable matrix according to a degrees-of-freedom calculation formula in the feature calculation formula to obtain a corresponding degrees-of-freedom value;
    performing an inverse operation on the latent variable matrix, the sample detection result matrix, and the sample feature matrix according to a block matrix formula in the feature calculation formula to obtain an estimated value corresponding to each column of the sample data;
    calculating on the degrees-of-freedom value, the estimated value corresponding to each column of the sample data, the latent variable matrix, the sample detection result matrix, and each column of sample data in the sample feature matrix according to a distribution value calculation formula in the feature calculation formula, to obtain the feature distribution value corresponding to each column of the sample data.
  20. The computer-readable storage medium according to claim 16, wherein the performing of distribution statistics on the feature distribution values to obtain the composite test value corresponding to each column of the sample data comprises:
    performing extreme value distribution statistics on the feature distribution values corresponding to each column of the sample data to obtain feature distribution value statistical information for each column of the sample data;
    obtaining, according to a preset test value data table, the composite test value corresponding to the statistical form of each piece of feature distribution value statistical information.
PCT/CN2021/124577 2021-09-30 2021-10-19 Data association feature analysis method and apparatus, and device and medium WO2023050490A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111164594.6 2021-09-30
CN202111164594.6A CN113609204B (en) 2021-09-30 2021-09-30 Data association characteristic analysis method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023050490A1 (en) 2023-04-06

Family

ID=78343317

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124577 WO2023050490A1 (en) 2021-09-30 2021-10-19 Data association feature analysis method and apparatus, and device and medium

Country Status (2)

Country Link
CN (1) CN113609204B (en)
WO (1) WO2023050490A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120215458A1 (en) * 2009-07-14 2012-08-23 Board Of Regents, The University Of Texas System Orthologous Phenotypes and Non-Obvious Human Disease Models
CN108567418A (en) * 2018-05-17 2018-09-25 陕西师范大学 A kind of pulse signal inferior health detection method and detecting system based on PCANet
CN110674104A (en) * 2019-08-15 2020-01-10 中国平安人寿保险股份有限公司 Feature combination screening method and device, computer equipment and storage medium
CN113035275A (en) * 2021-04-22 2021-06-25 广东技术师范大学 Feature extraction method for tumor gene point mutation by combining contour coefficient and RJMCMC algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2963421A1 (en) * 2014-07-01 2016-01-06 SeNostic GmbH Process for diagnosis of neurodegenerative diseases
CN106354794A (en) * 2016-08-26 2017-01-25 成都汉康信息产业有限公司 Data analysis and processing system
CN111383717A (en) * 2018-12-29 2020-07-07 北京安诺优达医学检验实验室有限公司 Method and system for constructing biological information analysis reference data set

Also Published As

Publication number Publication date
CN113609204B (en) 2021-12-24
CN113609204A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
McManus et al. Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans
Niroula et al. PON-P2: prediction method for fast and reliable identification of harmful variants
Sun et al. Differential expression analysis for RNAseq using Poisson mixed models
Murray et al. kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity
Verbist et al. VirVarSeq: a low-frequency virus variant detection pipeline for Illumina sequencing using adaptive base-calling accuracy filtering
Cai et al. DeepSV: accurate calling of genomic deletions from high-throughput sequencing data using deep convolutional neural network
WO2020113673A1 (en) Cancer subtype classification method employing multiomics integration
CN112117011A (en) Infectious disease early risk early warning method and device based on artificial intelligence
CN111883223B (en) Report interpretation method and system for structural variation in patient sample data
Muñoz et al. Genome update of the dimorphic human pathogenic fungi causing paracoccidioidomycosis
Karamichalis et al. An investigation into inter-and intragenomic variations of graphic genomic signatures
Cheung et al. Prediction of biogeographical ancestry from genotype: a comparison of classifiers
WO2021098615A1 (en) Filling method and device for genotype data missing, and server
CN115204183A (en) Knowledge enhancement based dual-channel emotion analysis method, device and equipment
Makowski et al. Mutational analysis of SARS-CoV-2 variants of concern reveals key tradeoffs between receptor affinity and antibody escape
Liao et al. ROC curve analysis in the presence of imperfect reference standards
WO2019242445A1 (en) Detection method, device, computer equipment and storage medium of pathogen operation group
WO2023050490A1 (en) Data association feature analysis method and apparatus, and device and medium
Sengupta et al. Performance and accuracy evaluation of reference panels for genotype imputation in sub-Saharan African populations
CN113590603A (en) Data processing method, device, equipment and medium based on intelligent selection of data source
He et al. Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences
CN116525108A (en) SNP data-based prediction method, device, equipment and storage medium
WO2022258077A2 (en) Remote sensing image feature discretization method and apparatus based on ii-type fuzzy rough model, storage medium, and computer device.
CN113611358A (en) Sample pathogenic bacteria typing method and system
CN114004802A (en) Data labeling method and device based on fuzzy comprehensive evaluation method and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21959054

Country of ref document: EP

Kind code of ref document: A1