CN115148284A - Pre-processing method and system of gene data - Google Patents

Pre-processing method and system of gene data

Info

Publication number
CN115148284A
Authority
CN
China
Prior art keywords
data
gene
incidence relation
attribute data
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210734462.0A
Other languages
Chinese (zh)
Other versions
CN115148284B (en)
Inventor
石传煜
刘晓明
王冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Manzhiyan Bio Technology Co ltd
Original Assignee
Manzhiyan Bio Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Manzhiyan Bio Technology Co ltd
Priority to CN202210734462.0A
Publication of CN115148284A
Application granted
Publication of CN115148284B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40: Population genetics; Linkage disequilibrium
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Ecology (AREA)
  • Artificial Intelligence (AREA)
  • Physiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a system for preprocessing gene data; wherein the method comprises the following steps: determining attribute data of the gene detection item; determining associated gene data corresponding to the gene detection items according to the attribute data and a preset association relation; and compressing the original gene detection data according to the associated gene data to obtain target gene detection data. According to the invention, the associated gene data required by the corresponding gene detection item is accurately positioned through the attribute data of the gene detection item and the preset association relation, so that the compression of the original gene detection data is realized, the unnecessary encryption, transmission, analysis and the like of the gene data can be reduced, and the transmission cost and the analysis processing cost of the gene data can be effectively reduced.

Description

Pre-processing method and system of gene data
Technical Field
The invention relates to the technical field of gene detection, in particular to a gene data preprocessing method, a gene data preprocessing system, electronic equipment and a computer storage medium.
Background
A gene is the basic unit of heredity: a DNA or RNA sequence that carries genetic information, passes that information to the next generation through replication, and directs protein synthesis to express it, thereby controlling the traits of an individual organism. Gene detection is a technology for examining DNA obtained from blood, other body fluids, or cells. Peripheral venous blood or other tissue cells are taken from the person being tested, the genetic material is amplified, and the DNA molecular information in the cells is read by a specific device. Analyzing the types of genes present, their defects, and whether they are expressed normally lets people understand their own genetic information and helps determine the cause of a disease or predict the risk of developing a certain disease. Gene detection can therefore be used both to diagnose disease and to predict disease risk. Disease diagnosis uses gene detection techniques to detect the mutated gene that causes a genetic disease. The most widely used gene tests are the screening of newborns for hereditary diseases, the diagnosis of hereditary diseases, and the auxiliary diagnosis of some common diseases.
Gene detection is typically performed with distributed sampling and centralized data analysis. For example, a user collects a gene sample and sends it to a regional site; the regional site performs detection on the sample and transmits the resulting gene data over a communication network to a detection and analysis center for targeted analysis. However, the original gene data obtained by detection is very large and includes much gene data that the detection item does not need, which makes subsequent analysis difficult and data transmission costly. Improvement is therefore needed.
Disclosure of Invention
In order to solve at least the technical problems in the background art, the invention provides a method, a system, an electronic device and a computer storage medium for preprocessing gene data.
The first aspect of the present invention provides a method for preprocessing gene data, comprising the steps of:
determining attribute data of the gene detection item;
determining associated gene data corresponding to the gene detection items according to the attribute data and a preset association relation;
and compressing the original gene detection data according to the associated gene data to obtain target gene detection data.
Further, the determining of the attribute data of the genetic testing items includes:
receiving a gene detection instruction input by a field end or a background end, and determining attribute data of a gene detection item according to the gene detection instruction;
or,
and acquiring the attribute data of other site ends in the same geographical range, and determining the attribute data of the gene detection item according to all the attribute data.
Further, the determining associated gene data corresponding to the gene detection item according to the attribute data and a preset association relationship includes:
matching calculation is carried out on the attribute data and the preset incidence relation to obtain a plurality of first incidence relations;
combining the first incidence relations to obtain a second incidence relation;
and determining related gene data corresponding to the gene detection items according to the second association relation.
Further, the performing matching calculation on the attribute data and the preset association relationship to obtain a plurality of first association relationships includes:
calculating the matching degree of the attribute data and the preset incidence relation by adopting the following formula:
[Formula image: Figure BDA0003714722710000021]
in the formula, $s_i$ represents the matching degree between the attribute data and the i-th preset association relation in the database; $x_j$ represents the j-th sub-matrix of the first gene data matrix corresponding to the attribute data; the symbol shown in Figure BDA0003714722710000022 represents the j-th sub-matrix of the data matrix obtained by fusing the first gene data matrix corresponding to the attribute data with the second gene data matrix corresponding to the preset association relation; $N$ represents the number of sub-matrices; and $\sigma^2$ represents the variance of the fused data matrix;
and taking the preset incidence relation with the matching degree larger than or equal to a first threshold value as the first incidence relation.
Further, the preset association relationship is obtained by:
determining an initial association relationship;
acquiring big data for incidence relation analysis, inputting the big data for incidence relation analysis into a deep analysis model, and outputting incidence relation for correction by the deep analysis model;
and judging whether the attribute data meet a first preset condition, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
Further, if the attribute data satisfies a second preset condition, then:
retrieving the big data for association relation analysis that corresponds to the association relation for correction;
performing heat analysis on the big data for incidence relation analysis according to a time dimension and a quantity dimension to determine a heat value;
and judging whether the heat value is greater than or equal to a third threshold value, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
Further, the performing heat analysis on the big data for association analysis according to the time dimension and the quantity dimension to determine a heat value includes:
if the ratio of the quantity dimension to the time dimension is greater than or equal to a fourth threshold, the heat value is positively correlated with the ratio based on a first proportion;
if the ratio of the quantity dimension to the time dimension is less than the fourth threshold and greater than or equal to a fifth threshold, the heat value is positively correlated with the ratio based on a second proportion;
if the ratio of the quantity dimension to the time dimension is less than the fifth threshold, the heat value is positively correlated with the ratio based on a third proportion;
wherein the first proportion, the second proportion and the third proportion decrease in turn.
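The tiered rule above can be sketched as a piecewise-linear function. This is only an illustration under stated assumptions: the text requires positive correlation and decreasing proportions but gives no formula, so the linear form, the function name `heat_value`, and all default threshold and proportion values below are hypothetical. The quantity dimension is read here as a count of items and the time dimension as the length of the period they span.

```python
def heat_value(quantity: float, duration: float,
               fourth_threshold: float = 10.0,
               fifth_threshold: float = 2.0,
               proportions: tuple = (3.0, 2.0, 1.0)) -> float:
    """Map the quantity/time ratio to a heat value using a tiered proportion.

    All default numbers are illustrative assumptions, not values from the text.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    first, second, third = proportions
    if not first > second > third > 0:
        raise ValueError("proportions must be positive and decrease in turn")
    ratio = quantity / duration
    if ratio >= fourth_threshold:
        return first * ratio   # highest tier: first proportion
    if ratio >= fifth_threshold:
        return second * ratio  # middle tier: second proportion
    return third * ratio       # lowest tier: third proportion
```

With these assumed defaults, a ratio in the top tier is weighted three times as heavily as one in the bottom tier, so a burst of many items in a short period yields a disproportionately high heat value.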
A second aspect of the present invention provides a gene data preprocessing system, which comprises an acquisition module, a processing module and a storage module; the processing module is connected with the acquisition module and the storage module;
the storage module is used for storing executable computer program codes;
the acquisition module is used for acquiring original gene detection data, attribute data of gene detection items and a preset incidence relation and transmitting the data to the processing module;
the processing module is configured to execute the method according to any one of the preceding claims by calling the executable computer program code in the storage module.
A third aspect of the present invention provides an electronic device comprising: a memory storing executable program code; a processor coupled with the memory; the processor calls the executable program code stored in the memory to perform the method of any of the preceding claims.
A fourth aspect of the invention provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs a method as set forth in any one of the preceding claims.
According to the scheme, the associated gene data required by the corresponding gene detection item is accurately positioned through the attribute data of the gene detection item and the preset association relation, so that the compression of the original gene detection data is realized, unnecessary encryption, transmission, analysis and the like of the gene data can be reduced, and the transmission cost and the analysis processing cost of the gene data can be effectively reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic flow chart of a method for preprocessing gene data according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a gene data preprocessing system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
It should be understood that, although the terms first, second, third, etc. may be used in the embodiments of the present application to describe …, these … should not be limited by these terms. These terms are used only to distinguish one … from another. For example, without departing from the scope of the embodiments of the present application, a first … may also be referred to as a second …, and similarly, a second … may also be referred to as a first ….
The word "if" as used herein may be interpreted as "when", "upon", "in response to a determination", or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such article or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the article or system comprising the element.
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for preprocessing gene data according to an embodiment of the present invention. As shown in fig. 1, a method for preprocessing gene data according to an embodiment of the present invention includes the following steps:
determining attribute data of the gene detection item;
determining associated gene data corresponding to the gene detection items according to the attribute data and a preset association relation;
and compressing the original gene detection data according to the associated gene data to obtain target gene detection data.
In the embodiment of the invention, gene detection in the prior art comprises sample collection, gene data extraction and gene analysis. Sample collection and gene data extraction may be carried out separately or together. Specifically, a distributed detection device may complete both sample collection and gene data extraction on its own and then transmit the extracted gene data to the detection and analysis center at the background end. Alternatively, the distributed detection device completes only sample collection, and the sample is then transported to a regional site (for example, a customer collects the sample and mails it, or a professional collects samples and transports them to the regional site together), where the gene data is extracted and then transmitted to the detection and analysis center. In either case, once the gene detection data is obtained, the complete data is transmitted directly to the detection and analysis center. However, the original gene data is very large and contains much gene data that the detection item does not need, which makes subsequent gene data analysis difficult and data transmission costly. This disadvantage is especially pronounced when a large volume of gene detection data must be transmitted.
In view of the above, the invention designs the gene data preprocessing method, which accurately locates the associated gene data required by the corresponding gene detection item through the attribute data of the gene detection item and the preset association relationship, thereby realizing the compression of the original gene detection data, reducing the unnecessary encryption, transmission, analysis and the like of the gene data, and effectively reducing the transmission cost of the gene data and the analysis processing cost of the detection and analysis center.
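The three steps of the method can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: "compression" is modeled simply as dropping records for gene regions outside the associated set, and the table `PRESET_RELATIONS`, the item names, the gene identifiers and the record layout are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical preset association relations: detection item -> gene regions it needs.
PRESET_RELATIONS: Dict[str, Set[str]] = {
    "item_x": {"GENE_A", "GENE_B"},
    "item_y": {"GENE_C"},
}

def associated_genes(attribute_data: Dict[str, str]) -> Set[str]:
    """Step 2: look up the gene regions associated with the detection item."""
    return PRESET_RELATIONS.get(attribute_data["item"], set())

def compress(raw: List[Dict[str, str]], keep: Set[str]) -> List[Dict[str, str]]:
    """Step 3: keep only the records whose gene region is in the associated set."""
    return [record for record in raw if record["gene"] in keep]

def preprocess(attribute_data: Dict[str, str],
               raw: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Steps 1-3: attribute data in, target gene detection data out."""
    return compress(raw, associated_genes(attribute_data))

raw_data = [
    {"gene": "GENE_A", "seq": "ACGT"},
    {"gene": "GENE_C", "seq": "TTGA"},
    {"gene": "GENE_B", "seq": "GGCA"},
]
target = preprocess({"item": "item_x"}, raw_data)
```

Only the target data would then need to be encrypted and transmitted, which is where the reduction in transmission and analysis cost comes from.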
It should be noted that the field end and the detection and analysis center communicate over a network to transmit the gene detection data. By way of example and not limitation, the network may comprise an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), a portion of the Internet, a portion of the public switched telephone network (PSTN), a mobile telephone network, an integrated services digital network (ISDN), LTE (Long Term Evolution), CDMA (code division multiple access), Bluetooth, satellite communication, and the like, or a combination of two or more of these.
Before the target gene detection data is transmitted to the detection and analysis center, it is encrypted. The algorithm used for the encryption processing may be an MD5 algorithm, an SHA1 algorithm, an HMAC algorithm, an AES/DES/3DES algorithm, an RSA algorithm, an ECC algorithm, and the like, which is not limited in the present invention.
Further, the determining of the attribute data of the genetic testing items includes:
receiving a gene detection instruction input by a field end or a background end, and determining attribute data of a gene detection item according to the gene detection instruction;
or,
and acquiring the attribute data of other site ends in the same geographical range, and determining the attribute data of the gene detection item according to all the attribute data.
In the embodiment of the present invention, the field end refers to a detection device at a gene detection site or a regional site. The detection device at a gene detection site may be a portable gene detection kit (with which a customer can collect a gene sample unaided) or a gene sampling and detection device operated by a professional. The gene detection item information may be entered manually by the sampling and detection personnel or sent by the background end (such as a server or the detection and analysis center), so that the field end knows the gene detection item and automatic gene data extraction from the sample to be analyzed is triggered. In addition, gene detection is sometimes carried out in a centralized batch. In that case, a field end that is about to be used (one that has just been started, or a new device) can acquire the attribute data received by other field ends already in use, analyze their gene detection items, and determine its own attribute data accordingly. For example, if all other field ends have the same attribute data, it can be concluded that the detection is centralized and the detection items are the same, so the attribute data is the same as well. The field end can also automatically present this attribute data to the operator for confirmation, which reduces the operator's input burden.
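The second option, inferring attribute data from other field ends in the same geographical range, might look like the sketch below. The function name `infer_attribute_data` is hypothetical, and falling back to the most common value when the peers disagree is an assumption; the text only describes the unanimous case and operator confirmation.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple

def infer_attribute_data(
    peer_attributes: Sequence[Hashable],
) -> Tuple[Optional[Hashable], bool]:
    """Return (suggested attribute data, whether all peers agree).

    A unanimous result can be adopted directly; otherwise the suggestion
    should be shown to the operator for confirmation.
    """
    if not peer_attributes:
        return None, False
    value, freq = Counter(peer_attributes).most_common(1)[0]
    return value, freq == len(peer_attributes)
```

For example, `infer_attribute_data(["panel_a", "panel_a", "panel_a"])` yields a unanimous suggestion, while a mixed list yields the majority value flagged as needing confirmation.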
Further, the determining associated gene data corresponding to the gene detection item according to the attribute data and a preset association relationship includes:
matching calculation is carried out on the attribute data and the preset incidence relation to obtain a plurality of first incidence relations;
combining a plurality of first incidence relations to obtain a second incidence relation;
and determining related gene data corresponding to the gene detection items according to the second association relation.
In the embodiment of the invention, a database of association relations is established in advance, and the association relations corresponding to various gene detection items are stored in it. The matching degree between the attribute data of a given gene detection item and each preset association relation in the database is calculated to obtain a plurality of first association relations, and these are merged to obtain a second association relation. The associated gene data corresponding to the second association relation is in fact somewhat larger than the gene data strictly required by the gene detection item, but the excess remains within a reasonable range. Through matching calculation and merging, the invention thus obtains the associated gene data accurately while keeping the data volume within an acceptable range.
It should be noted that, the attribute data and the preset incidence relation are both configured with data matrixes correspondingly, and the data matrixes are generated based on the related gene sequence data, so that the matching calculation of the attribute data and the preset incidence relation is to calculate the similarity of the two data matrixes in terms of factors such as gene sequence type, gene sequence data amount and the like.
Further, the performing matching calculation on the attribute data and the preset association relationship to obtain a plurality of first association relationships includes:
calculating the matching degree of the attribute data and the preset incidence relation by adopting the following formula:
[Formula image: Figure BDA0003714722710000091]
in the formula, $s_i$ represents the matching degree between the attribute data and the i-th preset association relation in the database; $x_j$ represents the j-th sub-matrix of the first gene data matrix corresponding to the attribute data; the symbol shown in Figure BDA0003714722710000092 represents the j-th sub-matrix of the data matrix obtained by fusing the first gene data matrix corresponding to the attribute data with the second gene data matrix corresponding to the preset association relation; $N$ represents the number of sub-matrices; and $\sigma^2$ represents the variance of the fused data matrix;
and taking the preset incidence relation with the matching degree larger than or equal to a first threshold value as the first incidence relation.
In the embodiment of the present invention, the data matrix of the attribute data is fused with the data matrix of each preset association relation stored in the database, for example by averaging the data at corresponding positions of the two matrices. The degree of difference between the data matrix corresponding to the attribute data and the fused data matrix is then calculated according to the above formula to determine the matching degree between the two, and each association relation whose matching degree is greater than or equal to the first threshold is determined to be a first association relation.
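The fuse-then-compare procedure can be sketched as follows. The formula itself survives only as an image in this record, so the scoring function here is an assumption: a Gaussian-kernel score built from the quantities the text defines (the sub-matrices, the fused sub-matrices, their number N, and the variance of the fused matrix). Treating each matrix row as one sub-matrix, and the names `fuse`, `matching_degree` and `first_relations`, are also hypothetical.

```python
import math
from statistics import pvariance
from typing import Dict, List

Matrix = List[List[float]]

def fuse(a: Matrix, b: Matrix) -> Matrix:
    """Fuse two equally shaped matrices by averaging corresponding entries."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matching_degree(first: Matrix, second: Matrix) -> float:
    """Assumed score in [0, 1]; 1.0 means the two matrices are identical."""
    fused = fuse(first, second)
    sigma2 = pvariance([v for row in fused for v in row])
    if sigma2 == 0:
        # Constant fused matrix: identical inputs match fully, others not at all.
        return 1.0 if first == fused else 0.0
    total = 0.0
    for x_row, f_row in zip(first, fused):
        # Mean squared difference between sub-matrix j and its fused counterpart.
        d2 = sum((x - f) ** 2 for x, f in zip(x_row, f_row)) / len(x_row)
        total += math.exp(-d2 / (2 * sigma2))
    return total / len(first)

def first_relations(attr: Matrix, presets: Dict[str, Matrix],
                    first_threshold: float) -> List[str]:
    """Keep the preset relations whose matching degree reaches the threshold."""
    return [name for name, matrix in presets.items()
            if matching_degree(attr, matrix) >= first_threshold]
```

Normalizing by the variance of the fused matrix makes the score insensitive to the overall scale of the data, which is one plausible reason for the variance term in the original formula.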
In this case, the following further development is proposed:
the taking the preset incidence relation with the matching degree larger than or equal to a first threshold as the first incidence relation comprises:
performing intersection calculation on each preset incidence relation with the matching degree larger than or equal to a first threshold value to obtain the number of incidence relations in the intersection;
determining a second threshold value according to the number, and taking the first association relation of which the matching degree is greater than or equal to the second threshold value as the final first association relation;
wherein the second threshold is positively correlated with the quantity.
In the improved scheme, the preliminarily determined first association relations are screened a second time using a second threshold. Specifically, when the intersection of the preliminarily determined first association relations contains few association relations, the first association relations are widely dispersed; the matching degree of the attribute data with any single first association relation is then not high enough to be judged accurately, so the secondary screening should use a smaller second threshold, or be skipped altogether, to avoid omitting association relations and hence associated gene data. Conversely, when the intersection contains many association relations, the first association relations are concentrated and the matching degree of the attribute data with each of them is high, so a larger second threshold can be used for the secondary screening, which improves the accuracy of the determined association relations and reduces useless gene data.
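A sketch of this secondary screening is given below. It assumes each preset association relation can be represented as a set of gene identifiers, so that the intersection and its size are well defined, and it assumes a linear, capped rule for the second threshold; the text only requires that the second threshold be positively correlated with the intersection size. The names `second_screening`, `step` and `cap` are hypothetical.

```python
from typing import Dict, Set, Tuple

def second_screening(candidates: Dict[str, Tuple[float, Set[str]]],
                     first_threshold: float,
                     step: float = 0.02,
                     cap: float = 0.95) -> Dict[str, float]:
    """candidates maps relation name -> (matching degree, gene identifiers).

    Returns the final first association relations with their matching degrees.
    """
    # Preliminary screening with the first threshold.
    prelim = {name: (score, genes)
              for name, (score, genes) in candidates.items()
              if score >= first_threshold}
    if not prelim:
        return {}
    # Size of the intersection measures how concentrated the relations are.
    common = set.intersection(*(genes for _, genes in prelim.values()))
    # Assumed rule: the second threshold grows linearly with the overlap.
    second_threshold = min(cap, first_threshold + step * len(common))
    return {name: score for name, (score, _) in prelim.items()
            if score >= second_threshold}
```

When the intersection is empty, the second threshold equals the first and the screening changes nothing, which matches the option of skipping the secondary screening for dispersed relations.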
Further, the preset association relationship is obtained by:
determining an initial association relationship;
acquiring big data for incidence relation analysis, inputting the big data for incidence relation analysis into a deep analysis model, and outputting incidence relation for correction by the deep analysis model;
and judging whether the attribute data meet a first preset condition, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
In the embodiment of the present invention, a deep analysis model is used to analyze changes in the association relation between gene detection items and gene data, which reflect changes in the gene data required by a specific gene detection item. For example, the initial association relation represents the association between a mature, standardized gene detection item and the gene data it requires, while the association relation for correction represents an association between the gene detection item and required gene data that is cutting-edge, new, or not yet generally adopted. To handle such changes, the invention analyzes whether the attribute data meets the first preset condition and thereby decides whether the preset association relation is determined from both the initial association relation and the association relation for correction.
Examples are as follows:
The attribute data may include purpose information such as "for detection" or "for scientific research", and correspondingly the first preset condition may be "for scientific research". If "for scientific research" is extracted from the attribute data, it is determined that the associated gene data for the specific gene detection item should be obtained based on the latest association relations, and the initial association relation and the association relation for correction may be merged (for example, by taking their union). The resulting associated gene data covers a wider variety of gene sequences/fragments, which helps researchers assemble the sequences/fragments they need. If "for detection" is extracted from the attribute data, it is determined that the specific gene detection item follows the most mature standard operating specification, so the initial association relation can be used directly as the preset association relation, which improves the standardization and reliability of the gene detection result and also reduces the data volume.
It should be noted that the big data for association relation analysis may be obtained by crawling, at a preset period, internet data related to a specific gene detection item, such as academic journals, news reports and public articles. New association relations not included in the initial association relation, i.e. the association relations for correction, are extracted from this data. For example, a professional scientific journal may report that analyzing a certain gene sequence can improve the reliability of a specific gene detection item. These new association relations represent the technological frontier of gene detection; providing the related gene data is useful to researchers but clearly not to clinical testers.
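The purpose-based choice in this example can be sketched as below. Association relations are modeled as sets of gene identifiers and merging as set union, following the "union" suggestion in the text; the `purpose` key, the constant `RESEARCH` and the function name are hypothetical.

```python
from typing import Dict, Set

RESEARCH = "for scientific research"  # the first preset condition in this example

def preset_relation(attribute_data: Dict[str, str],
                    initial: Set[str],
                    correction: Set[str]) -> Set[str]:
    """Choose the preset association relation from the stated purpose."""
    if attribute_data.get("purpose") == RESEARCH:
        # Research use: merge the mature and the newly reported relations.
        return initial | correction
    # Otherwise (e.g. "for detection"): keep only the mature, standard relation.
    return set(initial)
```

Returning a copy in the detection branch keeps the stored initial relation from being modified by later processing.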
The deep analysis model can be constructed according to a neural network algorithm, and the construction and training modes of the model are not described herein.
Further, if the attribute data satisfies a second preset condition, then:
retrieving the big data for incidence relation analysis that corresponds to the incidence relation for correction;
performing heat analysis on the big data for incidence relation analysis according to a time dimension and a quantity dimension to determine a heat value;
and judging whether the heat value is greater than or equal to a third threshold value, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
In this embodiment of the present invention, the attribute data of the specific gene detection item may, for various reasons, contain neither "for detection" nor "for scientific research" (that is, the second preset condition is satisfied). In this case, to ensure that the associated gene data is sufficient, the invention treats the attribute of the specific gene detection item as "for scientific research" by default. The big data for incidence relation analysis that corresponds to the new incidence relation for correction is then analyzed along the time dimension and the quantity dimension to determine the heat value of the new incidence relation. For example, if the analysis finds that the heat of the new incidence relation has risen sharply in the recent period, or has remained high over a longer period, the incidence relation has higher potential scientific research value and is more likely to be used by researchers; the initial incidence relation and the incidence relation for correction can then be merged to obtain the preset incidence relation. Otherwise, the new incidence relation has not been recognized by other researchers in the field, its scientific research value is low, and it is unlikely to be used by researchers, so the initial incidence relation can still be used.
It should be noted that, regardless of whether the attribute data satisfies the first or the second preset condition, the determined associated gene data may be sent as a preview to an operator at the field end or to relevant personnel at the detection and analysis center in order to ensure its accuracy, and the foregoing compression processing may be performed on the original gene detection data after confirmation feedback is received.
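The fallback described here, for attribute data that carries no purpose information, might be sketched as follows. This is only an illustration: incidence relations are modelled as sets of gene identifiers, and the function name, the numeric values, and the `GENE_*` identifiers are assumptions rather than values from the patent.

```python
def preset_when_purpose_missing(initial, revised, heat_value, third_threshold):
    """Second preset condition: no purpose information in the attribute data.

    Assumption: relations are sets of gene identifiers and heat_value has
    already been computed from the big data for incidence relation analysis.
    """
    if heat_value >= third_threshold:
        # The new relation is "hot" enough: merge it with the initial one.
        return set(initial) | set(revised)
    # Otherwise fall back to the initial relation only.
    return set(initial)


hot = preset_when_purpose_missing({"GENE_A"}, {"GENE_C"}, heat_value=8.0, third_threshold=5.0)
cold = preset_when_purpose_missing({"GENE_A"}, {"GENE_C"}, heat_value=2.0, third_threshold=5.0)
```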
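A possible shape for this confirm-then-compress step is sketched below. The sketch assumes that the original gene detection data maps gene identifiers to sequence strings, that compression can be approximated by dropping genes outside the associated gene data, and that a `confirm` callback stands in for the operator's confirmation feedback; all of these names are hypothetical.

```python
def compress_after_confirmation(raw_data, associated_genes, confirm):
    """Filter the raw data only after the preview has been confirmed.

    Assumptions: raw_data maps gene identifiers to sequence strings,
    "compression" is modelled as dropping non-associated genes, and
    confirm is a hypothetical callback for the confirmation feedback.
    """
    preview = sorted(associated_genes)
    if not confirm(preview):
        return None  # no confirmation: the raw data is left untouched
    return {gene: seq for gene, seq in raw_data.items() if gene in associated_genes}


raw = {"GENE_A": "ACGT", "GENE_B": "GGCC", "GENE_X": "TTAA"}
confirmed = compress_after_confirmation(raw, {"GENE_A", "GENE_B"}, lambda preview: True)
rejected = compress_after_confirmation(raw, {"GENE_A", "GENE_B"}, lambda preview: False)
```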
Further, the performing heat analysis on the big data for association analysis according to the time dimension and the quantity dimension to determine a heat value includes:
if the ratio of the quantity dimension to the time dimension is greater than or equal to a fourth threshold, the heat value is positively correlated with the ratio based on a first proportion;
if the ratio of the quantity dimension to the time dimension is less than the fourth threshold and greater than or equal to a fifth threshold, the heat value is positively correlated with the ratio based on a second proportion;
if the ratio of the quantity dimension to the time dimension is less than the fifth threshold, the heat value is positively correlated with the ratio based on a third proportion;
wherein the first proportion, the second proportion and the third proportion decrease in turn.
In this embodiment of the present invention, three cases can be distinguished when analyzing the heat of a new incidence relation for correction: 1) a concentrated outbreak; 2) slow warming; 3) a stable state. These three cases correspond to the three ways of determining the heat value above. In the first case, the new incidence relation for correction has been recognized or has made major progress, so the related research has become a hot topic; the probability that researchers on the gene detection project will use the associated gene data corresponding to the new incidence relation is high, and the larger first proportion is preferably used to determine the heat value. In the second case, the new incidence relation is gradually being recognized or is progressing; the probability that researchers will use the corresponding associated gene data is relatively high, and the intermediate second proportion is preferably used. In the third case, the new incidence relation has not been widely recognized and has remained in a stable state; the probability that researchers will use the corresponding associated gene data is the lowest, and the smaller third proportion is preferably used.
It should be noted that preconditions may also be set for the processing manners corresponding to the above three cases. For example, a processing manner is executed only after the quantity dimension reaches a certain value; otherwise, the data amount may be judged insufficient, and the accuracy of the analysis result is difficult to guarantee.
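The three-tier heat value and the minimum-quantity precondition can be illustrated with a short Python sketch. The patent only states that the heat value is positively correlated with the ratio based on a proportion; the linear scaling used here, the default proportions, the thresholds, and the `min_count` parameter are all assumptions made for illustration.

```python
def heat_value(count, period_days, fourth_threshold, fifth_threshold,
               proportions=(1.5, 1.0, 0.5), min_count=10):
    """Piecewise heat value from the quantity and time dimensions.

    Assumptions: "positively correlated based on a proportion" is modelled
    as a linear factor, and all numeric defaults are illustrative only.
    """
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    if count < min_count:
        return None  # precondition not met: too little data to analyze
    ratio = count / period_days
    first, second, third = proportions  # must decrease in turn
    if ratio >= fourth_threshold:
        return first * ratio    # case 1: concentrated outbreak
    if ratio >= fifth_threshold:
        return second * ratio   # case 2: slow warming
    return third * ratio        # case 3: stable state


burst = heat_value(100, 10, fourth_threshold=5.0, fifth_threshold=1.0)
warming = heat_value(30, 10, fourth_threshold=5.0, fifth_threshold=1.0)
stable = heat_value(12, 30, fourth_threshold=5.0, fifth_threshold=1.0)
too_few = heat_value(5, 10, fourth_threshold=5.0, fifth_threshold=1.0)
```

The resulting value would then be compared with the third threshold to decide whether the incidence relation for correction is merged into the preset incidence relation.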
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a pre-processing system for gene data according to an embodiment of the present invention. As shown in fig. 2, the pre-processing system for gene data according to the embodiment of the present invention includes an acquisition module (101), a processing module (102), and a storage module (103); the processing module (102) is connected with the acquisition module (101) and the storage module (103);
the storage module (103) for storing executable computer program code;
the acquisition module (101) is used for acquiring original gene detection data, attribute data of gene detection items and a preset incidence relation and transmitting the data to the processing module (102);
the processing module (102) is configured to execute the method according to the first embodiment by calling the executable computer program code in the storage module (103).
For the specific functions of the pre-processing system for gene data in this embodiment, refer to the first embodiment. Because the system in this embodiment adopts all the technical solutions of the first embodiment, it achieves at least all the beneficial effects of those technical solutions, which are not described again here.
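As a rough illustration of how the three modules relate, the following Python sketch models the storage module as a registry of callables, the acquisition module as a holder of the three inputs, and the processing module as the component that looks up and runs the stored code. The class names, the `list_associated` program, and the sample data are hypothetical; the patent does not prescribe any particular implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class StorageModule:
    """Holds the executable code (modelled here as plain Python callables)."""
    programs: Dict[str, Callable] = field(default_factory=dict)


@dataclass
class AcquisitionModule:
    """Supplies the raw data, the attribute data and the preset relation."""
    raw_data: dict
    attribute_data: dict
    preset_relation: set

    def acquire(self):
        return self.raw_data, self.attribute_data, self.preset_relation


@dataclass
class ProcessingModule:
    """Connected to both other modules; runs a stored program on the inputs."""
    acquisition: AcquisitionModule
    storage: StorageModule

    def run(self, program_name):
        raw, attrs, preset = self.acquisition.acquire()
        return self.storage.programs[program_name](raw, attrs, preset)


def list_associated(raw, attrs, preset):
    # Hypothetical stored program: names of raw genes covered by the relation.
    return sorted(gene for gene in raw if gene in preset)


storage = StorageModule({"preprocess": list_associated})
acquisition = AcquisitionModule({"GENE_A": "ACGT", "GENE_X": "TTAA"},
                                {"purpose": "for detection"}, {"GENE_A"})
result = ProcessingModule(acquisition, storage).run("preprocess")
```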
Example three
Referring to fig. 3, fig. 3 shows an electronic device according to an embodiment of the present invention, including: a memory storing executable program code; a processor coupled with the memory; the processor calls the executable program code stored in the memory to execute the method according to the first embodiment.
Example four
The embodiment of the invention also discloses a computer storage medium, wherein a computer program is stored on the storage medium, and the computer program executes the method in the first embodiment when being executed by a processor.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing describes only preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the present invention.

Claims (10)

1. A method for preprocessing gene data is characterized by comprising the following steps:
determining attribute data of the gene detection item;
determining associated gene data corresponding to the gene detection items according to the attribute data and a preset association relation;
and compressing the original gene detection data according to the associated gene data to obtain target gene detection data.
2. The method for preprocessing gene data according to claim 1, wherein: the determining of the attribute data of the genetic testing items comprises the following steps:
receiving a gene detection instruction input by a field end or a background end, and determining attribute data of a gene detection item according to the gene detection instruction;
or,
and acquiring the attribute data of other site ends in the same geographical range, and determining the attribute data of the gene detection item according to all the attribute data.
3. The method for preprocessing gene data according to claim 1 or 2, wherein: determining associated gene data corresponding to the gene detection item according to the attribute data and a preset association relation, wherein the determining comprises the following steps:
matching calculation is carried out on the attribute data and the preset incidence relation to obtain a plurality of first incidence relations;
combining the first incidence relations to obtain a second incidence relation;
and determining associated gene data corresponding to the gene detection item according to the second association relation.
4. The method for preprocessing gene data according to claim 3, wherein: the matching calculation of the attribute data and the preset incidence relation is performed to obtain a plurality of first incidence relations, and the method comprises the following steps:
calculating the matching degree of the attribute data and the preset incidence relation by adopting the following formula:
Figure FDA0003714722700000021
in the formula, s_i represents the matching degree between the attribute data and the i-th preset incidence relation in the database; x_j represents the j-th sub-matrix of the first gene data matrix corresponding to the attribute data;
Figure FDA0003714722700000022
represents the j-th sub-matrix of the data matrix obtained by fusing the first gene data matrix corresponding to the attribute data with the second gene data matrix corresponding to the preset incidence relation; N represents the number of sub-matrices; and σ² represents the variance of the fused data matrix;
and taking the preset incidence relation with the matching degree larger than or equal to a first threshold value as the first incidence relation.
5. The method for preprocessing gene data according to claim 1, 2 or 4, wherein: the preset association relationship is obtained by the following method:
determining an initial association relationship;
acquiring big data for incidence relation analysis, inputting the big data for incidence relation analysis into a deep analysis model, and outputting incidence relation for correction by the deep analysis model;
and judging whether the attribute data meet a first preset condition, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
6. The method for preprocessing gene data according to claim 5, wherein: if the attribute data meet a second preset condition, then:
retrieving the big data for incidence relation analysis that corresponds to the incidence relation for correction;
performing heat analysis on the big data for incidence relation analysis according to the time dimension and the quantity dimension to determine a heat value;
and judging whether the heat value is greater than or equal to a third threshold value, if so, determining the preset incidence relation according to the initial incidence relation and the incidence relation for correction.
7. The method for preprocessing gene data according to claim 6, wherein: the heat degree analysis of the big data for incidence relation analysis according to the time dimension and the quantity dimension to determine the heat degree value comprises the following steps:
if the ratio of the quantity dimension to the time dimension is greater than or equal to a fourth threshold, the heat value is positively correlated with the ratio based on a first proportion;
if the ratio of the quantity dimension to the time dimension is less than the fourth threshold and greater than or equal to a fifth threshold, the heat value is positively correlated with the ratio based on a second proportion;
if the ratio of the quantity dimension to the time dimension is less than the fifth threshold, the heat value is positively correlated with the ratio based on a third proportion;
wherein the first proportion, the second proportion and the third proportion decrease in turn.
8. A gene data preprocessing system comprises an acquisition module, a processing module and a storage module; the processing module is connected with the acquisition module and the storage module;
the storage module is used for storing executable computer program codes;
the acquisition module is used for acquiring original gene detection data, attribute data of gene detection items and a preset incidence relation and transmitting the data to the processing module;
the method is characterized in that: the processing module for performing the method of any one of claims 1-7 by invoking the executable computer program code in the storage module.
9. An electronic device, comprising: a memory storing executable program code; a processor coupled with the memory; the method is characterized in that: the processor calls the executable program code stored in the memory to perform the method of any of claims 1-7.
10. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, performs the method of any one of claims 1-7.
CN202210734462.0A 2022-06-27 2022-06-27 Pre-processing method and system of gene data Active CN115148284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210734462.0A CN115148284B (en) 2022-06-27 2022-06-27 Pre-processing method and system of gene data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210734462.0A CN115148284B (en) 2022-06-27 2022-06-27 Pre-processing method and system of gene data

Publications (2)

Publication Number Publication Date
CN115148284A true CN115148284A (en) 2022-10-04
CN115148284B CN115148284B (en) 2023-03-17

Family

ID=83407628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210734462.0A Active CN115148284B (en) 2022-06-27 2022-06-27 Pre-processing method and system of gene data

Country Status (1)

Country Link
CN (1) CN115148284B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881228A (en) * 2022-10-24 2023-03-31 蔓之研(上海)生物科技有限公司 Gene detection data cleaning method and system based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246409A1 (en) * 2010-04-05 2011-10-06 Indian Statistical Institute Data set dimensionality reduction processes and machines
US20150039538A1 (en) * 2012-06-01 2015-02-05 Mohamed Hefeeda Method for processing a large-scale data set, and associated apparatus
CN111755076A (en) * 2020-07-01 2020-10-09 北京小白世纪网络科技有限公司 Disease prediction method and system based on spatial separability and using gene detection
CN113517022A (en) * 2021-06-10 2021-10-19 阿里巴巴新加坡控股有限公司 Gene detection method, feature extraction method, device, equipment and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘占文: "《基于视觉显著性的图像分割》", 31 March 2019, 西安电子科技大学出版社 *
郁?等: "基于双层耦合网的表型-基因关联分析与预测", 《电子科技大学学报》 *

Also Published As

Publication number Publication date
CN115148284B (en) 2023-03-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Preprocessing Method and System for Gene Data

Effective date of registration: 20230918

Granted publication date: 20230317

Pledgee: Industrial Bank Co.,Ltd. Shanghai Pilot Free Trade Zone Branch

Pledgor: MANZHIYAN BIO-TECHNOLOGY CO.,LTD.

Registration number: Y2023310000546
