CN102156825A - Cancer vaccine trial data encoding and processing method based on data mining - Google Patents


Info

Publication number
CN102156825A
Authority
CN
China
Prior art keywords
cancer vaccine
representing
coding
vaccine test
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110074609XA
Other languages
Chinese (zh)
Inventor
尹云飞
周尚波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201110074609XA
Publication of CN102156825A

Landscapes

  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)

Abstract

The invention discloses a cancer vaccine trial data encoding and processing method based on data mining, belonging to the technical field of bioinformatics. The method mainly involves feature analysis, encoding, and mining of cancer vaccine trial data. It divides the fields of the cancer vaccine trial data into four types, encodes the data by means of an "integer segmentation label method", and finally treats the encoded data as the "items" of data mining in order to mine them. The method has the advantage that knowledge and rules can be discovered from cancer vaccine trial data, which is significant for research in life science and pharmaceutical engineering.

Description

A cancer vaccine trial data encoding and processing method based on data mining
Technical field
The invention belongs to the field of bioinformatics and originates from the engineering practice of building a "biomolecular information database".
Background art
The present invention performs data mining on cancer vaccine trial data. Mining these data involves two steps: first, encoding the cancer vaccine trial data; second, mining the encoded data. Both steps rest on a feature analysis of the cancer vaccine trial data.
Clinical medical trials often produce large amounts of data. Being able to mine these data would be of great value to clinical medical research and development.
Cancer vaccine trial data are a kind of clinical trial data: a worldwide compilation of information on cancer vaccine trials that have been carried out. This trial information is important for avoiding repeated trials and wasted trial resources, and for reducing the unnecessary harm to subjects that results when adverse events and negative findings from these trials go unpublished.
Cancer vaccine trial data are broad in coverage, have many fields, and are highly specialized. How to mine them is therefore a technical challenge.
Commonly used methods for processing cancer vaccine data are the "mathematical statistics method", the "empirical method" and the "dynamic system method". The "mathematical statistics method" fits the statistical regularities of cancer vaccine trial data using techniques such as means and variances; it can find only surface-level regularities, not latent ones. The "empirical method" uses experience to judge the regularities present in the data and to build a model, which is then verified and revised against real data. The "dynamic system method" treats the cancer vaccine trial data as a dynamic system, models it with a system of dynamic parameters, and then checks its dynamic characteristics and development trend in actual operation.
The present invention proposes a cancer vaccine trial data encoding and processing method based on data mining. The idea is as follows: first, perform feature analysis on the cancer vaccine trial data; next, classify the attributes of the data according to their features; then, formulate a different encoding scheme for each class according to its characteristics; finally, perform association rule mining on the encoded data in order to discover latent knowledge and rules.
The encoding and processing method is illustrated below with an example:
First, the features of the cancer vaccine trial data are analyzed. This shows that every record has an attribute named "Recruitment". This attribute records the recruitment status of subjects for a particular cancer vaccine trial and takes one of eight values: "Completed", "Active, not recruiting", "Enrolling by invitation", "Not yet recruiting", "Recruiting", "Suspended", "Terminated" and "Withdrawn", meaning respectively: completed, active but not recruiting, enrolling by invitation, not yet recruiting, recruiting, suspended, terminated, and withdrawn. We first use "100" to identify the "Recruitment" attribute, then use 01 to 08 to represent its eight values, and combine the attribute identifier with the value identifier into a single code, namely 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008. Finally, these codes are treated as "items" in data mining, and association rule mining is used to discover latent knowledge and rules from them. Here an "item" is the basic unit of data mining, that is, the smallest unit of data processing, and can be an integer.
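The Recruitment example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RECRUITMENT_ATTR_ID`, `RECRUITMENT_STATES`, `RECRUITMENT_CODES` and `encode_recruitment` are hypothetical and do not appear in the patent; only the attribute identifier 100 and the resulting codes 10001 to 10008 come from the text.

```python
# Illustrative sketch; all identifiers are hypothetical, only the codes come from the text.
RECRUITMENT_ATTR_ID = 100  # "100" identifies the Recruitment attribute

# The eight recruitment states, in the order the text assigns them value identifiers.
RECRUITMENT_STATES = [
    "Completed",
    "Active, not recruiting",
    "Enrolling by invitation",
    "Not yet recruiting",
    "Recruiting",
    "Suspended",
    "Terminated",
    "Withdrawn",
]

# Combine the attribute identifier with a two-digit value identifier (01..08).
RECRUITMENT_CODES = {
    state: RECRUITMENT_ATTR_ID * 100 + index
    for index, state in enumerate(RECRUITMENT_STATES, start=1)
}


def encode_recruitment(state: str) -> int:
    """Return the five-digit item for a Recruitment value (KeyError if unknown)."""
    return RECRUITMENT_CODES[state]
```

For instance, `encode_recruitment("Recruiting")` yields the item 10005 used later in the field-by-field encoding.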
In short, encoding and processing cancer vaccine trial data is meaningful work, and the encoding and processing method proposed by the present invention has the following characteristics: (1) Flexibility: attributes can be manually reduced, or given different preference weights, according to prior knowledge. (2) Wide applicability: the method can be applied in different fields; besides cancer vaccine trial data, it can be used to process biological data, data from the social sciences, online sales data, and so on. (3) Ease of implementation: the method is easy to program, since it needs only three steps: "feature analysis", "encoding" and "association rule mining".
Summary of the invention
The invention discloses a cancer vaccine trial data encoding and processing method based on data mining, which comes from the field of bioinformatics. The method is of significant value for exploration in life science and for the development of pharmaceutical engineering.
In the present invention:
We mainly address the "feature analysis of cancer vaccine trial data", the "encoding scheme for cancer vaccine trial data" and the "processing method for cancer vaccine trial data".
1. Feature analysis of cancer vaccine trial data
(1) Clear data structure
As of January 1, 2011, the cancer vaccine trial data set comprised 1051 records, each with 24 attribute fields. The number of records keeps growing, while the number of attribute fields remains relatively stable.
Table 1 shows a structural sample of the cancer vaccine trial data set.
Table 1. Structure of the cancer vaccine trial data set
Figure BSA00000460196900031
Figure BSA00000460196900041
Table 1 shows the 24 attribute fields of the cancer vaccine trial data set. In the table,
"can be simplified" means that the information can be located and displayed via other information (for example, the primary key or a candidate key), so it can be set aside during the encoding and mining stages and then located and displayed by primary key once mining is finished.
The eight fixed states of Recruitment are "Completed", "Active, not recruiting", "Enrolling by invitation", "Not yet recruiting", "Recruiting", "Suspended", "Terminated" and "Withdrawn".
The four fixed values of Gender are "Male", "Female", "Both" and blank (not filled in).
The six fixed values of Age Groups are "Adult", "Adult|Senior", "Child", "Child|Adult", "Child|Adult|Senior" and "Senior".
The eight fixed stages of Phases are "Phase 0", "Phase I", "Phase I|Phase II", "Phase II", "Phase II|Phase III", "Phase III", "Phase IV" and blank (not filled in).
The 16 funder values of Funded by are "Industry", "Industry|NIH", "Industry|NIH|U.S. Fed", "Industry|Other", "NIH", "NIH|Other", "Other", "Other|Industry", "Other|NIH", "Other|NIH|Industry", "Other|U.S. Fed", "Other|U.S. Fed|Industry", "Other|U.S. Fed|NIH", "U.S. Fed", "U.S. Fed|NIH" and "U.S. Fed|Other".
In addition, URL is the web address of each cancer vaccine trial, through which more detailed information about the trial can be found. The URL corresponds one-to-one with each cancer vaccine trial, is unique, and is therefore a candidate key. It can be set aside during the encoding and mining stages and located by primary key after mining is finished.
(2) Open data source
The cancer vaccine trial data come from the ClinicalTrials.gov website and can be downloaded openly. This makes it convenient to encode and mine the data. Researchers around the world can download the same data set to compare the effectiveness of their processing methods and the efficiency of their algorithms. It also facilitates cooperation and exchange among researchers worldwide.
(3) Broad data coverage
The cancer vaccine trial data cover every aspect of cancer vaccine trials and can largely meet the needs of professional cancer vaccine researchers and cancer scientists. If more detail is required, it can be accessed through the URL provided in the cancer vaccine trial data.
2. Encoding scheme for cancer vaccine trial data
First, all fields are divided into four classes: the first class contains 7 fields, the second 12 fields, the third 3 fields, and the fourth 2 fields. Each is described below:
Class 1: fields that can be digitized directly. These fields have a limited set of values (generally no more than 20 fixed values). Examples are "Recruitment", "Study Results" and "Gender".
Class 2: fields that must be classified before they can be encoded. There are two cases: string fields, which must first be categorized and then encoded; and numeric or time fields, which must first be divided into intervals and then encoded. The former include "Conditions" and "Interventions"; the latter include "First Received" and "Start Date".
Class 3: fields whose processing can be simplified. There are two cases: other identifiers (or abbreviations), such as the "Other IDs" and "Acronym" fields; and fields containing longer descriptive text, such as the "Title" field. The content of these fields is extensive and varied, but each can be uniquely located by the primary key.
Class 4: primary keys or candidate keys, for example "NCT ID" and "URL". Primary keys and candidate keys cannot yield any association rules during data mining, because any item containing a primary key or candidate key always has an occurrence frequency of 1. Although encoding primary keys and candidate keys is very simple, they are usually "pruned" before data mining.
Finally, applying the above classification, we obtain the classification and processing table for cancer vaccine trial data shown in Table 2.
Table 2. Classification and processing of cancer vaccine trial data
Figure BSA00000460196900051
Table 2 has two columns. The first column gives the class, and the second gives the fields handled by that class's processing method.
Fields in Class 1 are encoded in order of their fixed values. Fields in Class 2 are first categorized or divided into intervals, and each category or interval is then encoded. Fields in Class 3 are either simply categorized and then encoded, or left unprocessed. Fields in Class 4 are either encoded with natural numbers or left unprocessed.
Note the principle for handling empty entries (fields left unfilled): if the field is numeric, the mean of all values of that field is used; if it is a string, the entry is uniformly set to "Other" and treated as a separate value category.
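The empty-entry rule can be illustrated with a small Python sketch. It assumes that missing entries are represented as `None` (or, for strings, as empty text) and that the mean is taken over the entries that are present; the function names `fill_numeric` and `fill_text` are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence


def fill_numeric(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing numeric entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    average = mean(known)
    return [average if v is None else v for v in values]


def fill_text(values: Sequence[Optional[str]]) -> List[str]:
    """Replace missing or blank string entries with the category "Other"."""
    return [v if v is not None and v.strip() else "Other" for v in values]
```

With these assumptions, `fill_numeric([10, None, 30])` gives `[10, 20, 30]`, and `fill_text(["Male", "", None])` gives `["Male", "Other", "Other"]`.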
The detailed encoding scheme is as follows:
We devise an "integer segmentation label method" for encoding: one value of one field is represented by a five-digit integer, which is called an "item". The first three digits identify the attribute, and the last two digits identify the particular value of that attribute. For example, 10001 represents the Recruitment field taking the value "Completed", where 100 identifies the Recruitment attribute and 01 identifies the value "Completed". Similarly, 10302 represents the Age Groups field taking the value "Adult|Senior", where 103 identifies the Age Groups attribute and 02 identifies the value "Adult|Senior".
Fig. 1 shows the data-mining-based encoding scheme for cancer vaccine trial data, that is, the "integer segmentation label method".
In Fig. 1, one data mining "item" is represented by a five-digit number. The first three digits are the "attribute identifier" and the last two digits are the "attribute value identifier".
Analysis: because the attribute identifier has three digits, it can represent 1000 different attribute fields (000-999). For regularity we generally use the 900 numbers from 100 to 999, since 900 different attribute fields far exceed the needs of a typical cancer vaccine trial. Similarly, the attribute value identifier has two digits and can represent 100 different values (00-99), which also far exceeds what is needed for discretizing and classifying attribute values in cancer vaccine trials.
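A minimal sketch of the five-digit item layout follows, assuming the item is stored as a plain Python integer. The function names `encode_item` and `decode_item` are hypothetical; the ranges 100-999 and 00-99 are the ones given in the analysis above.

```python
from typing import Tuple


def encode_item(attr_id: int, value_id: int) -> int:
    """Pack a three-digit attribute identifier and a two-digit value identifier."""
    if not 100 <= attr_id <= 999:
        raise ValueError("attribute identifier must be in 100..999")
    if not 0 <= value_id <= 99:
        raise ValueError("value identifier must be in 0..99")
    return attr_id * 100 + value_id


def decode_item(item: int) -> Tuple[int, int]:
    """Split a five-digit item back into (attribute identifier, value identifier)."""
    return divmod(item, 100)
```

For example, `encode_item(103, 2)` gives 10302 (Age Groups, "Adult|Senior"), and `decode_item(10302)` returns `(103, 2)`.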
Accordingly, the encoding scheme of each field is as follows:
Encoding of the Recruitment field: 10001 represents "Completed"; 10002 represents "Active, not recruiting"; 10003 represents "Enrolling by invitation"; 10004 represents "Not yet recruiting"; 10005 represents "Recruiting"; 10006 represents "Suspended"; 10007 represents "Terminated"; 10008 represents "Withdrawn".
Encoding of the Study Results field: 10100 represents "no results"; 10101 represents "has results".
Encoding of the Gender field: 10201 represents "Male"; 10202 represents "Female"; 10203 represents "Both"; 10204 represents blank (not filled in).
Encoding of the Age Groups field: 10301 represents "Adult"; 10302 represents "Adult|Senior"; 10303 represents "Child"; 10304 represents "Child|Adult"; 10305 represents "Child|Adult|Senior"; 10306 represents "Senior".
Encoding of the Phases field: 10400 represents "Phase 0"; 10401 represents "Phase I"; 10412 represents "Phase I|Phase II"; 10402 represents "Phase II"; 10423 represents "Phase II|Phase III"; 10403 represents "Phase III"; 10404 represents "Phase IV"; 10405 represents blank (not filled in).
Encoding of the Funded by field: 10501 represents "Industry"; 10502 represents "Industry|NIH"; 10503 represents "Industry|NIH|U.S. Fed"; 10504 represents "Industry|Other"; 10505 represents "NIH"; 10506 represents "NIH|Other"; 10507 represents "Other"; 10508 represents "Other|Industry"; 10509 represents "Other|NIH"; 10510 represents "Other|NIH|Industry"; 10511 represents "Other|U.S. Fed"; 10512 represents "Other|U.S. Fed|Industry"; 10513 represents "Other|U.S. Fed|NIH"; 10514 represents "U.S. Fed"; 10515 represents "U.S. Fed|NIH"; 10516 represents "U.S. Fed|Other".
Encoding of the Study Types field: 10601 represents "Interventional"; 10602 represents "Observational".
Encoding of the Conditions field (each code represents values containing the given marker): 10701 "Acute"; 10702 "Advanced"; 10703 "Anal"; 10704 "Astrocytoma"; 10705 "Brain"; 10706 "Breast"; 10707 "Carcinoma"; 10708 "Cervical"; 10709 "Chronic"; 10710 "Colorectal"; 10711 "Epithelial"; 10712 "Fallopian"; 10713 "Glioblastoma"; 10714 "Glioma"; 10715 "Head and Neck"; 10716 "HIV"; 10717 "HPV"; 10718 "Infection"; 10719 "Inflammatory"; 10720 "Intraocular Melanoma"; 10721 "Kidney"; 10722 "Leukemia"; 10723 "Lung"; 10724 "Lymphoma"; 10725 "Malignant"; 10726 "Melanoma"; 10727 "Metastatic"; 10728 "Multiple"; 10729 "Myelodysplastic"; 10730 "Nasopharyngeal"; 10731 "Neoplasms"; 10732 "Neuroblastoma"; 10733 "Non-Hodgkin's"; 10734 "Non-Small Cell"; 10735 "Ovarian"; 10736 "Pancreatic"; 10737 "Papillomavirus"; 10738 "Prophylaxis"; 10739 "Prostate"; 10740 "Renal Cell"; 10741 "Sarcoma"; 10742 "Squamous"; 10743 "Stage III"; 10744 "Stage IV"; 10745 "Superficial"; 10746 represents all remaining values.
Encoding of the Interventions field (each code represents values containing the given marker): 10801 "Behavioral"; 10802 "Biological"; 10803 "Drug"; 10804 "Genetic"; 10805 "Other"; 10806 "Procedure"; 10807 "Radiation"; 10808 represents all remaining values.
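Marker-based fields such as Conditions and Interventions can be encoded by a keyword lookup. The sketch below uses the Interventions codes listed above and makes two assumptions that the text does not spell out: matching is a case-sensitive substring test, and a value carrying several markers produces several items. The function name and the sample strings are hypothetical.

```python
from typing import Set

# Marker -> item code for the Interventions field, as listed in the text.
INTERVENTION_MARKERS = {
    "Behavioral": 10801,
    "Biological": 10802,
    "Drug": 10803,
    "Genetic": 10804,
    "Other": 10805,
    "Procedure": 10806,
    "Radiation": 10807,
}
INTERVENTION_REMAINING = 10808  # code for values with none of the markers


def encode_interventions(text: str) -> Set[int]:
    """Return one item per marker found in the field value (assumed behaviour)."""
    items = {code for marker, code in INTERVENTION_MARKERS.items() if marker in text}
    return items or {INTERVENTION_REMAINING}
```

Under these assumptions, a hypothetical value such as `"Biological: peptide vaccine|Drug: cyclophosphamide"` maps to the items 10802 and 10803, while a value with no known marker maps to 10808.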
Encoding of the Sponsors field (each code represents values containing the given marker): 10901 "Baylor College of Medicine"; 10902 "Baylor Research Institute"; 10903 "Beth Israel Deaconess Medical Center"; 10904 "Dana-Farber Cancer Institute"; 10905 "Duke University"; 10906 "Fred Hutchinson Cancer Research Center"; 10907 "H. Lee Moffitt Cancer Center and Research Institute"; 10908 "Ludwig Institute for Cancer Research"; 10909 "M.D. Anderson Cancer Center"; 10910 "Mayo Clinic"; 10911 "Memorial Sloan-Kettering Cancer Center"; 10912 "National Institute of Allergy and Infectious Diseases"; 10913 "Oxford BioMedica"; 10914 "Sidney Kimmel Comprehensive Cancer Center"; 10915 "St. Jude Children's Research Hospital"; 10916 "Stanford University"; 10917 "University of California"; 10918 "University of Chicago"; 10919 "University of Maryland"; 10920 "University of Michigan Cancer Center"; 10921 "University of Pennsylvania"; 10922 "University of Pittsburgh"; 10923 "University of Virginia"; 10924 "University of Wisconsin"; 10925 represents all remaining values.
Encoding of the Enrollment field: 11001 represents values in the interval [0, 100]; 11002 represents values in [101, 200]; 11003 represents values in [201, 300]; 11004 represents values in [301, 400]; 11005 represents values in [401, 500]; 11006 represents values in [501, 600]; 11007 represents values in [601, 700]; 11008 represents values in [701, 800]; 11009 represents values in [801, 900]; 11010 represents values in [901, 1000]; 11011 represents values in [1001, 2000]; 11012 represents values in [2001, 10000]; 11013 represents values in [10001, +∞).
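Interval-based encoding of a numeric field can be sketched with a sorted list of upper bounds and a binary search. The names below are hypothetical, and the sketch assumes that the boundary value 2000 belongs to the [1001, 2000] interval (code 11011).

```python
from bisect import bisect_left

ENROLLMENT_ATTR_ID = 110

# Inclusive upper bounds of the first twelve intervals; larger values fall into the last one.
ENROLLMENT_UPPER_BOUNDS = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 10000]


def encode_enrollment(count: int) -> int:
    """Map an enrollment count to its interval item (11001..11013)."""
    if count < 0:
        raise ValueError("enrollment cannot be negative")
    # bisect_left returns the index of the first upper bound >= count.
    value_id = bisect_left(ENROLLMENT_UPPER_BOUNDS, count) + 1
    return ENROLLMENT_ATTR_ID * 100 + value_id
```

For example, a trial with 150 participants is encoded as 11002, and one with 25000 participants as 11013.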
Encoding of the Study Designs field (each code represents values containing the given marker): 11101 "Allocation: Non-Randomized"; 11102 "Allocation: Randomized"; 11103 "Control: Active Control"; 11104 "Control: Historical Control"; 11105 "Control: Uncontrolled"; 11106 "Endpoint Classification: Safety Study"; 11107 "Endpoint Classification: Safety"; 11108 "Masking: Open Label"; 11109 "Observational Model: Defined Population"; 11110 "Observational Model: Ecologic or Community"; 11111 represents all remaining values.
Encoding of the First Received field: 11201 represents 1999 and earlier; 11202 represents 2000; 11203 represents 2001; 11204 represents 2002; 11205 represents 2003; 11206 represents 2004; 11207 represents 2005; 11208 represents 2006; 11209 represents 2007; 11210 represents 2008; 11211 represents 2009; 11212 represents 2010; 11213 represents 2011; later years follow by analogy.
Encoding of the Start Date field: 11301 represents times before 1990; 11302 represents times from 1991 to 1998; 11303 represents 1999; 11304 represents 2000; 11305 represents 2001; 11306 represents 2002; 11307 represents 2003; 11308 represents 2004; 11309 represents 2005; 11310 represents 2006; 11311 represents 2007; 11312 represents 2008; 11313 represents 2009; 11314 represents 2010; 11315 represents 2011; later years follow by analogy.
Encoding of the Completion Date field: 11401 represents 1999 and earlier; 11402 represents 2000; 11403 represents 2001; 11404 represents 2002; 11405 represents 2003; 11406 represents 2004; 11407 represents 2005; 11408 represents 2006; 11409 represents 2007; 11410 represents 2008; 11411 represents 2009; 11412 represents 2010; 11413 represents 2011; later years follow by analogy.
Encoding of the Last Updated field: 11501 represents 2004 and earlier; 11502 represents 2005; 11503 represents 2006; 11504 represents 2007; 11505 represents 2008; 11506 represents 2009; 11507 represents 2010; 11508 represents 2011; later years follow by analogy.
Encoding of the Last Verified field: 11601 represents 1998 and earlier; 11602 represents 1999; 11603 represents 2000; 11604 represents 2001; 11605 represents 2002; 11606 represents 2003; 11607 represents 2004; 11608 represents 2005; 11609 represents 2006; 11610 represents 2007; 11611 represents 2008; 11612 represents 2009; 11613 represents 2010; 11614 represents 2011; later years follow by analogy.
Encoding of the Primary Completion Date field: 11701 represents 1999 and earlier; 11702 represents 2000; 11703 represents 2001; 11704 represents 2002; 11705 represents 2003; 11706 represents 2004; 11707 represents 2005; 11708 represents 2006; 11709 represents 2007; 11710 represents 2008; 11711 represents 2009; 11712 represents 2010; 11713 represents 2011; later years follow by analogy.
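Date fields that use one "and earlier" bucket followed by one code per year (First Received, Last Updated, Last Verified, Primary Completion Date) can share a single helper. The sketch below is an illustration only; the function name `encode_year` and its parameter `cutoff_year` (the last year folded into the "and earlier" bucket) are hypothetical. Start Date uses a different layout and is not covered.

```python
def encode_year(attr_id: int, cutoff_year: int, year: int) -> int:
    """Encode a year as attr_id + value identifier.

    Value identifier 01 covers cutoff_year and earlier; each later year
    gets the next identifier (cutoff_year + 1 -> 02, and so on).
    """
    value_id = 1 if year <= cutoff_year else year - cutoff_year + 1
    if value_id > 99:
        raise ValueError("year does not fit in a two-digit value identifier")
    return attr_id * 100 + value_id
```

For example, `encode_year(112, 1999, 2011)` gives 11213 for First Received, and `encode_year(115, 2004, 2011)` gives 11508 for Last Updated, matching the lists above.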
Encoding of the Outcome Measures field (each code represents values containing the given marker): 11801 "adverse events"; 11802 "Clinical response"; 11802 "Clinical tumor regression"; 11803 "Disease-free survival"; 11804 "Event-free survival"; 11805 "Geometric mean titers"; 11806 "Immune response"; 11807 "Immunologic response"; 11808 "Number of Participants"; 11809 "Occurrence"; 11810 "Overall survival"; 11811 "Progression-free survival"; 11812 "Response rate"; 11813 "safety"; 11814 "Time to treatment failure"; 11815 "To evaluate the safety"; 11816 "Toxicity"; 11817 represents all remaining values.
3. Processing method for cancer vaccine trial data
Processing cancer vaccine trial data specifically means mining it: the code of each field value is treated as an "item" in data mining, and the associations among items are mined by counting how often different items occur together. Here an "item" is the basic unit of data mining and identifies a commodity or an attribute value.
The method steps are as follows:
First, read the encoded cancer vaccine trial data into memory. The data file consists of a number of lines, each representing one cancer vaccine trial record. In each line, the value of every field of the trial is represented using the encoding given above. Such a data file is thus a regular text file made up of numbers and separators. After the separators are removed, each line of the data file is represented in memory as a "set". Each set consists of several "items". An item is a piece of encoded data delimited by separators, with the following structure:
Figure BSA00000460196900111
Specifically referring to Fig. 1.
Second, the in-memory linked lists are built. Each set is stored as a linked list. While the lists are being built, the total number of sets and the number of distinct items are counted. The items of each line are first connected into a singly linked list, and the per-line lists are then chained together to form a set of linked lists, which corresponds to the whole data set.
Third, the frequent 1-itemsets are generated from the linked lists. The frequency of each item is checked; items that do not reach the minimum support are deleted and items that reach or exceed it are kept. The resulting reduced linked list is called the "frequent 1-itemsets".
Fourth, the frequent K-itemsets are generated. Candidate 2-itemsets are generated from the frequent 1-itemsets and pruned to give the frequent 2-itemsets, and the frequent K-itemsets are then generated recursively. Specifically, the items in the frequent 1-itemsets are paired to form 2-itemsets, called the "candidate 2-itemsets"; the frequency of each candidate is then checked, candidates below the minimum support are deleted, and those greater than or equal to it are kept, which gives the "frequent 2-itemsets". The process continues in the same way until the frequent K-itemsets are generated, where the frequent K-itemsets are the largest frequent itemsets that the recursion can produce.
Fifth, association rules are generated from the frequent itemsets. The frequent K-itemsets contain all frequently occurring items. For each frequent itemset, for example "ABC", dividing the frequency of "ABC" by the frequency of "A" gives the confidence of the association rule "A → BC". That is, conf(A → BC) = supp(ABC) / supp(A), where conf(A → BC) is the confidence of the rule "A → BC", supp(A) is the frequency of "A", and supp(ABC) is the frequency of "ABC". Other rules are obtained in the same way.
Fig. 2 shows the processing flow for cancer vaccine trial data.
In Fig. 2, the cancer vaccine trial data is first read into memory and stored as linked lists. The frequency of each item is counted, and items that do not meet the minimum support threshold are deleted, which gives the frequent 1-itemsets. The frequent K-itemsets are then generated recursively from the frequent 1-itemsets (every frequent itemset must be filtered by the minimum support). Finally, association rules are generated from the frequent K-itemsets.
Characteristics of this processing method: (1) The text file is read only once. After the data is read into memory it is stored as linked lists; later accesses operate on the linked lists and do not re-read the text file, which speeds up processing. (2) The data is checked for validity, that is, for whether it is suitable for mining. When the data is stored in linked-list form, the method checks whether it conforms to the encoding rules for cancer vaccine data and whether it contains characters that cannot be processed, such as abnormal characters or blanks. (3) Mixed data is supported: for example, the data "10012#$%" can be processed as 10012.
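The validity check and the handling of mixed data in points (2) and (3) can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `clean_item` and the rule "keep a leading run of exactly five digits" are hypothetical, since the text does not spell out how trailing characters are stripped.

```python
import re

# Assumed rule: a valid token starts with exactly five digits;
# anything after those digits (for example "#$%") is discarded.
_LEADING_CODE = re.compile(r"\s*(\d{5})(?!\d)")


def clean_item(token: str):
    """Return the 5-digit code at the start of token, or None if there is none."""
    match = _LEADING_CODE.match(token)
    return int(match.group(1)) if match else None


print(clean_item("10012#$%"))  # 10012
print(clean_item("#$%"))       # None
```

Tokens for which the function returns None would be the ones a loader rejects as unsuitable for mining.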
Description of the drawings
The principle and processing flow of the present invention are described below with reference to the accompanying drawings, in which:
Fig. 1 is a diagram of the data-mining-based encoding scheme for cancer vaccine trial data.
Fig. 2 is a flow chart of the processing of cancer vaccine trial data.
Embodiment
The "cancer vaccine trial data encoding and processing method based on data mining" of the present invention is further described below with reference to the accompanying drawings.
The method arises from a practical need: it encodes cancer vaccine trial data according to the requirements of data mining and then mines it (knowledge discovery). The invention comprises two closely connected parts, "cancer vaccine trial data encoding" and "cancer vaccine trial data processing". Encoding is the necessary preparation for processing, and processing is the ultimate purpose of encoding. The embodiment is described step by step below:
First, the cancer vaccine trial data is classified. Cancer vaccine trial data is a collection of records of cancer vaccine trials carried out around the world. It includes the purpose of each trial, its participants, conditions, study type, start and end dates and so on, and serves as a reference for scientists engaged in cancer vaccine research. The classification should therefore satisfy two requirements: (1) keep the original core information; (2) make mining convenient. Based on our study, the cancer vaccine trial data is divided into four classes:
Class 1: Recruitment, Study Results, Gender, Age Groups, Phases, Funded Bys, Study Types. The fields in this class take a finite number of values.
Class 2: Conditions, Interventions, Sponsors, Enrollment, Study Designs, First Received, Start Date, Completion Date, Last Updated, Last Verified, Primary Completion Date, Outcome Measures. The fields in this class take an unlimited number of values, but the values can be divided into a finite number of intervals.
Class 3: Other IDs, Acronym, Title. The values of these fields are natural-language descriptions.
Class 4: NCT ID, URL. The values of these fields are unique (never repeated).
Second, the cancer vaccine trial data is encoded. The encoding aims at efficiency and readability. "Efficiency" means that the codes are easy to store, easy to read and easy to process efficiently; "readability" means that the codes are easy to recognise and remember. We devised an encoding called the "integer section identification method", which represents one cancer vaccine trial data value with a five-digit integer. The first three digits identify the attribute and the last two digits identify the value of that attribute. For example:
10001 represents the Recruitment field value "Completed": 100 stands for the Recruitment field and 01 for "Completed".
11001 represents an Enrollment field value in the interval [0, 100]: 110 stands for the Enrollment field and 01 for [0, 100].
11501 represents a Last Updated field value of 2004 or earlier: 115 stands for the Last Updated field and 01 for 2004 and earlier years.
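The split into a three-digit attribute part and a two-digit value part amounts to simple integer arithmetic. The following Python sketch illustrates it; the function names `encode` and `decode` are assumptions introduced for illustration and do not appear in the original.

```python
def encode(attribute: int, value: int) -> int:
    """Pack a 3-digit attribute id and a 2-digit value id into one code."""
    if not 0 <= attribute <= 999:
        raise ValueError("attribute id must fit in three digits")
    if not 0 <= value <= 99:
        raise ValueError("value id must fit in two digits")
    return attribute * 100 + value


def decode(code: int) -> tuple[int, int]:
    """Split a code into (attribute id, value id)."""
    return divmod(code, 100)


print(encode(100, 1))  # 10001: Recruitment = "Completed"
print(decode(11501))   # (115, 1): Last Updated, 2004 or earlier
```

Keeping the code as a plain integer means that two codes belong to the same field exactly when their quotients by 100 are equal.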
Third, the encoded cancer vaccine trial data is saved in a text file. The file is a data file in which each line represents one cancer vaccine trial record. Each line consists of numbers and separators: the numbers are codes produced by the encoding scheme described above, and the separator is ",".
For example:
10001,10100,...,11817
10002,10100,...,11817
are the first two lines of such a file.
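Reading such a file into one set of items per line might look like the Python sketch below. The function name `read_transactions` is an assumption, and the sample records are shortened, hypothetical lines with three fields each rather than real trial data.

```python
import io


def read_transactions(stream, separator=","):
    """Return one set of integer items per non-empty line of the stream."""
    transactions = []
    for line in stream:
        fields = [field.strip() for field in line.split(separator)]
        items = {int(field) for field in fields if field}
        if items:
            transactions.append(items)
    return transactions


# Hypothetical three-field records standing in for a real data file.
sample = io.StringIO("10001,10100,11817\n10002,10100,11817\n")
transactions = read_transactions(sample)
print(len(transactions))        # 2
print(sorted(transactions[0]))  # [10001, 10100, 11817]
```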
Fourth, the cancer vaccine trial information is read from the text file. The codes are read from the encoded file through I/O operations, split on the separator. Each code is stored in a "node" data structure, designed as follows:
ItemNode is the name of the node type, item is the item held by the node, support is the frequency with which that item occurs, and nextItemnode is the address of the next node.
Through the nextItemnode of each node, several items can be linked into a singly linked list, and one such list represents one line of the encoded file. All the singly linked lists are then chained together, which gives the in-memory form of the whole data set.
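A node with the three members named above could be sketched in Python as follows. The field name `next_item_node`, the helpers `build_row` and `iterate`, and the use of a plain Python list to hold the per-line lists are all assumptions made for brevity; the original chains the per-line lists together as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemNode:
    item: int                                    # encoded field value, e.g. 10001
    support: int = 0                             # how often this item occurs
    next_item_node: Optional["ItemNode"] = None  # next node in the same line


def build_row(codes):
    """Link the codes of one record into a singly linked list; return its head."""
    head = None
    for code in reversed(codes):
        head = ItemNode(item=code, next_item_node=head)
    return head


def iterate(head):
    """Yield the items of one row in order."""
    node = head
    while node is not None:
        yield node.item
        node = node.next_item_node


rows = [build_row([10001, 10100, 11817]), build_row([10002, 10100, 11817])]
print([list(iterate(row)) for row in rows])
# [[10001, 10100, 11817], [10002, 10100, 11817]]
```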
Fifth, all the singly linked lists are scanned and the frequency of each item is counted. Each list is scanned from beginning to end; identical items are grouped together and their count is recorded as the frequency of that item.
Sixth, items that do not meet the support threshold are deleted from all the singly linked lists. The reduced lists make up the frequent 1-itemsets of the cancer vaccine trial data, whose defining property is that each single item occurs at least as often as the support threshold. For example, 10001 occurs 32 times, which is greater than the support threshold of 30, so it is kept; 11001 occurs 8 times, which is less than 30, so it is deleted.
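The counting and pruning in the fifth and sixth steps can be illustrated with a short Python sketch. The function name `frequent_items` and the synthetic records, built so that 10001 occurs 32 times and 11001 occurs 8 times, are assumptions; sets stand in for the linked lists.

```python
from collections import Counter


def frequent_items(transactions, min_support):
    """Count each item and keep those occurring at least min_support times."""
    counts = Counter(item for record in transactions for item in record)
    return {item: n for item, n in counts.items() if n >= min_support}


# Synthetic records: 10001 appears in all 32, 11001 in only 8 of them.
transactions = [{10001}] * 24 + [{10001, 11001}] * 8
print(frequent_items(transactions, 30))  # {10001: 32}
```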
Seventh, the candidate 2-itemsets are produced by combination from the frequent 1-itemsets. Any two items are combined. For example, given the three items "10001", "10101" and "10201", the combinations are {"10001, 10101", "10001, 10201", "10101, 10201"}. With 10 items there are C(10, 2) = 45 combinations.
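The pairing step maps directly onto a standard combinations routine. The short Python sketch below reproduces the three-item example and the count for 10 items; it is an illustration, not the original implementation.

```python
from itertools import combinations
from math import comb

items = ["10001", "10101", "10201"]
pairs = list(combinations(items, 2))
print(pairs)
# [('10001', '10101'), ('10001', '10201'), ('10101', '10201')]

# Number of candidate pairs that 10 frequent items would produce.
print(comb(10, 2))  # 45
```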
Eighth, the frequent 2-itemsets are produced from the candidate 2-itemsets according to the support threshold. The frequency of every candidate 2-itemset is counted, and candidates whose frequency does not reach the support threshold are deleted. What remains is the frequent 2-itemsets, in which the two items occur together at least as often as the support threshold. For example, "10001" and "10101" occur together 30 times, which reaches the support threshold of 30, so the pair is kept; "10001" and "10201" occur together 8 times, which is less than 30, so the pair is deleted.
Ninth, in the same way, candidate 3-itemsets are generated by combining the frequent 2-itemsets and are then pruned by support to give the frequent 3-itemsets, and so on until the frequent K-itemsets are finally generated. The frequent K-itemsets are the largest frequent itemsets that the recursion can produce. For example, if at most 10 items occur together and they occur together 30 times, then because 30 is greater than or equal to the minimum support threshold of 30, the frequent K-itemset with K = 10 is the maximal frequent itemset.
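The level-by-level generation described in the seventh to ninth steps can be sketched in Python as below. This is a simplified, Apriori-style illustration written as a loop rather than a recursion; the names `support` and `mine_frequent_itemsets` and the synthetic records are assumptions, and sets replace the linked lists.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of records that contain every item of itemset."""
    return sum(1 for record in transactions if itemset <= record)


def mine_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    level = {}
    for item in {i for record in transactions for i in record}:
        count = support(frozenset([item]), transactions)
        if count >= min_support:
            level[frozenset([item])] = count
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune candidates that do not reach the support threshold.
        level = {}
        for candidate in candidates:
            count = support(candidate, transactions)
            if count >= min_support:
                level[candidate] = count
    return frequent


# Synthetic records: {10001, 10101} together 30 times, 10201 only twice.
records = [{10001, 10101}] * 30 + [{10001, 10201}] * 2
result = mine_frequent_itemsets(records, 30)
print(result[frozenset({10001, 10101})])       # 30
print(max(len(itemset) for itemset in result))  # 2
```

In this small example the largest frequent itemset has K = 2, because no candidate with three items can be formed from a single frequent pair.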
Tenth, association rules are generated from the frequent itemsets. Each frequent itemset S is divided into two parts A and B, and the confidence of the rule "A → B" is obtained from the formula Conf(A → B) = Supp(S) / Supp(A), where Conf(A → B) is the confidence of "A → B", Supp(S) is the support of S and Supp(A) is the support of A. For example, the support of the frequent itemset "10001, 10101" is 30 and the support of "10001" is 32, so the confidence of the association rule "10001 → 10101" is 30/32 = 0.9375; the support of the itemset "10001, 10101, 10201" is 8, so the confidence of the association rule "10001 → 10101, 10201" is 8/32 = 0.25.
Finally, valid association rules are selected according to the minimum confidence threshold. For example, with a confidence threshold of 0.8, the rule "10001 → 10101" is selected because its confidence 0.9375 is greater than 0.8, and the rule "10001 → 10101, 10201" is excluded because its confidence 0.25 is less than 0.8.
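The confidence calculation and the final selection can be checked with the numbers from the example above. In this Python sketch the function name `rules_from` and the `supports` table are assumptions; the table holds only the three supports quoted in the text, so splits whose antecedent has no known support are skipped.

```python
from itertools import combinations


def rules_from(itemset, supports):
    """Yield (antecedent, consequent, confidence) for each split of itemset
    whose antecedent has a known support."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            a = frozenset(antecedent)
            if a in supports:
                yield a, itemset - a, supports[itemset] / supports[a]


supports = {
    frozenset({10001}): 32,
    frozenset({10001, 10101}): 30,
    frozenset({10001, 10101, 10201}): 8,
}
min_confidence = 0.8

selected = []
for itemset in supports:
    for a, b, conf in rules_from(itemset, supports):
        if conf >= min_confidence:
            selected.append((sorted(a), sorted(b), conf))

print(selected)  # [([10001], [10101], 0.9375)]
```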
The generated association rules are stored in a text file. They can be browsed, queried and matched by reading that file.
It should be noted that the present invention supports various kinds of clinical trial data. By classifying, encoding and mining clinical trial data, the knowledge and rules hidden in it can be obtained, providing a valuable reference for life science and pharmaceutical engineering.

Claims (6)

1. A cancer vaccine trial data encoding and processing method based on data mining, characterized in that feature analysis [1] is first performed on the cancer vaccine trial data, "integer section identification method" encoding [2] is then performed, and association rule mining [3] is performed last, wherein:
Feature analysis [1] of the cancer vaccine trial data means analysing the values taken by each field attribute [4] of the data and dividing the attributes into different types according to those values, so that each type can be encoded separately;
"Integer section identification method" encoding [2] means encoding the data according to the result of feature analysis [1] using the "integer section identification method", in which one code is represented by an integer of length 5, the first 3 digits identifying the attribute and the last 2 digits identifying the value of that attribute;
Association rule mining [3] means treating the data encoded by "integer section identification method" encoding [2] as "items" in data mining and discovering the associations between them by counting how often different items occur together, where an "item" is the basic unit of data mining and identifies a commodity or the value of an attribute.
2. The cancer vaccine trial data encoding and processing method based on data mining according to claim 1, characterized in that feature analysis [1] means analysing the values taken by the field attributes [4] of the cancer vaccine trial data and classifying the attributes according to those values, the specific steps being: "determine the type of the attribute values", "count the number of distinct values of the attribute", and "assign the attribute to a processing type".
3. The cancer vaccine trial data encoding and processing method based on data mining according to claim 1, characterized in that "integer section identification method" encoding [2] represents one code with an integer of length 5, in which the first 3 digits identify the attribute, range from 000 to 999 and can identify 1000 different attributes in total; the last 2 digits identify the value of that attribute, range from 00 to 99 and can identify 100 different values in total; multiplying the two, 1000 × 100 = 100000, gives the maximum number of cases that the "integer section identification method" can represent.
4. The cancer vaccine trial data encoding and processing method based on data mining according to claim 1, characterized in that association rule mining [3] is used to process the encoded file of cancer vaccine trial data and consists of the steps "read the cancer vaccine trial data into memory", "build the in-memory linked lists", "generate the frequent 1-itemsets", "combine into candidate K-itemsets", "generate the frequent K-itemsets" and "generate association rules".
5. The cancer vaccine trial data encoding and processing method based on data mining according to claim 1, characterized in that the field attributes [4] are the trial attributes inherent in the cancer vaccine trial data, through which every aspect of a cancer vaccine trial can be understood.
6. The cancer vaccine trial data encoding and processing method based on data mining according to claim 1, characterized in that all codes produced by "integer section identification method" encoding [2] are stored in a text file, each line of which represents one transaction record and consists of a number of integers separated by ",".

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110074609XA CN102156825A (en) 2011-03-28 2011-03-28 Cancer vaccine trial data encoding and processing method based on data mining

Publications (1)

Publication Number Publication Date
CN102156825A true CN102156825A (en) 2011-08-17

Family

ID=44438319

Country Status (1)

Country Link
CN (1) CN102156825A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110817