WO2024161278A1 - A system and method for element traceability

A system and method for element traceability

Info

Publication number
WO2024161278A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
given
sources
sequences
source
Prior art date
Application number
PCT/IB2024/050803
Other languages
French (fr)
Inventor
Ciaran Meghen
Ian William RICHARDSON
Yuan FU
Stephen David Edward PARK
Mari Janika Higgins
Mohammad Adib MAKROONI
Original Assignee
Identigen Limited
Priority date
Filing date
Publication date
Application filed by Identigen Limited
Publication of WO2024161278A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to the field of element traceability.
  • Element traceability is an umbrella term encompassing systems and methods for identifying, tracking, and/or tracing items, features, or information as it moves through a process (e.g., tracing raw materials as they move through a production process, during which they are used to produce a product).
  • Traceability inter alia may serve as a critical tool for operationalizing standards and regulations to improve product safety control (e.g., by enabling public and private sector actors to verify that products meet market and/or regulatory requirements) and may assist with responding to safety breaches.
  • a system for determining whether a given source contributed to the formation of a given product comprising a processing circuitry configured to: obtain: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculate a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucle
  • the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • a system for detecting a non-compliance in a given product produced from a group of sources comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following: (a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product; (b) from the linear regression
  • the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
  • the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not normally distributed.
  • (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
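The regression step recited above can be illustrated with a minimal sketch. It assumes an ordinary least squares fit of one predictor (the allele frequency in each source) against one response (the allele frequency attributed to that source in the product); the function name `fit_line` and the example frequencies are hypothetical and do not come from the application.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x.

    Returns (a, b, residuals). The fit minimizes the sum of squared
    differences between the observed and the fitted values.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals


# Hypothetical allele frequencies of one SNP across five sources.
source_freq = [0.0, 0.5, 1.0, 0.5, 0.0]
product_freq = [0.05, 0.45, 0.95, 0.55, 0.0]

intercept, slope, residuals = fit_line(source_freq, product_freq)
sum_of_squares = sum(r * r for r in residuals)
print(intercept, slope, sum_of_squares)
```

The residuals returned here are the quantities whose distribution would then be examined; how the application turns them into a compliance decision is not visible in the truncated text above, so that step is left out.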
  • a system for detecting whether a product, produced from a group of sources, contains substance derived from a sick source comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following: (a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product; (b)
  • prior to generating the Q-Q plot, the processing circuitry is configured to determine whether the residuals are normally distributed, and, upon the distribution being non-normal, the processing circuitry moves to the Q-Q plot generating step.
  • the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
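To make the Q-Q plot step more concrete, the following sketch computes the points of a normal Q-Q plot for a list of residuals. It is an illustration only: the plotting positions `(i + 0.5) / n` are one common convention chosen here as an assumption, and the function name `qq_points` and the sample residuals are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Return (theoretical quantile, standardized sample quantile) pairs.

    Points lying close to the line y = x suggest approximately normal
    residuals; systematic departures suggest non-normality.
    """
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mu, sigma = mean(residuals), pstdev(residuals)
    if sigma == 0:
        raise ValueError("residuals have zero spread")
    # Standardize and sort the observed residuals.
    ordered = sorted((r - mu) / sigma for r in residuals)
    # Theoretical quantiles of the standard normal distribution.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals from a per-SNP regression.
points = qq_points([0.2, -0.1, 0.0, 0.1, -0.2])
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:+.3f}  {sample_q:+.3f}")
```

The pairs can be passed to any plotting tool; the sketch stops at the numbers because the application does not state how the plot is rendered or read.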
  • a method for determining whether a given source contributed to the formation of a given product comprising: obtaining: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-
  • the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • a method for detecting a non-compliance in a given product produced from a group of sources comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleo
  • the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
  • the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not normally distributed.
  • (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
  • a method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product; (b) from the linear regression model
  • prior to generating the Q-Q plot, the processing circuitry is configured to determine whether the residuals are normally distributed, and, upon the distribution being non-normal, the processing circuitry moves to the Q-Q plot generating step.
  • the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
  • the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for determining whether a given source contributed to the formation of a given product, the method comprising: obtaining: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucle
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting a non-compliance in a given product produced from a group of sources, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source
  • Fig. 1 is a schematic illustration of an environment on which a system for element traceability operates, in accordance with the presently disclosed subject matter;
  • Fig. 2 is a block diagram schematically illustrating one example of a system for element traceability, in accordance with the presently disclosed subject matter;
  • Fig. 3 is a flowchart illustrating an example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter;
  • Fig. 4 is an illustration of an element traceability process, operated by a system for element traceability, in accordance with the presently disclosed subject matter;
  • Fig. 5 is a flowchart illustrating another example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter;
  • Fig. 6 is an illustration of an exemplary linear regression model of a given nucleotide-based sequence, produced by a system for element traceability, in accordance with the presently disclosed subject matter;
  • Fig. 7 is a flowchart illustrating yet another example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter;
  • Figs. 8A and 8B are illustrations of exemplary Q-Q (Quantile-Quantile) plots, produced by a system for element traceability, in accordance with the presently disclosed subject matter.
  • should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g. digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
  • The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or nonvolatile computer memory technology suitable to the application.
  • The phrases "for example", "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter.
  • Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter.
  • The appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).
  • Fig. 1 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter.
  • Each module in Fig. 2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein.
  • The modules in Fig. 2 may be centralized in one location or dispersed over more than one location.
  • the system may comprise fewer, more, and/or different modules than those shown in Fig. 2. Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
  • Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
  • Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
  • FIG. 1 showing a schematic illustration of an environment on which a system for element traceability operates, in accordance with the presently disclosed subject matter.
  • environment 100 includes a process 102 composed of a plurality of stages, denoted "1" to "N" (N being an integer number representing any number of stages).
  • Process 102, which represents a route a given element (e.g., a product, an ingredient, a raw material, a part of a product, etc.) undergoes, may extend between an initial stage (stage "1"), at which the given element may be derived from a source 104, and a final stage (stage "N"), at which the given element, either in itself or in the form of a product produced from it, reaches its designated destination (e.g., a selling point at which the given element is being used/exploited/consumed, etc.).
  • source 104 may be a single source (e.g., a single animal (such as a cow, a goat, and the like), a single human, etc.) or a group of sources (e.g., a herd of animals (such as a herd of cows, a herd of goats, and the like), a farm, a group of farms, a group of humans, etc.).
  • source 104 may be of different types, e.g., human type, animal type (such as a cow, a goat, and the like), plant type, etc., and may therefore require the adaptation of the given element's derivation method to the source's type.
  • the stages composing process 102 may be separated into two or more sub-routes, at the end of each of which the given element may be at a designated point.
  • process 102 may be separated into (i) a first sub-route, known as a production sub-route, during which the given element may undergo various production or manufacturing processes, e.g., pasteurization, sterilization, fermentation, milling, etc., under dedicated conditions, until it reaches its designated form (e.g., a product form), and (ii) a second sub-route, known as a transportation sub-route, during which the designated form of the given element undergoes various transportation processes, e.g., sorting, packing, shipping, and the like, at the end of which the designated form of the given element reaches its designated destination (e.g., the selling point at which the designated form of the given element is being used/exploited/consumed, etc.).
  • process 102 is a supply chain representing a route a dairy product (e.g., cheese, butter, yogurt, ice cream, milk, condensed and dried milk, etc.) or a part dairy product (i.e. products containing dairy components such as cheese, milk, and the like) undergoes.
  • Process 102 extends from an initial stage (stage "1"), at which cows of a group of cows (represented by an image of a pair of cows) are milked for their raw milk, to a final stage (stage "N"), at which a dairy product, produced from the raw milk derived from the group of cows, reaches a dairy shop 106, at which it is sold.
  • the dairy product's route is composed of: (i) a dairy product production sub-route, extending from the initial stage (stage "1") to a stage at which the dairy product is kept in bottles ready for sale (stage "3"), and (ii) a dairy product transportation sub-route, extending from a stage at which the bottles ready for sale are packed in boxes (stage "4"), to the final stage at which the bottles reach the dairy shop 106, ready to be purchased by an end user (stage "N").
  • the raw milk derived from the group of cows undergoes sterilization and pasteurization processes, at the end of which the now sterilized and pasteurized milk is kept in said bottles, ready for transportation.
  • said bottles are transported in designated boxes, loaded on designated trucks, to dairy shop 106, where end users may purchase them.
  • Supply chain 102, and each of its sub-routes described above, may include fewer or more stages than those described hereinbefore, depending on the number of steps involved in each of them. It is to be further noted that supply chain 102, and each of its sub-routes, may include stages different from those described hereinbefore, mutatis mutandis.
  • Fig. 2 is a block diagram schematically illustrating one example of a system for element traceability 200, in accordance with the presently disclosed subject matter.
  • system 200 can comprise a network interface 206.
  • the network interface 206 (e.g., a network card, a Wi-Fi client, a Li-Fi client, a 3G/4G client, or any other component)
  • system 200 can receive, through network interface 206, one or more sets/collections of nucleotide-based sequences originated from one or more sources and/or one or more products.
  • System 200 can further comprise or be otherwise associated with a data repository 204 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data.
  • One or more second nucleotide-based sequences originated from one or more sources not intended to contribute to the formation of one or more finished products;
  • Data repository 204 can be further configured to enable retrieval and/or update and/or deletion of the stored data. It is to be noted that in some cases, data repository 204 can be distributed, while the system 200 has access to the information stored thereon, e.g., via a wired or wireless network to which system 200 is able to connect (utilizing its network interface 206).
  • System 200 further comprises processing circuitry 202.
  • Processing circuitry 202 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to the resources of system 200.
  • the processing circuitry 202 comprises an element traceability module 208, configured to perform an element traceability process, as further detailed herein, inter alia with reference to Figs. 3, 5 and 7.
  • In Fig. 3, there is shown a flowchart illustrating one example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform element traceability process 300, e.g., using element traceability module 208.
  • the element traceability process 300 is directed at detecting the presence of raw materials, derived from a given source, in a given finished product and, by that, to determine whether the given source contributed to the formation of the given finished product.
  • system 200 obtains: (i) a first nucleotide-based sequence originated from a given source (e.g., the vector of allele frequencies of the given source), (ii) a first collection of nucleotide-based sequences, each of which originated from a source of a plurality of sources (e.g., the vector of allele frequencies of the plurality of sources, i.e., the aggregated sum of the plurality of sources), optionally being part of a larger group of individuals not all of whom serve as sources for raw materials used to produce the given finished product, (iii) a second collection of nucleotide-based sequences, each of which originated from a given product (e.g., the vector of allele frequencies of the given product), and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product (e.g., the vector of allele frequencies of the source known not to be included in the given product), being part of said larger group of individuals (
  • Each nucleotide-based sequence of the nucleotide-based sequences of (i) to (iv) may be, in one example, a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence).
  • each nucleotide-based sequence may be a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence.
  • the plurality of sources from which the first collection of nucleotide-based sequences is obtained may be the group of sources mentioned in relation to Fig.
  • the second collection of nucleotide-based sequences may be nucleotide-based sequences obtained from the finished product mentioned in relation to Fig. 1.
  • the given source from which the first nucleotide-based sequence is acquired may or may not be a member of the group of sources from which the first collection of nucleotide-based sequences is obtained.
  • system 200 obtains: (i) an SNP sequence originated from a cow, denoted yj, (ii) a first collection of SNP sequences, each of which originated from a cow of a plurality of cows, denoted Popj, being part of a larger group of cows, (iii) a second collection of SNP sequences, each of which originated from a milk product, denoted Mj, and (iv) an SNP sequence originated from a cow known not to be included in the milk product, denoted y’j, being included in said larger group of cows.
  • System 200 calculates a first distance, associated with the given source, composed of (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-based sequence from the second collection of nucleotide-based sequences (block 304).
  • the distance of the first nucleotide-based sequence from the first and second collections may be, for example, Euclidean distance (though other forms of mathematical distancing may be applicable), directed to effectively measure the similarity or dissimilarity of an individual or a population from another individual or population. For example, in cases involving two individuals, "individual A” and "individual B", the genotypes of both individuals may be converted into a numeric form of 0, 0.5, or 1, which represents the number of minor alleles the individuals have at each genetic location divided by 2.
  • an absolute difference of the numeric genetic value at each genetic location is calculated, and the overall absolute differences at the various genetic locations are used to calculate a mean difference, representing a distance metric (ranging between 0 and 1) between "individual A” and "individual B".
  • The closer the distance metric is to 1, the more different "individual A" and "individual B" are; the closer it is to 0, the more similar they are.
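The genotype coding and mean absolute difference described above can be sketched as follows. The function names and the example genotypes are hypothetical; the sketch implements the mean-absolute-difference metric spelled out in the text, not the Euclidean alternative it also mentions.

```python
def encode_genotype(minor_allele_count):
    """Map a minor allele count (0, 1 or 2) to 0, 0.5 or 1."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return minor_allele_count / 2


def mean_abs_distance(genotypes_a, genotypes_b):
    """Mean absolute difference over genetic locations, between 0 and 1."""
    if len(genotypes_a) != len(genotypes_b) or not genotypes_a:
        raise ValueError("genotype vectors must be non-empty and equal in length")
    total = sum(abs(a - b) for a, b in zip(genotypes_a, genotypes_b))
    return total / len(genotypes_a)


# Hypothetical minor allele counts at four genetic locations.
individual_a = [encode_genotype(c) for c in (0, 1, 2, 1)]
individual_b = [encode_genotype(c) for c in (2, 1, 0, 1)]

distance = mean_abs_distance(individual_a, individual_b)
print(distance)  # 0.5: the two individuals differ fully at two of four locations
```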
  • system 200 calculates a first distance, associated with cow, yj.
  • the first distance is composed of (i) the distance of the SNP sequence originated from cow, yj, from the first collection of SNP sequences, originated from the plurality of cows, Popj, and (ii) the distance of the SNP sequence originated from cow, yj, from the second collection of SNP sequences, originated from the milk product, Mj.
  • System 200 then calculates a second distance, associated with the source known not to be included in the given product, composed of (i) a distance of the second nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the second nucleotide-based sequence from the second collection of nucleotide-based sequences (block 306).
  • system 200 calculates a second distance, associated with cow, y’j.
  • the second distance is composed of (i) the distance of the SNP sequence originated from cow, y’j, from the first collection of SNP sequences originated from the plurality of cows, denoted Popj, and (ii) the distance of the SNP sequence originated from cow, y’j, from the second collection of SNP sequences originated from the milk product, Mj.
  • system 200 determines that the given source contributed to the formation of the product (block 308).
  • system 200 determines that cow, yj, contributed to the formation of milk product, Mj, as the difference between the first distance, associated with cow, yj, and the second distance, associated with cow, y’j, is above a predefined threshold.
  • a two sample t-test may be applied.
  • the second nucleotide-based sequence originating from a source known not to be included in the given product may not only be used to generate an optional multiple-dimensional test statistic, as explained hereinbefore, but may also be used as a control group to improve the sensitivity and power of said system.
  • the optional two-sample t-test may provide a more comprehensive understanding of the spread and overlap of the two distributions, as well as increased sensitivity to detecting true differences between the groups, from which greater insight into the presence or absence of an individual may be gained. Comparing two different conditions (presence vs absence in a given finished product) may make the analysis more robust to individual-specific variation and may potentially reduce the proneness to Type I and Type II errors.
  • comparing the differences in genotypes between two different individuals can provide more context and insight into the nature and significance of the differences observed.
  • FIG. 5 there is shown a flowchart illustrating another example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform element traceability process 500, e.g., using element traceability module 208.
  • the element traceability process 500 is directed at detecting non-compliance in a given finished product produced from a group of sources and, by that, to detect the presence of materials not supposed to be in the given finished product.
  • system 200 obtains: (i) a set of nucleotide-based sequences from each given source of a group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product (block 502).
  • the set of nucleotide-based sequences may include a collection of nucleotide-based sequences common to all sources of the group of sources.
  • Each nucleotide-based sequence may be a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence), a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence, a DNA sequence with a different number of copies of a specific segment of DNA (Copy number variation (CNV)), a DNA sequence with a different number of short tandem repeats (STRs), etc.
  • system 200 obtains: (i) a set of SNP sequences from each cow of a plurality of cows used as sources for raw materials from which a milk product was produced, and (ii) the set of SNP sequences of each cow of said plurality of cows from the produced milk product.
  • system 200 generates a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given finished product (block 504 (a)).
  • Fig. 6 illustrates an exemplary simple linear regression model of a given nucleotide-based sequence from the set of nucleotide-based sequences, represented by graph 600.
  • Graph 600 consists of (i) an x-axis, denoted 602, representing the values of the allele frequency of the given nucleotide-based sequence in each source of the group of sources, and (ii) a y-axis, denoted 604, representing the values of the allele frequency of the given nucleotide-based sequence of each source of the group of sources in the given finished product.
  • graph 600 includes a plurality of points, each representing the meeting point of the values of the allele frequencies of the given nucleotide-based sequence for a given source of the group of sources, and a line 606, representing the underlying relationship between the two allele frequencies of the given nucleotide-based sequence.
  • Line 606 is associated with a linear equation consisting of a slope value and a y -intercept value.
  • For each given SNP sequence of the set of SNP sequences, system 200 generates a linear regression model, represented by a graph similar to graph 600, based on an allele frequency of the given SNP sequence in each given source of the group of sources compared to an allele frequency of the given SNP sequence of each given source in the produced milk product.
  • system 200 may implement other approaches such as Bayesian or non-parametric approaches.
  • system 200 obtains a residual, which is the model’s random error (e.g., the y-intercept value of the linear equation of the linear regression model) (block 504 (b)), and upon the residual exceeding a threshold, system 200 determines a non-compliance in the given finished product (block 504 (c)).
  • the model’s random error may be defined, for example, as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
  • the threshold may be a customized threshold for each given nucleotide-based sequence or a standard threshold suitable for all nucleotide-based sequences.
  • the threshold may be defined, for example, according to a tolerance threshold, being a percentage of sources of the group of sources for which at least one given nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
  • a non-compliance may be determined whenever the distribution of the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences is not normal (in the mathematical sense).
  • system 200 obtains a respective residual, being the y-intercept value of the linear equation of the linear regression model, and compares it to a standard threshold. As one of the respective residuals, associated with a given SNP sequence, exceeds a predefined threshold, system 200 determines a non-compliance in the given finished product.
  • FIG. 7 there is shown a flowchart illustrating yet another example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform element traceability process 700, e.g., using element traceability module 208.
  • the element traceability process 700 is directed at detecting whether a finished product, produced from a group of sources, contains substances derived from one or more sources subjected to a specific condition (e.g., sickness, and the like).
  • system 200 obtains: (i) a set of nucleotide-based sequences from each given source of a group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given finished product (block 702).
  • the set of nucleotide-based sequences may include a collection of nucleotide-based sequences common to all sources of the group of sources.
  • Each nucleotide-based sequence may be, in one example, a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence).
  • each nucleotide-based sequence may be a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence.
  • system 200 obtains: (i) a set of SNP sequences from each cow of a plurality of cows used as sources for raw materials from which a milk product was produced, and (ii) the set of SNP sequences of each cow of said plurality of cows from the produced milk product.
  • For each given nucleotide-based sequence of the set of nucleotide-based sequences, system 200 generates a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given finished product (in a similar manner to the manner described in relation to Figs. 5 and 6) (block 704 (a)).
  • For each given SNP sequence of the set of SNP sequences, system 200 generates a linear regression model, represented by a graph similar to graph 600, based on an allele frequency of the given SNP sequence in each given source of the group of sources compared to an allele frequency of the given SNP sequence of each given source in the produced milk product.
  • system 200 obtains a residual, which may be, for example, the model’s random error (e.g., the y-intercept value of the linear equation of the linear regression model) (block 704 (b)).
  • the model’s random error may be defined, for example, as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
  • Based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, system 200 generates a Q-Q (Quantile-Quantile) plot (block 706).
  • Figs. 8A and 8B illustrate exemplary Q-Q plots 800 and 800', respectively, each representing the distribution of residuals of a set of nucleotide-based sequences.
  • Q-Q plots 800 and 800' are each composed of: (i) an x-axis, denoted 802, representing the actual values of the residuals of each nucleotide-based sequence, and (ii) a y-axis, denoted 804, representing the expected values of the residuals of each nucleotide-based sequence.
  • Q-Q plots 800 and 800' include a plurality of points, each representing the meeting point of the actual and expected values of the residuals of each nucleotide-based sequence, and a line 806, representing the underlying relationship between the two residual values.
  • the majority of the plurality of points of Q-Q plot 800 are aligned along line 806, forming a normal distribution.
  • the plurality of points of Q-Q plot 800' form a deviation from line 806 at both of its edges, denoted 808a and 808b, forming a non-normal distribution.
  • system 200 determines that the product contains substance derived from a source subjected to a specific condition (e.g., sickness, and the like) (block 708).
  • system 200 determines whether the residuals of the set of nucleotide-based sequences distribute normally, and upon the distribution being a non-normal distribution, system 200 moves to the Q-Q plot generating step.
  • the step of generating a Q-Q plot may be replaced with a comparison to a predefined threshold representing the maximum absolute value of the residuals.
  • the use of said predefined threshold may be combined with an adjusted R-squared Multiple Regression Analysis.
  • system 200 may be used in any field of use involving complex sample mixtures.
  • system 200 may be utilized to identify and/or trace contributors for research purposes (e.g., clinical trials, and the like).
  • system 200 may be utilized to identify and/or trace contributors to the manufacture of biopharmaceutical products, in which animal by-products from complex mixed sources may be used (for example, gelatin, bovine serum albumin (BSA), and the like).
  • system can be implemented, at least partly, as a suitably programmed computer.
  • the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method.
  • the presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The presently disclosed subject matter aims to provide a system and method for element traceability. The system for element traceability includes a processing circuitry configured to perform at least one of: (i) detecting the presence of materials, derived from a given source, in a given product, (ii) detecting the presence of materials in the given product that were derived from sources subjected to a specific condition (e.g., sickness, and the like), and/or (iii) detecting the presence of materials not supposed to be in the given product.

Description

A SYSTEM AND METHOD FOR ELEMENT TRACEABILITY
TECHNICAL FIELD
The present invention relates to the field of element traceability.
BACKGROUND
Element traceability is an umbrella term encompassing systems and methods for identifying, tracking, and/or tracing items, features, or information as it moves through a process (e.g., tracing raw materials as they move through a production process, during which they are used to produce a product). Traceability inter alia may serve as a critical tool for operationalizing standards and regulations to improve product safety control (e.g., by enabling public and private sector actors to verify that products meet market and/or regulatory requirements) and may assist with responding to safety breaches.
Existing element traceability solutions remain insufficient, as they fail to provide adequate solutions for: (i) detecting the presence of materials, derived from a given source, in a given product, (ii) detecting the presence of materials in the given product that were derived from sources subjected to a specific condition (e.g., sickness, and the like), and/or (iii) detecting the presence of materials not supposed to be in the given product.
Thus, there is a need in the art for a new and improved system and method for element traceability.
GENERAL DESCRIPTION
In accordance with a first aspect of the presently disclosed subject matter, there is provided a system for determining whether a given source contributed to the formation of a given product, the system comprising a processing circuitry configured to: obtain: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculate a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-based sequence from the second collection of nucleotide-based sequences; calculate a second distance associated with the source known not to be included in the given product, wherein the second distance is composed of: (i) a distance of the second nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the second nucleotide-based sequence from the second collection of nucleotide-based sequences; and, upon a difference between the first distance and the second distance being above a threshold, determine that the given source contributed to the formation of the product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, following the calculation of the first and second distances a two sample t-test is applied.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
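By way of non-limiting illustration only, the first aspect may be sketched in Python as follows. The sketch codes each genotype as 0, 0.5, or 1 (the number of minor alleles divided by 2) and uses the mean absolute difference as the distance. The manner in which the two component distances are "composed" is not specified above; the sketch assumes a per-locus difference |y - Pop| - |y - M|. The function names, the example values, the threshold of 0.2, and the use of Welch's form of the two-sample t statistic are likewise assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def mean_abs_distance(a, b):
    """Mean absolute difference between two numeric genotype vectors (0..1)."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same loci")
    return mean(abs(x - y) for x, y in zip(a, b))


def per_locus_statistic(y, pop_freq, mix_freq):
    """Assumed composition of the two distances: |y - Pop| - |y - M| per locus.

    A positive value means the individual is closer to the product than to
    the reference population at that locus.
    """
    return [abs(g - p) - abs(g - m) for g, p, m in zip(y, pop_freq, mix_freq)]


def welch_t(sample_a, sample_b):
    """Two-sample t statistic (Welch's form, chosen here as an assumption)."""
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


def contributed(y, y_control, pop_freq, mix_freq, threshold):
    """True when the candidate's mean statistic exceeds the control's by more
    than the threshold."""
    first = mean(per_locus_statistic(y, pop_freq, mix_freq))
    second = mean(per_locus_statistic(y_control, pop_freq, mix_freq))
    return (first - second) > threshold


# Hypothetical example values (not taken from the disclosure).
pop_freq = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]    # allele frequencies in Pop
mix_freq = [0.9, 0.1, 0.8, 0.2, 0.9, 0.1]    # allele frequencies in product M
candidate = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]   # genotype of source y
control = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]     # genotype of control y'

decision = contributed(candidate, control, pop_freq, mix_freq, threshold=0.2)
t_stat = welch_t(per_locus_statistic(candidate, pop_freq, mix_freq),
                 per_locus_statistic(control, pop_freq, mix_freq))
```

In this sketch the control individual supplies the baseline against which the candidate is judged, so a single fixed cut-off on the candidate's statistic alone is not required; the t statistic would still need to be compared to a critical value chosen by the practitioner.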
In accordance with a second aspect of the presently disclosed subject matter, there is provided a system for detecting a non-compliance in a given product produced from a group of sources, the system comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following: (a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product; (b) from the linear regression model, obtain a residual, wherein the residual being the model’s random error; and, (c) upon the residual exceeding a threshold, determine a non-compliance in the given product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not distributing normally.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
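By way of non-limiting illustration only, the per-sequence regression of the second aspect may be sketched in Python as follows. The sketch fits an ordinary least-squares line relating the allele frequency in each source to the allele frequency of each source in the product, and treats the minimized sum of squared differences as the residual. Taking the residual sum of squares as the quantity compared to the threshold is an assumption, as are the function names, the SNP identifiers, the example values, and the threshold of 0.1.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def residual_sum_of_squares(x, y):
    """Sum of squared differences between observed y and the fitted line."""
    slope, intercept = fit_line(x, y)
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def non_compliant(snp_data, threshold):
    """Return the ids of sequences whose residual exceeds the threshold.

    snp_data maps a sequence id to (source_freqs, product_freqs), with one
    entry per source in each list.
    """
    return [snp for snp, (x, y) in snp_data.items()
            if residual_sum_of_squares(x, y) > threshold]


# Hypothetical example values (not taken from the disclosure).
snp_data = {
    "snp_1": ([0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 1.0, 0.5]),  # sources match product
    "snp_2": ([0.0, 0.5, 1.0, 0.5], [1.0, 0.0, 1.0, 0.5]),  # sources do not match
}
flagged = non_compliant(snp_data, threshold=0.1)
```

A per-sequence threshold could be supplied instead of a single value by passing a mapping of thresholds; the single value is used here only to keep the sketch short.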
In accordance with a third aspect of the presently disclosed subject matter, there is provided a system for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the system comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following: (a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product; (b) from the linear regression model, obtain a residual, wherein the residual being the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generate a Q-Q plot; and, upon at least one edge of the Q-Q plot being non-linear, determine that the product contains substance derived from an ill source.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, prior to the generating of the Q-Q plot the processing circuitry is configured to determine whether the residuals distribute normally, and upon the distribution being a non-normal distribution, the processing circuitry moves to the Q-Q plot generating step.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
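By way of non-limiting illustration only, the Q-Q step of the third aspect may be sketched in Python as follows. The sketch standardizes the residuals, pairs the sorted residuals with the quantiles expected under a standard normal distribution, and reports whether either edge of the plot departs from the straight reference line. The plotting positions (i + 0.5)/n, the tail fraction, the tolerance, the function names, and the example residuals are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (expected, actual) quantile pairs for standardized residuals."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    actual = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n - 1.
    expected = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, actual))


def tails_deviate(residuals, tail_fraction=0.1, tolerance=0.5):
    """Return (lower_edge_deviates, upper_edge_deviates).

    An edge is treated as non-linear when any point in that tail lies further
    than `tolerance` from the reference line actual = expected.
    """
    points = qq_points(residuals)
    k = max(1, int(len(points) * tail_fraction))
    lower = max(abs(a - e) for e, a in points[:k])
    upper = max(abs(a - e) for e, a in points[-k:])
    return lower > tolerance, upper > tolerance


# Hypothetical example residuals (not taken from the disclosure).
clean = [NormalDist(0, 0.02).inv_cdf((i + 0.5) / 50) for i in range(50)]
contaminated = clean[:-3] + [0.2, 0.2, 0.2]  # three extreme residuals

clean_edges = tails_deviate(clean)
contaminated_edges = tails_deviate(contaminated)
```

Because the residuals are standardized with their own mean and standard deviation, a few extreme residuals at one edge can also pull the opposite edge away from the reference line; a practitioner may prefer a robust scale estimate for that reason.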
In accordance with a fourth aspect of the presently disclosed subject matter, there is provided a method for determining whether a given source contributed to the formation of a given product, the method comprising: obtaining: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-based sequence from the second collection of nucleotide-based sequences; calculating a second distance associated with the source not included in the plurality of sources, wherein the second distance is composed of: (i) a distance of the second nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the second nucleotide-based sequence from the second collection of nucleotide-based sequences; and, upon a difference between the first distance and the second distance being above a threshold, determining that the given source contributed to the formation of the product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, following the calculation of the first and second distances a two sample t-test is applied.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
In accordance with a fifth aspect of the presently disclosed subject matter, there is provided a method for detecting a non-compliance in a given product produced from a group of sources, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product; (b) from the linear regression model, obtaining a residual, wherein the residual being the model’s random error; and, (c) upon the residual exceeding a threshold, determining a non-compliance in the given product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not distributing normally.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
In accordance with a sixth aspect of the presently disclosed subject matter, there is provided a method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product; (b) from the linear regression model, obtaining a residual, wherein the residual being the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generating a Q-Q plot; and, upon at least one edge of the Q-Q plot being non-linear, determining that the product contains substance derived from an ill source.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, prior to the generating of the Q-Q plot the processing circuitry is configured to determine whether the residuals distribute normally, and upon the distribution being a non-normal distribution, the processing circuitry moves to the Q-Q plot generating step.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
In one embodiment of the presently disclosed subject matter and/or embodiments thereof, the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
In accordance with a seventh aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for determining whether a given source contributed to the formation of a given product, the method comprising: obtaining: (i) a first nucleotide-based sequence originated from the given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from the given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with the given source, wherein the first distance is composed of: (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-based sequence from the second collection of nucleotide-based sequences; calculating a second distance associated with the source not included in the plurality of sources, wherein the second distance is composed of: (i) a distance of the second nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the second nucleotide-based sequence from the second collection of nucleotide-based sequences; and, upon a difference between the first distance and the second distance being above a threshold, determining that the given source contributed to the formation of the product.
In accordance with an eighth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting a non-compliance in a given product produced from a group of sources, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product; (b) from the linear regression model, obtaining a residual, wherein the residual being the model’s random error; and, (c) upon the residual exceeding a threshold, determining a non-compliance in the given product.
In accordance with a ninth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following: (a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product; (b) from the linear regression model, obtaining a residual, wherein the residual is the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generating a Q-Q plot; and, upon at least one edge of the Q-Q plot being nonlinear, determining that the product contains substance derived from an ill source.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1 is a schematic illustration of an environment on which a system for element traceability operates, in accordance with the presently disclosed subject matter;
Fig. 2 is a block diagram schematically illustrating one example of a system for element traceability, in accordance with the presently disclosed subject matter;
Fig. 3 is a flowchart illustrating an example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter;
Fig. 4 is an illustration of an element traceability process, operated by a system for element traceability, in accordance with the presently disclosed subject matter;
Fig. 5 is a flowchart illustrating another example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter;
Fig. 6 is an illustration of an exemplary linear regression model of a given nucleotide-based sequence, produced by a system for element traceability, in accordance with the presently disclosed subject matter;
Fig. 7 is a flowchart illustrating yet another example of a sequence of operations carried out by a system for element traceability, in accordance with the presently disclosed subject matter; and,
Figs. 8A and 8B are illustrations of exemplary Q-Q (Quantile-Quantile) plots, produced by a system for element traceability, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter.
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as “obtaining”, “calculating”, “determining”, “performing”, “generating”, or the like, include action and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g., such as electronic quantities, and/or said data representing the physical objects. The terms “computer”, “processor”, “processing resource”, “processing circuitry”, and “controller” should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g. digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or nonvolatile computer memory technology suitable to the application.
As used herein, the phrase "for example," "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).
It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Figs. 3, 5, and 7 may be executed. In embodiments of the presently disclosed subject matter one or more stages illustrated in Figs. 3, 5, and 7 may be executed in a different order and/or one or more groups of stages may be executed simultaneously. Fig. 1 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Fig. 2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Fig. 2 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Fig. 2.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
Any reference in the specification to "element traceability" should be applied mutatis mutandis to the ability to identify, track and trace products, ingredients, raw materials, parts of products, features, etc. as they move along a process, as detailed hereinafter in relation to Fig. 1.
Bearing this in mind, attention is drawn to Fig. 1, showing a schematic illustration of an environment on which a system for element traceability operates, in accordance with the presently disclosed subject matter.
As shown in the schematic illustration, environment 100 includes a process 102 composed of a plurality of stages, denoted "1" to "N" (N being an integer number representing any number of stages). Process 102, which represents a route a given element (e.g., a product, an ingredient, a raw material, a part of a product, etc.) undergoes, may extend between an initial stage (stage "1"), at which the given element may be derived from a source 104, and a final stage (stage "N"), at which the given element, either in itself or in the form of a product produced from it, reaches its designated destination (e.g., a selling point at which the given element is being used/exploited/consumed, etc.).
It is to be of note that source 104 may be a single source (e.g., a single animal (such as a cow, a goat, and the like), a single human, etc.) or a group of sources (e.g., a herd of animals (such as a herd of cows, a herd of goats, and the like), a farm, a group of farms, a group of humans, etc.). In addition, it is to be of note that source 104 may be of different types, e.g., human type, animal type (such as a cow, a goat, and the like), plant type, etc., and may therefore require the adaptation of the given element's derivation method to the source's type.
The stages composing process 102 may be separated into two or more sub-routes, at the end of each of which the given element may be at a designated point. For example, process 102 may be separated into (i) a first sub-route, known as a production sub-route, during which the given element may undergo various production or manufacturing processes, e.g., pasteurization, sterilization, fermentation, milling, etc., under dedicated conditions, until it reaches its designated form (e.g., a product form), and (ii) a second sub-route, known as a transportation sub-route, during which the designated form of the given element undergoes various transportation processes, e.g., sorting, packing, shipping, and the like, at the end of which the designated form of the given element reaches its designated destination (e.g., the selling point at which the designated form of the given element is being used/exploited/consumed, etc.).
By way of a non-limiting example (presented merely for purposes of better understanding the disclosed subject matter and not in any way intended to limit its scope), as illustrated in Fig. 1, process 102 is a supply chain representing a route a dairy product (e.g., cheese, butter, yogurt, ice cream, milk, condensed and dried milk, etc.) or a part dairy product (i.e. products containing dairy components such as cheese, milk, and the like) undergoes. Process 102 extends from an initial stage (stage "1"), at which cows of a group of cows (represented by an image of a pair of cows) are milked for their raw milk, to a final stage (stage "N"), at which a dairy product, produced from the raw milk derived from the group of cows, reaches a dairy shop 106, at which it is sold.
The dairy product's route is composed of: (i) a dairy product production sub-route, extending from the initial stage (stage "1") to a stage at which the dairy product is kept in bottles ready for sale (stage "3"), and (ii) a dairy product transportation sub-route, extending from a stage at which the bottles ready for sale are packed in boxes (stage "4"), to the final stage at which the bottles reach the dairy shop 106, ready to be purchased by an end user (stage "N").
During the dairy product production sub-route, the raw milk derived from the group of cows undergoes sterilization and pasteurization processes, at the end of which the now sterilized and pasteurized milk is kept in said bottles, ready for transportation. During the product transportation sub-route, said bottles are transported in designated boxes, loaded on designated trucks, to dairy shop 106, where end users may purchase them.
It is to be of note that supply chain 102, and each of its sub-routes described above, may include fewer or additional stages than those described hereinbefore, depending on the number of steps involved in each of them. It is to be further of note that supply chain 102, and each of its sub-routes, may include different stages than the stages described hereinbefore, mutatis mutandis.
Throughout process 102, described hereinbefore, fraud, malfunctions, mistakes, and the like, potentially affecting the composition of the designated form of the given element, may occur at any stage. The specific stage at which such situations (which may endanger the well-being of an end user) may occur may be challenging to detect, and as such, challenging to resolve. To ensure that the composition of the designated form of the given element includes only ingredients derived from sources intended to contribute to its composition, a system for element traceability of the presently disclosed subject matter operates, as will be described hereafter in reference to Figs. 3 and 5.
Attention is now drawn to a description of the components of a system for element traceability 200.
Fig. 2 is a block diagram schematically illustrating one example of a system for element traceability 200, in accordance with the presently disclosed subject matter.
In accordance with the presently disclosed subject matter, the system for element traceability 200 (also interchangeably referred to herein as “system 200”) can comprise a network interface 206. The network interface 206 (e.g., a network card, a Wi-Fi client, a Li-Fi client, 3G/4G client, or any other component), enables system 200 to communicate over a network with external systems and handles inbound and outbound communications from such systems. For example, system 200 can receive, through network interface 206, one or more sets/collections of nucleotide-based sequences originated from one or more sources and/or one or more products.
System 200 can further comprise or be otherwise associated with a data repository 204 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data. Some examples of data that can be stored in the data repository 204 include:
• One or more first nucleotide-based sequences originated from one or more sources intended to contribute to the formation of one or more finished products;
• One or more second nucleotide-based sequences originated from one or more sources not intended to contribute to the formation of one or more finished products;
• One or more first distances associated with one or more given sources intended to contribute to the formation of one or more finished products;
• One or more second distances associated with one or more given sources not intended to contribute to the formation of one or more finished products;
• One or more linear regression models associated with one or more nucleotide-based sequences;
• One or more residuals, each associated with a linear regression model, representing the respective model's random error;
• One or more tolerance thresholds;
• One or more Q-Q (Quantile-Quantile) plots of residuals of respective linear regression models; etc.
Data repository 204 can be further configured to enable retrieval and/or update and/or deletion of the stored data. It is to be noted that in some cases, data repository 204 can be distributed, while the system 200 has access to the information stored thereon, e.g., via a wired or wireless network to which system 200 is able to connect (utilizing its network interface 206).
System 200 further comprises processing circuitry 202. Processing circuitry 202 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant resources of system 200 and for enabling operations related to those resources.
The processing circuitry 202 comprises an element traceability module 208, configured to perform an element traceability process, as further detailed herein, inter alia with reference to Figs. 3, 5 and 7.
Turning to Fig. 3, there is shown a flowchart illustrating one example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
Accordingly, the system for element traceability 200 (also interchangeably referred to hereafter as “system 200”) can be configured to perform element traceability process 300, e.g., using element traceability module 208. The element traceability process 300 is directed at detecting the presence of raw materials, derived from a given source, in a given finished product and, by that, to determine whether the given source contributed to the formation of the given finished product.
For this purpose, system 200 obtains: (i) a first nucleotide-based sequence originated from a given source (e.g., the vector of allele frequencies of the given source), (ii) a first collection of nucleotide-based sequences, each of which originated from a source of a plurality of sources (e.g., the vector of allele frequencies of the plurality of sources, for example, the aggregated sum of the plurality of sources), optionally being part of a larger group of individuals not all of whom serve as sources for raw materials used to produce the given finished product, (iii) a second collection of nucleotide-based sequences, each of which originated from a given product (e.g., the vector of allele frequencies of the given product), and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product (e.g., the vector of allele frequencies of the source known not to be included in the given product), being part of said larger group of individuals (block 302).
It is to be of note that for cases relating to the vector of allele frequencies of (i) to (iv), we only require their genetic profile. More particularly, we only require the genetic profile of the given source, the genetic profile of the plurality of sources (a single vector, we do not need access to the genetic profiles of all sources of said plurality of sources), the genetic profile of the given product (again, a single vector), and the genetic profile of the individual known not to be in the given product but present in the group of individuals.
Each nucleotide-based sequence of the nucleotide-based sequences of (i) to (iv) may be, in one example, a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence). In another example, each nucleotide-based sequence may be a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence. In one non-limiting example, the plurality of sources from which the first collection of nucleotide-based sequences is obtained may be the group of sources mentioned in relation to Fig. 1, from which the raw materials used to produce the finished product were obtained, whereas the second collection of nucleotide-based sequences may be nucleotide-based sequences obtained from the finished product mentioned in relation to Fig. 1. In addition, the given source from which the first nucleotide-based sequence is acquired may or may not be a member of the group of sources from which the first collection of nucleotide-based sequences is obtained.
By way of a non-limiting example (presented merely for purposes of better understanding the disclosed subject matter and not in any way intended to limit its scope), as illustrated in Fig. 4, system 200 obtains: (i) an SNP sequence originated from a cow, denoted yj, (ii) a first collection of SNP sequences, each of which originated from a cow of a plurality of cows, denoted Popj, being part of a larger group of cows, (iii) a second collection of SNP sequences, each of which originated from a milk product, denoted Mj, and (iv) an SNP sequence originated from a cow known not to be included in the milk product, denoted y’j, being included in said larger group of cows.
System 200 calculates a first distance, associated with the given source, composed of (i) a distance of the first nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the first nucleotide-based sequence from the second collection of nucleotide-based sequences (block 304).
The distance of the first nucleotide-based sequence from the first and second collections may be, for example, Euclidean distance (though other forms of mathematical distancing may be applicable), directed to effectively measure the similarity or dissimilarity of an individual or a population from another individual or population. For example, in cases involving two individuals, "individual A" and "individual B", the genotypes of both individuals may be converted into a numeric form of 0, 0.5, or 1, which represents the number of minor alleles the individuals have at each genetic location divided by 2. Following the conversion, an absolute difference of the numeric genetic value at each genetic location is calculated, and the absolute differences at the various genetic locations are averaged to obtain a mean difference, representing a distance metric (ranging between 0 and 1) between "individual A" and "individual B". The closer the distance metric is to 1, the more different "individual A" and "individual B" are; the closer the distance metric is to 0, the more similar they are.
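By way of a non-limiting illustration, the mean-absolute-difference metric described above may be sketched as follows. The function names `encode_genotype` and `genotype_distance`, and the sample genotype vectors, are hypothetical and are introduced here only for illustration; the sketch assumes that each genotype is supplied as a minor-allele count of 0, 1, or 2 per genetic location.

```python
from typing import Sequence


def encode_genotype(minor_allele_count: int) -> float:
    """Convert a minor-allele count (0, 1 or 2) into the numeric form 0, 0.5 or 1."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return minor_allele_count / 2


def genotype_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference of the encoded genotypes over all genetic locations.

    The result lies between 0 (identical genotypes) and 1 (opposite
    homozygotes at every location).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    diffs = [abs(encode_genotype(x) - encode_genotype(y)) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs)


# Hypothetical minor-allele counts at five genetic locations.
individual_a = [0, 1, 2, 2, 0]
individual_b = [0, 2, 2, 0, 1]
print(genotype_distance(individual_a, individual_b))  # 0.4
```

In this example the per-location absolute differences are 0, 0.5, 0, 1 and 0.5, so the mean difference is 0.4.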
In accordance with our non-limiting example of Fig. 4, system 200 calculates a first distance, associated with cow, yj. The first distance is composed of (i) the distance of the SNP sequence originated from cow, yj, from the first collection of SNP sequences, originated from the plurality of cows, Popj, and (ii) the distance of the SNP sequence originated from cow, yj, from the second collection of SNP sequences, originated from the milk product, Mj.
System 200 then calculates a second distance, associated with the source known not to be included in the given product, composed of (i) a distance of the second nucleotide-based sequence from the first collection of nucleotide-based sequences, and (ii) a distance of the second nucleotide-based sequence from the second collection of nucleotide-based sequences (block 306).
In accordance with our non-limiting example of Fig. 4, system 200 calculates a second distance, associated with cow, y’j. The second distance is composed of (i) the distance of the SNP sequence originated from cow, y’j, from the first collection of SNP sequences originated from the plurality of cows, denoted Popj, and (ii) the distance of the SNP sequence originated from cow, y’j, from the second collection of SNP sequences originated from the milk product, Mj.
Upon a difference between the first distance and the second distance being above a threshold, system 200 determines that the given source contributed to the formation of the product (block 308).
In accordance with our non-limiting example of Fig. 4, system 200 determines that cow, yj, contributed to the formation of milk product, M, as the difference between the first distance, associated with cow, yj, and the second distance, associated with cow, y’j, is above a predefined threshold.
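By way of a non-limiting illustration, blocks 304 to 308 may be sketched as follows. The specification does not state how the two component distances are combined into a single distance; the sketch assumes, purely for illustration, that the distance from the product is subtracted from the distance from the plurality of sources, so that a larger value indicates a profile closer to the product than to the plurality of sources. All function names, allele-frequency vectors, and the threshold value are hypothetical.

```python
from typing import Sequence


def mean_abs_diff(u: Sequence[float], v: Sequence[float]) -> float:
    """Mean absolute difference between two allele-frequency vectors."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def composed_distance(y, pop, product) -> float:
    """ASSUMPTION: combine the two component distances by subtraction.

    Positive values mean that y is closer to the product than to the
    plurality of sources.
    """
    return mean_abs_diff(y, pop) - mean_abs_diff(y, product)


def source_contributed(y, y_control, pop, product, threshold: float) -> bool:
    """Blocks 304-308: compare the given source against the known non-contributor."""
    first_distance = composed_distance(y, pop, product)
    second_distance = composed_distance(y_control, pop, product)
    return (first_distance - second_distance) > threshold


# Hypothetical allele-frequency vectors over four SNPs.
pop = [0.5, 0.5, 0.5, 0.5]        # plurality of sources (Popj)
product = [0.9, 0.1, 0.9, 0.1]    # milk product (Mj)
y = [1.0, 0.0, 1.0, 0.0]          # given cow (yj)
y_control = [0.0, 1.0, 0.0, 1.0]  # cow known not to be in the product (y'j)

print(source_contributed(y, y_control, pop, product, threshold=0.2))  # True
```

Swapping the roles of the two cows in this example yields a negative difference, which does not exceed the threshold.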
In some cases, following the calculation of the first and second distances, in order to assess the likelihood that the given source's genomic profile is present in the given product, a two sample t-test may be applied.
It is important to note that the second nucleotide-based sequence originating from a source known not to be included in the given product may not only be used to generate an optional multiple-dimensional test statistic, as explained hereinbefore, but may also be used as a control group to improve the sensitivity and power of said system. In more detail, by incorporating variability from two groups, the optional two-sample t-test may provide a more comprehensive understanding of the spread and overlap of the two distributions, as well as increased sensitivity to detecting true differences between the groups, from which greater insight into the presence or absence of an individual may be gained. Comparing two different conditions (presence vs absence in a given finished product) may make the analysis more robust to individual-specific variation and may potentially reduce the proneness to Type I and Type II errors. In addition, comparing the differences in genotypes between two different individuals (against the plurality of sources and the given finished product) can provide more context and insight into the nature and significance of the differences observed.
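By way of a non-limiting illustration, the optional two-sample t-test may be sketched as follows. The specification does not say which variant of the test is used, nor on which per-location quantity it operates; the sketch assumes Welch's unequal-variance variant, applied to a per-SNP statistic formed by subtracting the absolute difference from the product from the absolute difference from the plurality of sources. The function names and data are hypothetical, and only the test statistic and its degrees of freedom are computed, since the Python standard library does not provide the t-distribution needed for a p-value.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence, Tuple


def per_snp_statistic(y: Sequence[float], pop: Sequence[float],
                      product: Sequence[float]) -> List[float]:
    """ASSUMPTION: per-SNP value |y - Pop| - |y - M| (positive: closer to product)."""
    return [abs(a - p) - abs(a - m) for a, p, m in zip(y, pop, product)]


def welch_t(sample_1: Sequence[float], sample_2: Sequence[float]) -> Tuple[float, float]:
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    s1 = variance(sample_1) / n1  # squared standard error of mean 1
    s2 = variance(sample_2) / n2  # squared standard error of mean 2
    if s1 + s2 == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_1) - mean(sample_2)) / sqrt(s1 + s2)
    df = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical allele frequencies over six SNPs.
pop = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7]
product = [0.8, 0.2, 0.9, 0.3, 0.1, 0.9]
y = [1.0, 0.0, 1.0, 0.5, 0.0, 1.0]          # given source
y_control = [0.0, 1.0, 0.5, 1.0, 0.5, 0.0]  # source known not to be in the product

t_stat, dof = welch_t(per_snp_statistic(y, pop, product),
                      per_snp_statistic(y_control, pop, product))
print(t_stat, dof)
```

A large positive statistic suggests that the given source's profile sits closer to the product than the control's profile does; the resulting statistic would be compared against a t-distribution with the computed degrees of freedom.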
Turning to Fig. 5 there is shown a flowchart illustrating another example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
Accordingly, the system for element traceability 200 (also interchangeably referred to hereafter as “system 200”) can be configured to perform element traceability process 500, e.g., using element traceability module 208. The element traceability process 500 is directed at detecting non-compliance in a given finished product produced from a group of sources and, by that, to detect the presence of materials not supposed to be in the given finished product.
For this purpose, system 200 obtains: (i) a set of nucleotide-based sequences from each given source of a group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product (block 502). The set of nucleotide-based sequences may include a collection of nucleotide-based sequences common to all sources of the group of sources. Each nucleotide-based sequence may be a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence), a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence, a DNA sequence with a different number of copies of a specific segment of DNA (Copy number variation (CNV)), a DNA sequence with a different number of short tandem repeats (STRs), etc.
By way of a non-limiting example (presented merely for purposes of better understanding the disclosed subject matter and not in any way intended to limit its scope), system 200 obtains: (i) a set of SNP sequences from each cow of a plurality of cows used as sources for raw materials from which a milk product was produced, and (ii) the set of SNP sequences of each cow of said plurality of cows from the produced milk product.
Next, for the set of nucleotide-based sequences, system 200 generates a multiple linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given finished product (block 504 (a)).
Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. Therefore, for a simplified explanation, Fig. 6 illustrates an exemplary simple linear regression model of a given nucleotide-based sequence from the set of nucleotide-based sequences, represented by graph 600. Graph 600 consists of (i) an x-axis, denoted 602, representing the values of the allele frequency of the given nucleotide-based sequence in each source of the group of sources, and (ii) a y-axis, denoted 604, representing the values of the allele frequency of the given nucleotide-based sequence of each source of the group of sources in the given finished product. In addition, graph 600 includes a plurality of points, each representing the meeting point of the values of the allele frequencies of the given nucleotide-based sequence for a given source of the group of sources, and a line 606, representing the underlying relationship between the two allele frequencies of the given nucleotide-based sequence. Line 606 is associated with a linear equation consisting of a slope value and a y-intercept value.
In accordance with our non-limiting example, for each given SNP sequence of the set of SNP sequences, system 200 generates a linear regression model, represented by a graph similar to graph 600, based on an allele frequency of the given SNP sequence in each given source of the group of sources compared to an allele frequency of the given SNP sequence of each given source in the produced milk product.
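By way of a non-limiting illustration, a simple linear regression of the kind represented by graph 600 may be sketched as follows, using ordinary least squares. The function names and the allele-frequency values are hypothetical; `x` stands for the allele frequency of one SNP in each source (axis 602) and `y` for the allele frequency of the same SNP of each source in the finished product (axis 604).

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept (cf. line 606)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; the slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def point_residuals(x: Sequence[float], y: Sequence[float],
                    slope: float, intercept: float) -> List[float]:
    """Vertical distance of each observed point from the fitted line."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical values for one SNP across six cows.
x = [0.0, 0.5, 1.0, 0.5, 0.0, 1.0]     # frequency in each source
y = [0.05, 0.45, 0.95, 0.55, 0.0, 1.0]  # frequency of each source in the product

slope, intercept = fit_line(x, y)
print(slope, intercept)
print(point_residuals(x, y, slope, intercept))
```

For these values the fitted slope is approximately 0.95 and the y-intercept approximately 0.025; the per-point residuals of a least-squares fit sum to zero up to rounding.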
In some cases, instead of implementing the multiple linear regression model approach described hereinbefore, system 200 may implement other approaches such as Bayesian or non-parametric approaches.
Returning to Fig. 5, from the linear regression model, system 200 obtains a residual, which is the model’s random error (e.g., the y-intercept value of the linear equation of the linear regression model) (block 504 (b)), and upon the residual exceeding a threshold, system 200 determines a non-compliance in the given finished product (block 504 (c)).
In some cases, the model’s random error may be defined, for example, as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
In some cases, the threshold may be a customized threshold for each given nucleotide-based sequence or a standard threshold suitable for all nucleotide-based sequences. In addition, in some cases, the threshold may be defined, for example, according to a tolerance threshold, being a percentage of sources of the group of sources for which at least one given nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
In some cases, in addition, or alternatively to the above, a non-compliance may be determined whenever the distribution of the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences is not normal (in the mathematical sense).
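By way of a non-limiting illustration, a normality check of the residuals may be sketched numerically as follows. The sketch pairs the sorted residuals with standard-normal quantiles (the points of a Q-Q plot) and measures how close those points are to a straight line by their correlation coefficient. The plotting positions `(i + 0.5) / n`, the function names, the sample data, and the cutoff value of 0.9 are assumptions introduced for illustration and are not taken from the specification.

```python
from math import sqrt
from statistics import NormalDist, mean
from typing import List, Sequence, Tuple


def qq_points(residuals: Sequence[float]) -> List[Tuple[float, float]]:
    """Pairs of (theoretical normal quantile, ordered residual)."""
    n = len(residuals)
    if n < 3:
        raise ValueError("need at least three residuals")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))


def qq_correlation(residuals: Sequence[float]) -> float:
    """Pearson correlation of the Q-Q points; values near 1 indicate a straight line."""
    points = qq_points(residuals)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    if syy == 0:
        raise ValueError("residuals are all identical")
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def residuals_look_normal(residuals: Sequence[float], cutoff: float = 0.9) -> bool:
    """ASSUMPTION: treat a Q-Q correlation at or above the cutoff as 'normal'."""
    return qq_correlation(residuals) >= cutoff


# Hypothetical residuals: one set following a normal shape, one with extreme tails.
normal_like = [0.01 * NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
heavy_tailed = [0.0] * 18 + [5.0, -5.0]

print(residuals_look_normal(normal_like), residuals_look_normal(heavy_tailed))
```

The heavy-tailed set bends away from the straight line at both edges of the Q-Q plot, which lowers the correlation below the illustrative cutoff.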
In accordance with our non-limiting example, for each SNP sequence of the set of SNP sequences, system 200 obtains a respective residual, being the y-intercept value of the linear equation of the linear regression model, and compares it to a standard threshold. As one of the respective residuals, associated with a given SNP sequence, exceeds a predefined threshold, system 200 determines a non-compliance in the given finished product.
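By way of a non-limiting illustration, the comparison of residuals against a standard threshold or against customized per-sequence thresholds may be sketched as follows. The SNP identifiers, residual values, and threshold values are hypothetical, and comparing the absolute value of the residual to the threshold is an assumption introduced for illustration.

```python
from typing import Dict, List, Mapping, Optional


def non_compliant_sequences(
    residual_by_snp: Mapping[str, float],
    standard_threshold: float,
    custom_thresholds: Optional[Mapping[str, float]] = None,
) -> List[str]:
    """Return the identifiers of sequences whose residual exceeds its threshold.

    A customized threshold, when supplied for a sequence, takes precedence
    over the standard threshold.
    ASSUMPTION: the magnitude of the residual is what is compared.
    """
    custom = custom_thresholds or {}
    return [
        snp
        for snp, residual in residual_by_snp.items()
        if abs(residual) > custom.get(snp, standard_threshold)
    ]


def product_is_non_compliant(flagged: List[str]) -> bool:
    """Block 504 (c): one exceeding residual suffices to determine a non-compliance."""
    return len(flagged) > 0


# Hypothetical residuals for three SNP sequences.
residuals: Dict[str, float] = {"snp_1": 0.01, "snp_2": 0.12, "snp_3": -0.03}

flagged = non_compliant_sequences(residuals, standard_threshold=0.05)
print(flagged, product_is_non_compliant(flagged))  # ['snp_2'] True
```

Supplying, for instance, `custom_thresholds={"snp_2": 0.2}` would leave no sequence flagged in this example.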
Turning to Fig. 7 there is shown a flowchart illustrating yet another example of operations carried out by the system for element traceability 200, in accordance with the presently disclosed subject matter.
Accordingly, the system for element traceability 200 (also interchangeably referred to hereafter as “system 200”) can be configured to perform element traceability process 700, e.g., using element traceability module 208. The element traceability process 700 is directed at detecting whether a finished product, produced from a group of sources, contains substances derived from one or more sources subjected to a specific condition (e.g., sickness, and the like). For this purpose, system 200 obtains: (i) a set of nucleotide-based sequences from each given source of a group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given finished product (block 702). The set of nucleotide-based sequences may include a collection of nucleotide-based sequences common to all sources of the group of sources. Each nucleotide-based sequence may be, in one example, a DNA sequence with a germline substitution of a single nucleotide at a specific position in the sequence (i.e., a single-nucleotide polymorphism (SNP) DNA sequence). In another example, each nucleotide-based sequence may be a DNA sequence with a germline substitution of two or more nucleotides at specific positions in the sequence.
By way of a non-limiting example (presented merely for purposes of better understanding the disclosed subject matter and not in any way intended to limit its scope), system 200 obtains: (i) a set of SNP sequences from each cow of a plurality of cows used as sources for raw materials from which a milk product was produced, and (ii) the set of SNP sequences of each cow of said plurality of cows from the produced milk product.
For each given nucleotide-based sequence of the set of nucleotide-based sequences, system 200 generates a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given finished product (in a similar manner to the manner described in relation to Figs. 5 and 6) (block 704 (a)).
In accordance with our non-limiting example, for each given SNP sequence of the set of SNP sequences, system 200 generates a linear regression model, represented by a graph similar to graph 600, based on an allele frequency of the given SNP sequence in each given source of the group of sources compared to an allele frequency of the given SNP sequence of each given source in the produced milk product.
From the linear regression model of each given nucleotide-based sequence, system 200 obtains a residual, which may be, for example, the model’s random error (e.g., the y-intercept value of the linear equation of the linear regression model) (block 704 (b)). In some cases, the model’s random error may be defined, for example, as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
Based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, system 200 generates a Q-Q (Quantile-Quantile) plot (block 706). Figs. 8A and 8B illustrate exemplary Q-Q plots 800 and 800', respectively, each representing the distribution of residuals of a set of nucleotide-based sequences. Q-Q plots 800 and 800' are each composed of: (i) an x-axis, denoted 802, representing the actual values of the residuals of each nucleotide-based sequence, and (ii) a y-axis, denoted 804, representing the expected values of the residuals of each nucleotide-based sequence. In addition, Q-Q plots 800 and 800' include a plurality of points, each representing the meeting point of the actual and expected values of the residuals of each nucleotide-based sequence, and a line 806, representing the underlying relationship between the two residual values. As seen in Fig. 8A, the majority of the plurality of points of Q-Q plot 800 are aligned along line 806, indicating a normal distribution. In contrast, as seen in Fig. 8B, the plurality of points of Q-Q plot 800' deviate from line 806 at both of its edges, denoted 808a and 808b, indicating a non-normal distribution.
Returning to Fig. 7, upon at least one edge of the Q-Q plot being non-linear, system 200 determines that the product contains substance derived from a source subjected to a specific condition (e.g., sickness, and the like) (block 708).
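The Q-Q construction of block 706 and the edge check of block 708 may be sketched as follows. This is an illustrative sketch only: the plotting-position formula `(i + 0.5) / n`, the tail fraction, the tolerance, and the function names `qq_points` and `edge_deviates` are assumptions of this example, since the text does not specify how non-linearity at an edge is quantified. Each point pairs an actual residual (x-axis 802) with its expected value under a normal distribution fitted to the residuals (y-axis 804).

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple


def qq_points(residuals: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each sorted residual with its expected normal quantile.

    Assumes at least two residuals that are not all identical.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    return [
        (actual, reference.inv_cdf((i + 0.5) / n))
        for i, actual in enumerate(ordered)
    ]


def edge_deviates(
    points: Sequence[Tuple[float, float]],
    tail_fraction: float = 0.1,
    tolerance: float = 0.5,
) -> bool:
    """Return True if either edge of the plot departs from the reference line.

    The departure is the largest gap between actual and expected values in a
    tail, expressed in units of the residuals' standard deviation.
    """
    n = len(points)
    k = max(1, int(n * tail_fraction))
    sigma = stdev([actual for actual, _ in points])

    def tail_gap(tail: Sequence[Tuple[float, float]]) -> float:
        return max(abs(actual - expected) for actual, expected in tail) / sigma

    return tail_gap(points[:k]) > tolerance or tail_gap(points[-k:]) > tolerance


# Hypothetical residuals: 50 well-behaved values, then the same values plus
# two extreme residuals, one at each edge.
well_behaved = [NormalDist(0.0, 0.02).inv_cdf((i + 0.5) / 50) for i in range(50)]
heavy_tailed = well_behaved + [0.3, -0.3]

print(edge_deviates(qq_points(well_behaved)))
print(edge_deviates(qq_points(heavy_tailed)))
```

Measuring the gap in standard-deviation units keeps the tolerance independent of the scale of the residuals; any other measure of departure from line 806 could be substituted.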
In some cases, prior to the generating of the Q-Q plot, system 200 determines whether the residuals of the set of nucleotide-based sequences distribute normally, and upon the distribution being a non-normal distribution, system 200 moves to the Q-Q plot generating step.
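The preliminary normality determination may be sketched as follows. The text does not name a particular test; the Jarque-Bera statistic is used here purely as an assumed stand-in because it can be computed from sample skewness and kurtosis without external libraries. The function names and the significance level of 0.05 are hypothetical, and the test is asymptotic, so its result is unreliable for small sets of residuals.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def jarque_bera(residuals: Sequence[float]) -> Tuple[float, float]:
    """Return the Jarque-Bera statistic and its asymptotic p-value.

    Assumes the residuals are not all identical.
    """
    n = len(residuals)
    centre = sum(residuals) / n
    m2 = sum((r - centre) ** 2 for r in residuals) / n
    m3 = sum((r - centre) ** 3 for r in residuals) / n
    m4 = sum((r - centre) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality the statistic follows a chi-square distribution with two
    # degrees of freedom, whose survival function is exp(-x / 2).
    return statistic, math.exp(-statistic / 2.0)


def distributes_normally(residuals: Sequence[float], alpha: float = 0.05) -> bool:
    """Return False when normality is rejected at significance level alpha."""
    _, p_value = jarque_bera(residuals)
    return p_value >= alpha


# Hypothetical residuals, with and without two extreme values.
well_behaved = [NormalDist(0.0, 0.02).inv_cdf((i + 0.5) / 50) for i in range(50)]
heavy_tailed = well_behaved + [0.3, -0.3]

print(distributes_normally(well_behaved))
print(distributes_normally(heavy_tailed))
```

In this arrangement the Q-Q plot would be generated only when `distributes_normally` returns False.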
In some cases, other methods, alternative to the Q-Q plot generation, may be used in order to determine that the product contains substance derived from a source subjected to a specific condition. In one example, the step of generating a Q-Q plot may be replaced with a comparison to a predefined threshold representing the maximum absolute value of the residuals. In another example, the use of said predefined threshold may be combined with an adjusted R-squared Multiple Regression Analysis.
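The alternative based on a predefined threshold, optionally combined with an adjusted R-squared, may be sketched as follows. The adjusted R-squared formula is the standard one; the manner of combining the two criteria is not specified in the text, so the rule used here (flag when either criterion fails), the function names, and the numeric limits are all assumptions of this example.

```python
from typing import Sequence


def adjusted_r_squared(
    observed: Sequence[float], fitted: Sequence[float], n_predictors: int
) -> float:
    """Adjusted R-squared of a regression with n_predictors predictors.

    Assumes more observations than n_predictors + 1 and non-constant observed
    values.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_predictors - 1)


def exceeds_max_residual(residuals: Sequence[float], max_abs: float) -> bool:
    """Return True if any residual is larger in magnitude than max_abs."""
    return max(abs(r) for r in residuals) > max_abs


def flag_product(
    observed: Sequence[float],
    fitted: Sequence[float],
    n_predictors: int,
    max_abs: float,
    min_adjusted_r2: float,
) -> bool:
    """Assumed combination rule: flag when either criterion fails."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    return (
        exceeds_max_residual(residuals, max_abs)
        or adjusted_r_squared(observed, fitted, n_predictors) < min_adjusted_r2
    )


# Hypothetical observed and fitted values for five sequences.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]

print(adjusted_r_squared(observed, fitted, n_predictors=1))
print(flag_product(observed, fitted, 1, max_abs=0.25, min_adjusted_r2=0.9))
```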
It is to be noted that in addition to the use of system 200 described above, system 200 may be used in any field of use involving complex sample mixtures. In one example, system 200 may be utilized to identify and/or trace contributors for research purposes (e.g., clinical trials, and the like). In another example, system 200 may be utilized to identify and/or trace contributors to the manufacture of biopharmaceutical products, in which animal by-products from complex mixed sources may be used (for example, gelatin, bovine serum albumin (BSA), and the like).
It is to be noted, with reference to Figs. 3, 5 and 7, that some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It is to be further noted that some of the blocks are optional. It should be also noted that whilst the flow diagrams are described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.
It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims

CLAIMS:
1. A system for determining whether a given source contributed to the formation of a given product, the system comprising a processing circuitry configured to: obtain: (i) a first nucleotide-based sequence originated from said given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from said given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculate a first distance associated with said given source, wherein said first distance is composed of: (i) a distance of said first nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said first nucleotide-based sequence from said second collection of nucleotide-based sequences; calculate a second distance associated with said source not included in the plurality of sources, wherein said second distance is composed of: (i) a distance of said second nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said second nucleotide-based sequence from said second collection of nucleotide-based sequences; and, upon a difference between said first distance and said second distance being above a threshold, determine that said given source contributed to the formation of said product.
2. The system of claim 1, wherein following the calculation of the first and second distances a two sample t-test is applied.
3. The system of claim 1, wherein the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
4. A system for detecting a non-compliance in a given product produced from a group of sources, the system comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following:
(a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product;
(b) from the linear regression model, obtain a residual, wherein said residual being the model’s random error; and,
(c) upon the residual exceeding a threshold, determine a non-compliance in the given product.
5. The system of claim 4, wherein the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
6. The system of claim 4, wherein the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
7. The system of claim 4, wherein the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not distributing normally.
8. The system of claim 4, wherein: (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
9. A system for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the system comprising a processing circuitry configured to: obtain: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, perform the following:
(a) generate a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product;
(b) from the linear regression model, obtain a residual, wherein said residual being the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generate a Q-Q plot; and, upon at least one edge of the Q-Q plot being non-linear, determine that the product contains substance derived from an ill source.
10. The system of claim 9, wherein prior to the generating of the Q-Q plot the processing circuitry is configured to determine whether the residuals distribute normally, and upon the distribution being a non-normal distribution, the processing circuitry moves to the Q-Q plot generating step.
11. The system of claim 9, wherein the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
12. The system of claim 9, wherein the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
13. A method for determining whether a given source contributed to the formation of a given product, the method comprising: obtaining: (i) a first nucleotide-based sequence originated from said given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from said given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with said given source, wherein said first distance is composed of: (i) a distance of said first nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said first nucleotide-based sequence from said second collection of nucleotide-based sequences; calculating a second distance associated with said source not included in the plurality of sources, wherein said second distance is composed of: (i) a distance of said second nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said second nucleotide-based sequence from said second collection of nucleotide-based sequences; and, upon a difference between said first distance and said second distance being above a threshold, determining that said given source contributed to the formation of said product.
14. The method of claim 13, wherein following the calculation of the first and second distances a two sample t-test is applied.
15. The method of claim 13, wherein the first and second collections of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
16. A method for detecting a non-compliance in a given product produced from a group of sources, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following:
(a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product;
(b) from the linear regression model, obtaining a residual, wherein said residual being the model’s random error; and,
(c) upon the residual exceeding a threshold, determining a non-compliance in the given product.
17. The method of claim 16, wherein the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
18. The method of claim 16, wherein the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
19. The method of claim 16, wherein the non-compliance is also determined whenever the residuals of all the nucleotide-based sequences of the set of nucleotide-based sequences are not distributing normally.
20. The method of claim 16, wherein: (i) the threshold is defined according to a tolerance threshold, and (ii) the tolerance threshold is a percentage of sources of the group of sources for which at least one nucleotide-based sequence of the set of nucleotide-based sequences is lacking.
21. A method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following:
(a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product;
(b) from the linear regression model, obtaining a residual, wherein said residual being the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generating a Q-Q plot; and, upon at least one edge of the Q-Q plot being non-linear, determining that the product contains substance derived from an ill source.
22. The method of claim 21, wherein prior to the generating of the Q-Q plot the processing circuitry is configured to determine whether the residuals distribute normally, and upon the distribution being a non-normal distribution, the processing circuitry moves to the Q-Q plot generating step.
23. The method of claim 21, wherein the sets of nucleotide-based sequences are each composed of single-nucleotide polymorphism DNA sequences.
24. The method of claim 21, wherein the model’s random error is defined as a minimized sum of squares of differences between the allele frequency of the given nucleotide-based sequence in each given source of the group of sources and the allele frequency of the given nucleotide-based sequence of each given source in the given finished product.
25. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for determining whether a given source contributed to the formation of a given product, the method comprising: obtaining: (i) a first nucleotide-based sequence originated from said given source, (ii) a first collection of nucleotide-based sequences, each originated from a source of a plurality of sources being part of a group of individuals, not all of whom serve as sources, (iii) a second collection of nucleotide-based sequences, each originated from said given product, and (iv) a second nucleotide-based sequence originated from a source known not to be included in the given product, being part of said group of individuals; calculating a first distance associated with said given source, wherein said first distance is composed of: (i) a distance of said first nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said first nucleotide-based sequence from said second collection of nucleotide-based sequences; calculating a second distance associated with said source not included in the plurality of sources, wherein said second distance is composed of: (i) a distance of said second nucleotide-based sequence from said first collection of nucleotide-based sequences, and (ii) a distance of said second nucleotide-based sequence from said second collection of nucleotide-based sequences; and, upon a difference between said first distance and said second distance being above a threshold, determining that said given source contributed to the formation of said product.
26. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting a non-compliance in a given product produced from a group of sources, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following:
(a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given product;
(b) from the linear regression model, obtaining a residual, wherein said residual being the model’s random error; and,
(c) upon the residual exceeding a threshold, determining a non-compliance in the given product.
27. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor to perform a method for detecting whether a product, produced from a group of sources, contains substance derived from a sick source, the method comprising: obtaining: (i) a set of nucleotide-based sequences from each given source of the group of sources, wherein the set of nucleotide-based sequences includes a collection of nucleotide-based sequences common to all the sources of the group of sources, and (ii) the set of nucleotide-based sequences of each given source of the group of sources from the given dairy product; for each given nucleotide-based sequence of the set of nucleotide-based sequences, performing the following:
(a) generating a linear regression model based on an allele frequency of the given nucleotide-based sequence in each given source of the group of sources compared to an allele frequency of the given nucleotide-based sequence of each given source in the given dairy product;
(b) from the linear regression model, obtaining a residual, wherein said residual being the model’s random error; based on the residuals of the nucleotide-based sequences of the set of nucleotide-based sequences, generating a Q-Q plot; and, upon at least one edge of the Q-Q plot being non-linear, determining that the product contains substance derived from an ill source.
PCT/IB2024/050803 2023-01-30 2024-01-29 A system and method for element traceability WO2024161278A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP23154074 2023-01-30
EP23154074.1 2023-01-30

Publications (1)

Publication Number Publication Date
WO2024161278A1 true WO2024161278A1 (en) 2024-08-08

Family

ID=85150984

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2024/050803 WO2024161278A1 (en) 2023-01-30 2024-01-29 A system and method for element traceability

Country Status (1)

Country Link
WO (1) WO2024161278A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015051163A2 (en) * 2013-10-04 2015-04-09 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
US20170206311A1 (en) * 2008-07-23 2017-07-20 The Translational Genomics Research Institute Method of characterizing sequences from genetic material samples
US20170306390A1 (en) * 2001-02-02 2017-10-26 Mark W. Perlin Method and System for DNA Mixture Analysis
US20210024995A1 (en) * 2018-03-26 2021-01-28 Université de Liège Methods Involving Nucleic Acid Analysis of Milk

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
COPPIETERS WOUTER ET AL: "SNP-based quantitative deconvolution of biological mixtures: application to the detection of cows with subclinical mastitis by whole-genome sequencing of tank milk", 26 June 2020 (2020-06-26), pages 1 - 7, XP093061634, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7462076/pdf/1201.pdf> [retrieved on 20230706], DOI: 10.1101/gr.256172.119 *
GRAVERSEN THERESE ET AL: "Computational aspects of DNA mixture analysis", STATISTICS AND COMPUTING, SPRINGER US, NEW YORK, vol. 25, no. 3, 20 February 2014 (2014-02-20), pages 527 - 541, XP035479862, ISSN: 0960-3174, [retrieved on 20140220], DOI: 10.1007/S11222-014-9451-7 *
MARTINS FELIPE BITENCOURT ET AL: "A Semi-Automated SNP-Based Approach for Contaminant Identification in Biparental Polyploid Populations of Tropical Forage Grasses", FRONTIERS IN PLANT SCIENCE, vol. 12, 22 October 2021 (2021-10-22), XP093062002, DOI: 10.3389/fpls.2021.737919 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24702863

Country of ref document: EP

Kind code of ref document: A1