IL307661A - Multi-channel protein voxelization to predict variant pathogenicity using deep convolutional neural networks

Info

Publication number: IL307661A
Application number: IL307661A
Authority: IL
Original assignee: Illumina Inc; Illumina Cambridge Ltd
Priority date: 2021-04-15
Filing date: 2022-04-14
Publication date: 2023-12-01
Also published as: CA3215520A1; EP4323991A1; JP2024514894A; WO2022221593A1; MX2023012227A; MX2023012226A; CA3215514A1; IL307667A; WO2022221591A1; EP4323989A1; JP2024513995A; KR20230170680A; BR112023021266A2; KR20230170679A; BR112023021343A2; AU2022259667A1; AU2022258691A1

Claims

1. A system comprising:
a voxelizer that accesses a three-dimensional structure of a reference amino acid sequence of a protein, and fits a three-dimensional grid of voxels on atoms in the three-dimensional structure on an amino acid-basis to generate amino acid-wise distance channels, wherein each of the amino acid-wise distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
an alternative allele encoder that encodes an alternative allele amino acid to each voxel in the three-dimensional grid of voxels, wherein the alternative allele amino acid is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
an evolutionary conservation encoder that encodes an evolutionary conservation sequence to each voxel in the three-dimensional grid of voxels, wherein the evolutionary conservation sequence is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and wherein the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel; and
a convolutional neural network configured to:
apply three-dimensional convolutions to a tensor that includes the amino acid-wise distance channels encoded with the alternative allele amino acid and respective evolutionary conservation sequences, and
determine a pathogenicity of the variant nucleotide based at least in part on the tensor.
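The voxelizer element of claim 1 can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation: the function names, the 3x3x3 grid, the 2.0 voxel width, the distance cap of 10.0, and the integer amino acid indices are all hypothetical choices made for illustration. The sketch assumes the distance stored per voxel is the nearest-atom distance from the voxel center, which is one of the options recited in the dependent claims.

```python
import numpy as np

def voxel_centers(center, grid_size=3, voxel_width=2.0):
    """Return (grid_size**3, 3) voxel-center coordinates of a cubic grid around `center`."""
    # Offsets are symmetric around zero, e.g. [-2, 0, 2] for grid_size=3, width=2.
    offsets = (np.arange(grid_size) - (grid_size - 1) / 2.0) * voxel_width
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return grid + np.asarray(center, dtype=float)

def distance_channels(atom_xyz, atom_aa, center, n_amino_acids=20,
                      grid_size=3, voxel_width=2.0, max_dist=10.0):
    """Build one distance channel per amino acid type (hypothetical layout).

    atom_xyz: (N, 3) atom coordinates; atom_aa: (N,) amino acid index of each atom.
    Returns an array of shape (n_amino_acids, grid_size, grid_size, grid_size).
    """
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    atom_aa = np.asarray(atom_aa)
    centers = voxel_centers(center, grid_size, voxel_width)
    shape = (grid_size, grid_size, grid_size)
    # Amino acids with no atoms keep the capped "far away" value (an assumption).
    channels = np.full((n_amino_acids,) + shape, max_dist)
    for aa in range(n_amino_acids):
        pts = atom_xyz[atom_aa == aa]
        if len(pts) == 0:
            continue
        # Pairwise voxel-to-atom distances, then the nearest atom per voxel.
        d = np.linalg.norm(centers[:, None, :] - pts[None, :, :], axis=-1)
        channels[aa] = np.minimum(d.min(axis=1), max_dist).reshape(shape)
    return channels
```

Each returned channel holds one distance value per voxel, so stacking the channels gives the distance portion of the tensor that the later claim elements extend.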

2. The system of claim 1, wherein the voxelizer centers the three-dimensional grid of voxels on an alpha-carbon atom of respective residues of reference amino acids in the reference amino acid sequence.

3. The system of claim 2, wherein the voxelizer centers the three-dimensional grid of voxels on an alpha-carbon atom of a residue of a particular reference amino acid positioned at the variant amino acid.

4. The system of claim 3, further configured to encode, in the tensor, a directionality of the reference amino acids in the reference amino acid sequence and a position of the particular reference amino acid by multiplying, with a directionality parameter, three-dimensional distance values for those reference amino acids that precede the particular reference amino acid.
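The directionality encoding of claim 4 can be sketched as an element-wise multiplication. The sketch below assumes, hypothetically, that for every stored distance value the sequence index of the residue owning the nearest atom is also available, and that the directionality parameter is -1.0; neither detail is fixed by the claim.

```python
import numpy as np

def apply_directionality(distances, nearest_residue_index, variant_position,
                         directionality=-1.0):
    """Scale distances whose nearest residue precedes the variant position.

    distances and nearest_residue_index must have the same shape.
    """
    distances = np.asarray(distances, dtype=float)
    precedes = np.asarray(nearest_residue_index) < variant_position
    # Values for preceding residues are multiplied; all others are unchanged.
    return np.where(precedes, distances * directionality, distances)
```

With a negative parameter, the sign of a value indicates whether the contributing residue lies before the variant position in the sequence, while the magnitude still carries the distance.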

5. The system of any of claims 1-4, wherein the distances from corresponding voxels to atoms are nearest-atom distances from corresponding voxel centers in the three-dimensional grid of voxels to nearest atoms of the corresponding reference amino acids.

6. The system of any of claims 2-5, wherein the reference amino acids have alpha-carbon atoms, wherein the distances from corresponding voxels to the atoms are nearest-alpha-carbon atom distances from corresponding voxel centers to nearest alpha-carbon atoms of the corresponding reference amino acids.

7. The system of any of claims 1-6, wherein the reference amino acids have beta-carbon atoms, wherein the distances from corresponding voxels to atoms are nearest-beta-carbon atom distances from corresponding voxel centers to nearest beta-carbon atoms of the corresponding reference amino acids.

8. The system of any of claims 1-6, wherein the reference amino acids have backbone atoms, wherein the distances from corresponding voxels to atoms are nearest-backbone atom distances from corresponding voxel centers to nearest backbone atoms of the corresponding reference amino acids.

9. The system of any of claims 1-6, wherein the reference amino acids have sidechain atoms, wherein the distances from corresponding voxels to atoms are nearest-sidechain atom distances from corresponding voxel centers to nearest sidechain atoms of the corresponding reference amino acids.

10. The system of any of claims 1-9, further configured to encode, in the tensor, a nearest atom channel that specifies a distance from each voxel to a nearest atom, wherein the nearest atom is selected irrespective of an amino acid to which the nearest atom belongs and atomic elements of the amino acid.
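The nearest atom channel of claim 10 differs from the amino acid-wise channels in that no filtering by amino acid or element is applied. A minimal sketch, assuming voxel centers and atom coordinates are given as plain coordinate arrays (the function name is hypothetical):

```python
import numpy as np

def nearest_atom_channel(voxel_xyz, atom_xyz):
    """Distance from each voxel center to its nearest atom of any kind.

    voxel_xyz: (V, 3) voxel centers; atom_xyz: (N, 3) atom coordinates.
    Returns a (V,) array that can be reshaped to the grid dimensions.
    """
    voxel_xyz = np.asarray(voxel_xyz, dtype=float)
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    # Pairwise distances over all atoms, with no amino acid or element mask.
    d = np.linalg.norm(voxel_xyz[:, None, :] - atom_xyz[None, :, :], axis=-1)
    return d.min(axis=1)
```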

11. The system of any of claims 1-10, further comprising a reference allele encoder that voxel-wise encodes a reference allele amino acid to each voxel in the three-dimensional grid of voxels.

12. The system of claim 11, wherein the reference amino acid is a three-dimensional representation of a one-hot encoding of a reference amino acid that experiences the variant amino acid.
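The allele encodings of claims 1, 11, and 12 describe a one-hot vector repeated at every voxel. The sketch below shows one way to build such channels; the alphabetical ordering of the 20 amino acid letters and the grid size are assumptions, not details taken from the claims.

```python
import numpy as np

# Hypothetical channel order: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def allele_channels(amino_acid, grid_size=3):
    """Return (20, g, g, g) channels holding the same one-hot vector at every voxel."""
    one_hot = np.zeros(len(AMINO_ACIDS))
    one_hot[AMINO_ACIDS.index(amino_acid)] = 1.0
    shape = (len(AMINO_ACIDS), grid_size, grid_size, grid_size)
    # Broadcast the vector across the three spatial axes, then materialize a copy.
    return np.broadcast_to(one_hot[:, None, None, None], shape).copy()
```

The same function could serve for both the reference allele and the alternative allele, for example `allele_channels("A")` and `allele_channels("V")` for an alanine-to-valine substitution.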

13. The system of any of claims 1-12, wherein the amino acid-specific conservation frequencies specify conservation levels of respective amino acids across the plurality of species.

14. The system of any of claims 1-13, further comprising an annotations encoder that voxel-wise encodes one or more annotation channels to each voxel in the three-dimensional grid of voxels, and wherein the one or more annotation channels are three-dimensional representations of a one-hot encoding of residue annotations.

15. The system of any of claims 1-14, further comprising a structure confidence encoder that voxel-wise encodes one or more structure confidence channels to each voxel in the three-dimensional grid of voxels, and wherein the one or more structure confidence channels are three-dimensional representations of confidence scores that specify quality of respective residue structures.

16. A computer-implemented method comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein, and fitting a three-dimensional grid of voxels on atoms in the three-dimensional structure on an amino acid-basis to generate amino acid-wise distance channels, wherein each of the amino acid-wise distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding an alternative allele channel to each voxel in the three-dimensional grid of voxels, wherein the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding an evolutionary conservation channel to each sequence of three-dimensional distance values across the amino acid-wise distance channels on a voxel position-basis, wherein the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and wherein the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the amino acid-wise distance channels encoded with the alternative allele channel and respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.

17. The computer-implemented method of claim 16, further comprising: selecting a nearest atom to the corresponding voxel across the reference amino acids and atom categories, selecting pan-amino acid conservation frequencies for a residue of a reference amino acid that includes the nearest atom, and using a three-dimensional representation of the pan-amino acid conservation frequencies as the evolutionary conservation channel.

18. The computer-implemented method of claim 17, wherein the pan-amino acid conservation frequencies are configured for a particular position of the residue as observed in the plurality of species.

19. The computer-implemented method of any of claims 16, further comprising: selecting respective nearest atoms to the corresponding voxel in respective reference amino acids, selecting respective per-amino acid conservation frequencies for respective residues of the respective reference amino acids that include the respective nearest atoms, and using a three-dimensional representation of the respective per-amino acid conservation frequencies as the evolutionary conservation channel.
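Claims 17 and 19 describe two ways of choosing conservation frequencies per voxel. The sketch below is one possible reading, not the applicant's implementation: it assumes a hypothetical `(L, 20)` frequency matrix `freqs` with one row per residue position, an array `atom_residue` mapping each atom to its residue index, and an array `atom_aa` giving the amino acid index of each atom.

```python
import numpy as np

def _pairwise(voxel_xyz, atom_xyz):
    """(V, N) distances between voxel centers and atoms."""
    return np.linalg.norm(voxel_xyz[:, None, :] - atom_xyz[None, :, :], axis=-1)

def pan_amino_acid_conservation(voxel_xyz, atom_xyz, atom_residue, freqs):
    """Claim 17 reading: take all 20 frequencies of the residue owning the nearest atom."""
    nearest = _pairwise(voxel_xyz, atom_xyz).argmin(axis=1)
    return freqs[atom_residue[nearest]]  # shape (V, 20)

def per_amino_acid_conservation(voxel_xyz, atom_xyz, atom_residue, atom_aa, freqs):
    """Claim 19 reading: for each amino acid, use the residue owning its own nearest atom."""
    n_aa = freqs.shape[1]
    out = np.zeros((len(voxel_xyz), n_aa))  # amino acids with no atoms stay at 0
    for aa in range(n_aa):
        idx = np.flatnonzero(atom_aa == aa)
        if idx.size == 0:
            continue
        # Nearest atom of this amino acid type for every voxel.
        nearest = idx[_pairwise(voxel_xyz, atom_xyz[idx]).argmin(axis=1)]
        # Frequency of this amino acid at the residue that owns that atom.
        out[:, aa] = freqs[atom_residue[nearest], aa]
    return out
```

Under this reading, the pan-amino acid variant yields one residue's full frequency profile per voxel, while the per-amino acid variant can draw each of the 20 values from a different residue.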

20. A non-transitory computer readable medium storing instructions that, when executed by at least a processor, cause a system to perform actions comprising:
accessing a three-dimensional structure of a reference amino acid sequence of a protein, and fitting a three-dimensional grid of voxels on atoms in the three-dimensional structure on an amino acid-basis to generate amino acid-wise distance channels, wherein each of the amino acid-wise distance channels has a three-dimensional distance value for each voxel in the three-dimensional grid of voxels, and wherein the three-dimensional distance value specifies a distance from a corresponding voxel in the three-dimensional grid of voxels to atoms of a corresponding reference amino acid in the reference amino acid sequence;
encoding an alternative allele channel to each voxel in the three-dimensional grid of voxels, wherein the alternative allele channel is a three-dimensional representation of a one-hot encoding of a variant amino acid expressed by a variant nucleotide;
encoding an evolutionary conservation channel to each sequence of three-dimensional distance values across the amino acid-wise distance channels on a voxel position-basis, wherein the evolutionary conservation channel is a three-dimensional representation of amino acid-specific conservation frequencies across a plurality of species, and wherein the amino acid-specific conservation frequencies are selected in dependence upon amino acid proximity to the corresponding voxel;
applying three-dimensional convolutions to a tensor that includes the amino acid-wise distance channels encoded with the alternative allele channel and respective evolutionary conservation channels; and
determining a pathogenicity of the variant nucleotide based at least in part on the tensor.
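The final two steps recited in claims 1, 16, and 20, applying three-dimensional convolutions to the tensor and deriving a pathogenicity from it, can be sketched in plain NumPy. The claims do not specify an architecture, so everything here is an assumption for illustration: a single valid-padding convolution layer, a ReLU, global average pooling, a linear output, and a sigmoid that maps the result to a score between 0 and 1. The function and parameter names are hypothetical.

```python
import numpy as np

def conv3d_valid(x, w, b):
    """Valid-padding 3D convolution (cross-correlation form, as in most CNN libraries).

    x: (C_in, D, H, W); w: (C_out, C_in, k, k, k); b: (C_out,).
    Returns (C_out, D-k+1, H-k+1, W-k+1).
    """
    k = w.shape[2]
    # windows: (C_in, D-k+1, H-k+1, W-k+1, k, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, (k, k, k), axis=(1, 2, 3))
    out = np.einsum("cdhwijk,ocijk->odhw", windows, w)
    return out + b[:, None, None, None]

def pathogenicity_score(tensor, w, b, w_out, b_out):
    """Toy forward pass: conv -> ReLU -> global average pool -> linear -> sigmoid."""
    hidden = np.maximum(conv3d_valid(tensor, w, b), 0.0)
    pooled = hidden.mean(axis=(1, 2, 3))   # one value per output channel
    logit = pooled @ w_out + b_out         # scalar
    return 1.0 / (1.0 + np.exp(-logit))
```

In such a setup the input `tensor` would be the channel-wise concatenation of the distance, allele, and conservation channels, and the weights would come from training rather than being set by hand.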