JP7411126B2 - Multitemporal CT image classification system and construction method based on spatiotemporal attention model

Info

Publication number: JP7411126B2
Application number: JP2023007862A
Authority: JP
Inventors: 聞▲タオ▼ 朱; 元鋒呉; 梦凡薛; 浩東江
Original assignee: 之江実験室
Priority date: 2022-06-15
Filing date: 2023-01-23
Publication date: 2024-01-10
Anticipated expiration: 2043-01-23
Also published as: CN114758032B; CN114758032A; JP2023183367A

Description

本発明は、医用画像処理技術分野に関し、特に、時空間的アテンションモデルに基づく多時相ＣＴ画像分類システム及び構築方法に関する。 The present invention relates to the field of medical image processing technology, and particularly to a multi-temporal CT image classification system and construction method based on a spatio-temporal attention model.

ＣＴ（ＣｏｍｐｕｔｅｄＴｏｍｏｇｒａｐｈｙ）は、コンピュータ断層撮影であり、精密に平行化されたＸ線ビーム、γ線、超音波などを用いて、非常に高感度な検出器と共に人体のある部位を次々に断面的にスキャンするものであり、スキャン時間が短く、画像が鮮明であるなどの特徴を有し、治療方法の改良につれて、ＣＴ画像スキャンは様々な腫瘍（例えば、肝がん）の診断で普及しており、腫瘍の部位、大きさ及び範囲を素早く見つけることができ、病変部の壊死、出血などの変化の有無を直接観察でき、腫瘍の転移の有無も検出できるため、腫瘍の検出率を高めている。 CT (computed tomography) scans a part of the human body slice by slice using precisely collimated X-ray beams, gamma rays, ultrasound waves, or the like together with highly sensitive detectors. It offers short scan times and clear images. As treatment methods have improved, CT scanning has become widely used in diagnosing various tumors (for example, liver cancer): it can quickly locate a tumor and determine its size and extent, directly reveal changes such as necrosis or bleeding in the lesion, and detect whether the tumor has metastasized, all of which raises the tumor detection rate.

単純ＣＴスキャンは病変を素早く見つけ、さらには一部の疾患を検出できるが、血管奇形、早期がん、転移性腫瘍などの一部の病変は単純ＣＴスキャンでは診断できない。病変の表示率を高め、病巣の範囲と臨床病期を決定するために、造影ＣＴスキャン（Ｃｏｎｔｒａｓｔ‐ＥｎｈａｎｃｅｄＣＴ、ＣＥＣＴ）が必要となる。例えば脳ＣＴ検査の場合、単純ＣＴ診断の正確率は８２％であり、造影ＣＴスキャンの正確率は９２～９５％に上がっていることから、造影ＣＴは診断率向上に役立つことが分かる。造影ＣＴスキャンでは一般的に造影剤を静脈注射し、現在一般的な静脈注射方法には２つがあり、１つは人力手押し注射であり、もう１つは高圧注射器を用いた注射である。造影剤を注射すると、造影ＣＴは単純ＣＴより多くの情報を提供することができ、動脈相、門脈相、遅発相の血流を観察できるため、診断には非常に役立つ。異なるサブタイプの腫瘍の治療方法もそれぞれ異なり、現在、多時相造影ＣＴは腫瘍のサブタイプの術前診断で重要なツールとなっている。 Although plain CT scans can quickly locate lesions and even detect some diseases, some lesions, such as vascular malformations, early cancers, and metastatic tumors, cannot be diagnosed with plain CT scans. Contrast-enhanced CT (CECT) is required to enhance the visualization of the lesion and determine its extent and clinical stage. For example, in the case of brain CT examination, the accuracy rate of simple CT diagnosis is 82%, and the accuracy rate of contrast-enhanced CT scan has increased to 92-95%, which shows that contrast-enhanced CT is useful for improving the diagnostic rate. In contrast-enhanced CT scans, a contrast agent is generally injected intravenously, and there are currently two common intravenous injection methods: one is manual injection, and the other is injection using a high-pressure syringe. When a contrast agent is injected, contrast-enhanced CT can provide more information than plain CT and can observe blood flow in the arterial phase, portal venous phase, and delayed phase, which is very useful for diagnosis. Treatment methods for different subtypes of tumors are different, and currently, multi-phase contrast-enhanced CT is an important tool for preoperative diagnosis of tumor subtypes.

ディープラーニングの医用画像処理への応用も１つの大きな方向性であり、機械学習にそれを導入することで、その本来の目標である人工知能に一層近づけ、サンプルデータの内在する法則及び表現レベルを学習し、これらの学習プロセスで得られる情報はテキスト、画像、音声などのデータの解釈に大いに役立つ。その最終的な目標は、機械が人間のように分析・学習能力を有し、テキスト、画像、音声などのデータを認識できるようにすることである。ディープラーニングは複雑な機械学習アルゴリズムであり、音声・画像認識で収めている効果は、これまでの関連技術をはるかに超えており、検索技術、データマイニング、機械学習、機械翻訳、自然言語処理、マルチメディア学習、音声、推薦、個別化技術、及び他の関連分野でいずれも多くの成果を上げている。ディープラーニングは、機械が視聴覚や思考などの人間の活動を模倣するようにし、多くの複雑なパターン認識の難題を解決しており、人工知能の関連技術に大きな進歩をもたらしている。ディープラーニングの発展につれて、畳み込みニューラルネットワークの更新は繰り返され、画像認識の分野でますます多くの応用が実現され、あまりに人力の介入を必要とせず、画像特徴を自動的に抽出でき、学習能力が高いなどの利点が、特にがんの分類や病変の検出などの医用画像分析タスクで競争力の高い性能を示している。 Applying deep learning to medical image processing is also a major research direction. Introducing deep learning into machine learning brings it closer to its original goal of artificial intelligence: it learns the inherent regularities and levels of representation of sample data, and the information obtained in this learning process greatly helps in interpreting data such as text, images, and audio. The ultimate goal is for machines to analyze and learn as humans do and to recognize such data. Deep learning is a complex machine learning algorithm whose results in speech and image recognition far exceed those of earlier techniques, and it has achieved much in search, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation, personalization, and other related fields. It enables machines to imitate human activities such as seeing, hearing, and thinking, has solved many difficult pattern recognition problems, and has greatly advanced artificial intelligence. As deep learning has developed, convolutional neural networks have been repeatedly updated and applied ever more widely in image recognition. Their advantages, such as needing little human intervention, extracting image features automatically, and learning effectively, give them highly competitive performance, especially in medical image analysis tasks such as cancer classification and lesion detection.

しかしながら、悪性腫瘍の判定と診断は依然として困難であり、術前の誤診は治療方針の決定を誤らせる可能性があり、腫瘍イメージングレポートとデータシステムの複雑化に伴い、大規模な実践での実施は難しくなることで、仕事の効率を高めるため、計算による意思決定支援ツールの臨床的ニーズの拡大が必要となり、従来の畳み込みニューラルネットワークはＣＴ画像の局所特徴抽出に一定の優位性があり、病巣の状況を素早く確認することができるが、造影ＣＴの複数の時相（ｐｈａｓｅ）の画像を利用できないため、時間的な情報のつながりが弱まり、情報の利用が不完全になり、最終的な診断結果に影響を与える。 However, determining and diagnosing malignant tumors remains difficult, and a preoperative misdiagnosis can lead to the wrong treatment decision. As tumor imaging reporting and data systems grow more complex, they become harder to apply in large-scale practice, so the clinical need for computational decision support tools that improve working efficiency is growing. Conventional convolutional neural networks have some advantage in extracting local features from CT images and can quickly assess a lesion, but they cannot use the images of the multiple phases of contrast-enhanced CT. The temporal links between phases are therefore weakened, the available information is used incompletely, and the final diagnosis is affected.

中国特許出願ＣＮ１１０４４３２６８Ａはディープラーニングに基づく肝がんＣＴ画像を良性・悪性に分類する方法を開示しており、当該方法は、既存のＲｅｓｎｅｔ３４ネットワークモデルを基に設計・改良を行い、患者の肝臓情報の最大のスライスを選んで、データ処理と強調により、モデルに入れて分類を行う。しかしながら、ＣＴ画像は３Ｄであるため、当該方法で抽出された空間的特徴は不完全であり、しかも多時相ＣＴ画像の状況が考慮されていないため、患者の複数の時相の病変を効果的に組み合わせて処理することができず、診断結果の正確さと精度は低下している。 Chinese patent application CN110443268A discloses a deep-learning-based method for classifying liver cancer CT images as benign or malignant. The method is designed as an improvement on the existing Resnet34 network model: it selects the slice containing the most liver information for a patient, applies data processing and augmentation, and feeds the result to the model for classification. However, because CT images are 3D, the spatial features extracted by this method are incomplete. The method also does not consider multi-temporal CT images, so it cannot effectively combine and process a patient's lesions across multiple phases, which reduces the accuracy and precision of the diagnosis.

そこで、上記の課題に対しては、多時相のＣＴを組み合わせて処理して、分類の精度と速度を高められる方法が必要である。既存の医用画像処理方法及びディープラーニングの発展の現状を踏まえ、アテンションメカニズム及びｔｒａｎｓｆｏｒｍｅｒによって構成されたエンコーダの使用が考えられ、そのうちアテンションメカニズムは単純スキャン相ＣＴ画像と造影ＣＴ画像の時間的なつながりを強化することができ、ｔｒａｎｓｆｏｒｍｅｒはもともと２０１７年に自然言語処理（ＮａｔｕｒａｌＬａｎｇｕａｇｅＰｒｏｃｅｓｓｉｎｇ、ＮＬＰ）分野で提案されたモデルであり、２０２０年に初めて視覚分野で用いられ、ＮＬＰ類似しており、画像をシリアル化し、画像分類タスクをうまく実行することができ、最終的な分類結果は最適な畳み込みニューラルネットワークにも負けず、また、必要な計算リソースは大幅に低減しており、分類の効率及び正確率を高めている。 To address the above problem, a method is therefore needed that processes multi-temporal CT images jointly and improves both the accuracy and the speed of classification. Given existing medical image processing methods and the current state of deep learning, an encoder built from an attention mechanism and a transformer is a natural candidate. The attention mechanism can strengthen the temporal link between plain-scan-phase CT images and contrast-enhanced CT images. The transformer was originally proposed in 2017 for natural language processing (NLP) and was first applied to vision in 2020. As in NLP, the image is serialized into a sequence, and the model performs image classification well: its final classification results are on par with the best convolutional neural networks while requiring substantially fewer computational resources, which improves classification efficiency and accuracy.

本発明は、通常のＣＴスキャン及び造影ＣＴスキャンの時に患者の病巣構造に大きな変化がないことを考慮して、時空間的アテンションモデルに基づく多時相ＣＴ画像分類システム及び構築方法を提案し、従来の畳み込みニューラルネットワークに基づいて、多時相ＣＴ画像を組み合わせて処理できないという問題を解決する。 Considering that the structure of a patient's lesion does not change significantly between a normal CT scan and a contrast-enhanced CT scan, the present invention proposes a multi-temporal CT image classification system and construction method based on a spatio-temporal attention model, which solves the problem that approaches based on conventional convolutional neural networks cannot jointly process multi-temporal CT images.

本発明では、まず専門の医用画像科医師が多時相ＣＴ画像をラベル付けし、次に画像の前処理を行って、病巣部分を分割し、画像サイズをモデルの入力に適合するように調整し、データを強調し、埋め込み層を構築し、入力は通常の単純スキャンＣＴ画像及び造影剤注射後の多時相造影ＣＴ画像であり、出力は通常の単純スキャンＣＴ画像及び造影剤注射後の多時相造影ＣＴ画像の埋め込みベクトルであり、空間的アテンションネットワークを構築し、当該ネットワークモデルの入力は上記のＣＴ画像の埋め込みベクトルであり、通常の単純スキャンＣＴ画像及び造影剤注射後の多時相ＣＴ画像の空間的特徴をそれぞれ出力することができ、さらに上記の空間的特徴を組み合わせて、時間的アテンションネットワークを構築し、当該ネットワークモデルの入力は組み合わせられた空間的特徴であり、時間的特徴と空間的特徴を組み合わせたベクトルを出力することができ、さらに分類層を通じて最終的な分類結果を出力し、最後にラベルと計算して損失を得て、トレーニングと最適化を繰り返すことで損失を最小にし、最適な分類モデルを得て、時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムとする。 In the present invention, a specialist radiologist first labels the multi-temporal CT images. The images are then preprocessed: the lesion region is segmented, the image size is adjusted to fit the model input, and the data are augmented. An embedding layer is constructed whose input is the plain-scan CT image and the multi-phase contrast-enhanced CT images acquired after contrast agent injection, and whose output is the embedding vector of each of these images. A spatial attention network is constructed that takes these embedding vectors as input and outputs the spatial features of the plain-scan CT image and of each contrast-enhanced CT image. These spatial features are then combined, and a temporal attention network is constructed that takes the combined spatial features as input and outputs a vector combining temporal and spatial features. A classification layer outputs the final classification result, a loss is computed against the label, and training and optimization are repeated to minimize the loss. The resulting optimal classification model serves as the multi-temporal CT image classification system based on the spatio-temporal attention model.

本発明で採用される技術的解決手段は、具体的に次のとおりである。
時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムであって、
分類される患者のｓ個の時相のＣＴ画像を取得するためのデータ取得ユニットと、
ｓ個の第１埋め込み層ネットワークを含む第１埋め込み層ネットワークユニットであって、第１埋め込み層ネットワークは、それぞれ、各時相のＣＴ画像を複数の画像ブロックに分割し、各画像ブロックをそれぞれ画像ブロックベクトルに展開し、全ての画像ブロックベクトルとクラスラベルベクトルを合わせた後に同次元の位置ベクトルと加算して対応する時相のＣＴ画像の埋め込みベクトルを得る、第１埋め込み層ネットワークユニットと、
ｓ個の空間的アテンションネットワークを含む空間的アテンションユニットであって、各空間的アテンションネットワークは、Ｌ１層の第１マルチヘッドアテンションネットワークＭＳＡと、Ｌ１層の第１多層パーセプトロンと、１層の第１正規化層とを含み、Ｌ１層の第１マルチヘッドアテンションネットワークＭＳＡとＬ１層の第１多層パーセプトロンは順にインターリーブ接続され、前記第１マルチヘッドアテンションネットワークＭＳＡは、複数の自己アテンションモジュールＳＡと、１つのスプライシング層とを含み、自己アテンションモジュールＳＡは、正規化された入力ベクトルをクエリ行列Ｑ_１ｉ、キーワード行列Ｋ_１ｉ及び値行列Ｖ_１ｉの３つの異なる行列に変換し、クエリ行列Ｑ_１ｉ、キーワード行列Ｋ_１ｉ及び値行列Ｖ_１ｉの３つの異なる行列に基づいて入力ベクトル中の各ベクトル間のアテンション関数を生成するために用いられ、ｉ＝１，２…は、空間的アテンションユニット中のｉ番目の自己アテンションモジュールＳＡを表し、スプライシング層は、各自己アテンションモジュールＳＡの出力するアテンション関数をスプライシングして最終的な空間的アテンション関数を得るために用いられ、最終的な空間的アテンション関数と入力ベクトルを加算するものを次の層の第１多層パーセプトロンに対応する入力ベクトルとし（当該ネットワークは、マルチヘッドアテンションモジュールにより異なるベクトル間のつながりを互いに比較し、重要な部分を強化することができる）、前記第１多層パーセプトロンは正規化された入力ベクトルを符号化してその入力ベクトルと加算し、加算結果を次の層の第１マルチヘッドアテンションネットワークＭＳＡに対応する入力とし、１層目の第１マルチヘッドアテンションネットワークＭＳＡの入力ベクトルは埋め込みベクトルであり、第１正規化層は、最終層の第１多層パーセプトロンの出力するベクトルとその入力ベクトルを加算して得られたベクトルの第１次元ベクトルを正規化し、対応する時相のＣＴ画像の空間的特徴とする、空間的アテンションユニットと、
１つの第２埋め込み層ネットワークを含む第２埋め込み層ネットワークユニットであって、ｓ個の空間的アテンションネットワークの出力するｓ個の対応する時相のＣＴ画像の空間的特徴を合わせた後にクラスラベルベクトルと合わせて埋め込み層ベクトルを得るための第２埋め込み層ネットワークユニット、
１つの時間的アテンションネットワークを含む時間的アテンションユニットであって、時間的アテンションネットワークは、Ｌ２層の第２マルチヘッドアテンションネットワークＭＳＡと、Ｌ２層の第２多層パーセプトロンと、１層の第２正規化層とを含み、Ｌ２層の第２マルチヘッドアテンションネットワークＭＳＡとＬ２層の第２多層パーセプトロンは順にインターリーブ接続され、前記第２マルチヘッドアテンションネットワークＭＳＡは、複数の自己アテンションモジュールＳＡと、１つのスプライシング層とを含み、自己アテンションモジュールＳＡは、正規化された入力ベクトルをクエリ行列Ｑ_２ｊ、キーワード行列Ｋ_２ｊ及び値行列Ｖ_２ｊの３つの異なる行列に変換し、クエリ行列Ｑ_２ｊ、キーワード行列Ｋ_２ｊ及び値行列Ｖ_２ｊの３つの異なる行列に基づいて入力ベクトル中の各ベクトル間のアテンション関数を生成するために用いられ、スプライシング層は、各自己アテンションモジュールＳＡの出力するアテンション関数をスプライシングして最終的な時間的アテンション関数を得るために用いられ、ｊ＝１，２…は、時間的アテンションユニット中のｊ番目の自己アテンションモジュールＳＡを表し、最終的な時間的アテンション関数と入力ベクトルを加算するものを次の層の第２多層パーセプトロンに対応する入力ベクトルとし、前記第２多層パーセプトロンは正規化された入力ベクトルを符号化してその入力ベクトルと加算し、加算結果を次の層の第２マルチヘッドアテンションネットワークＭＳＡに対応する入力とし、１層目の第２マルチヘッドアテンションネットワークＭＳＡの入力ベクトルは第２埋め込み層ネットワークユニットの出力する埋め込み層ベクトルであり、第２正規化層は、最終層の第２多層パーセプトロンの出力するベクトルとその入力ベクトルを加算して得られたベクトルの第１次元ベクトルを正規化し、空間的特徴及び時間的特徴を有するベクトルを得る、時間的アテンションユニットと、
分類層を含む分類層ユニットであって、空間的特徴及び時間的特徴を有するベクトルに基づいて分類結果を得るための分類層ユニットと、を含む。 The technical solution adopted in the present invention is specifically as follows.
A multi-temporal CT image classification system based on a spatio-temporal attention model, comprising:
a data acquisition unit for acquiring CT images of s phases of a patient to be classified;
a first embedding layer network unit including s first embedding layer networks, each of which divides the CT image of one phase into a plurality of image blocks, flattens each image block into an image block vector, concatenates all image block vectors with a class label vector, and adds a position vector of the same dimension to obtain the embedding vector of the CT image of the corresponding phase;
a spatial attention unit including s spatial attention networks, each spatial attention network including L1 first multi-head attention networks MSA, L1 first multilayer perceptrons, and one first normalization layer, the L1 first multi-head attention networks MSA and the L1 first multilayer perceptrons being connected alternately in sequence, wherein each first multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing layer; each self-attention module SA transforms the normalized input vector into three different matrices, a query matrix Q_1i, a keyword matrix K_1i, and a value matrix V_1i, and generates from these three matrices an attention function between the vectors in the input, where i = 1, 2, … denotes the i-th self-attention module SA in the spatial attention unit; the splicing layer concatenates the attention functions output by the self-attention modules SA to obtain the final spatial attention function, and the sum of the final spatial attention function and the input vector is the input vector of the following first multilayer perceptron (through the multi-head attention module, the network can compare the relationships between different vectors and strengthen the important parts); each first multilayer perceptron encodes its normalized input vector and adds the result to its input vector, and the sum is the input of the following first multi-head attention network MSA; the input vector of the first multi-head attention network MSA of the first layer is the embedding vector; and the first normalization layer normalizes the first-dimension vector of the vector obtained by adding the output of the last first multilayer perceptron to its input vector, yielding the spatial feature of the CT image of the corresponding phase;
a second embedding layer network unit including one second embedding layer network, which concatenates the spatial features of the s phase CT images output by the s spatial attention networks and then concatenates the result with a class label vector to obtain an embedding layer vector;
a temporal attention unit including one temporal attention network, the temporal attention network including L2 second multi-head attention networks MSA, L2 second multilayer perceptrons, and one second normalization layer, the L2 second multi-head attention networks MSA and the L2 second multilayer perceptrons being connected alternately in sequence, wherein each second multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing layer; each self-attention module SA transforms the normalized input vector into three different matrices, a query matrix Q_2j, a keyword matrix K_2j, and a value matrix V_2j, and generates from these three matrices an attention function between the vectors in the input, where j = 1, 2, … denotes the j-th self-attention module SA in the temporal attention unit; the splicing layer concatenates the attention functions output by the self-attention modules SA to obtain the final temporal attention function, and the sum of the final temporal attention function and the input vector is the input vector of the following second multilayer perceptron; each second multilayer perceptron encodes its normalized input vector and adds the result to its input vector, and the sum is the input of the following second multi-head attention network MSA; the input vector of the second multi-head attention network MSA of the first layer is the embedding layer vector output by the second embedding layer network unit; and the second normalization layer normalizes the first-dimension vector of the vector obtained by adding the output of the last second multilayer perceptron to its input vector, yielding a vector having spatial and temporal features; and
a classification layer unit including a classification layer, which obtains a classification result from the vector having spatial and temporal features.
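The order in which the units above hand data to one another can be sketched in a few lines of Python. This is only an illustration of the data flow, not the patented implementation: the function name `classify_phases` and every stand-in callable below are hypothetical placeholders for the embedding, spatial attention, temporal attention, and classification units.

```python
def classify_phases(phase_images, embedders, spatial_nets, temporal_net,
                    x_class, classifier):
    """Route the s phase images through the units in the order described."""
    # One first embedding network and one spatial attention network per phase.
    spatial_features = [net(embed(image))
                        for image, embed, net
                        in zip(phase_images, embedders, spatial_nets)]
    # Second embedding layer: class label vector joined with the s features.
    tokens = [x_class] + spatial_features
    # The temporal attention network fuses the phases into a single vector.
    fused = temporal_net(tokens)
    return classifier(fused)


# Toy stand-ins (hypothetical): each "image" is already a feature vector.
def identity(x):
    return x


def mean_tokens(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def argmax(vec):
    return max(range(len(vec)), key=vec.__getitem__)


s = 3
images = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
label = classify_phases(images, [identity] * s, [identity] * s,
                        mean_tokens, [0.0, 0.0], argmax)
```

Passing the units in as callables keeps the sketch independent of how each network is actually built; only the wiring between them is shown.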

さらに、ｓは２以上であり、ｓ個の時相のＣＴ画像は、具体的に、単純スキャン相ＣＴ画像と、動脈相ＣＴ画像と、門脈相ＣＴ画像と、遅発相ＣＴ画像との少なくとも２つを含む。 Furthermore, s is 2 or more, and the CT images of the s phases specifically include at least two of a plain-scan-phase CT image, an arterial-phase CT image, a portal-venous-phase CT image, and a delayed-phase CT image.

さらに、前記埋め込みベクトルは、具体的に、
$$X_0 = [X_{class};\ X_p^1;\ X_p^2;\ \ldots;\ X_p^N] + X_{pos}$$
であり、
ただし、Ｘ_{ｃｌａｓｓ}はクラスラベルベクトルを表し、Ｘ_ｐｏｓは位置ベクトルを表し、Ｘ_ｐは線形化後の画像ブロックベクトルを表し、Ｎは分割後の画像ブロックの数を表す。 Furthermore, the embedding vector specifically includes:
$$X_0 = [X_{class};\ X_p^1;\ X_p^2;\ \ldots;\ X_p^N] + X_{pos},$$
where X_class denotes the class label vector, X_pos denotes the position vector, X_p denotes an image block vector after linearization, and N denotes the number of image blocks after division.

さらに、クエリ行列Ｑ_１ｉ、キーワード行列Ｋ_１ｉ及び値行列Ｖ_１ｉの３つの異なる行列に基づいて入力ベクトル中の各ベクトル間のアテンション関数を生成することは、具体的に、
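$$\mathrm{Attention}(Q_{1i}, K_{1i}, V_{1i}) = \mathrm{softmax}\!\left(\frac{Q_{1i} K_{1i}^{T}}{\sqrt{d_k}}\right) V_{1i}$$
であり、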
ただし、ｄ_ｋはキーワード行列Ｋ_１ｉ中の各キーワードベクトルｋの次元を表し、ｓｏｆｔｍａｘ（）はｓｏｆｔｍａｘ関数である。 Furthermore, generating the attention function between each vector in the input vectors based on three different matrices, query matrix Q _1i , keyword matrix K _1i and value matrix V _1i specifically,
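$$\mathrm{Attention}(Q_{1i}, K_{1i}, V_{1i}) = \mathrm{softmax}\!\left(\frac{Q_{1i} K_{1i}^{T}}{\sqrt{d_k}}\right) V_{1i},$$
where d_k denotes the dimension of each keyword vector k in the keyword matrix K_1i, and softmax() is the softmax function.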

同様に、クエリ行列Ｑ_２ｊ、キーワード行列Ｋ_２ｊ及び値行列Ｖ_２ｊの３つの異なる行列に基づいて入力ベクトル中の各ベクトル間のアテンション関数を生成することは、具体的に、
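$$\mathrm{Attention}(Q_{2j}, K_{2j}, V_{2j}) = \mathrm{softmax}\!\left(\frac{Q_{2j} K_{2j}^{T}}{\sqrt{d_k}}\right) V_{2j}$$
であり、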
ただし、ｄ_ｋはキーワード行列Ｋ_２ｊ中の各キーワードベクトルｋの次元を表し、ｓｏｆｔｍａｘ（）はｓｏｆｔｍａｘ関数である。 Similarly, generating the attention function between each vector in the input vectors based on three different matrices: query matrix Q _2j , keyword matrix K _2j and value matrix V _2j specifically,
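$$\mathrm{Attention}(Q_{2j}, K_{2j}, V_{2j}) = \mathrm{softmax}\!\left(\frac{Q_{2j} K_{2j}^{T}}{\sqrt{d_k}}\right) V_{2j},$$
where d_k denotes the dimension of each keyword vector k in the keyword matrix K_2j, and softmax() is the softmax function.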
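The attention function described for both the spatial and the temporal unit matches the standard scaled dot-product form, softmax(Q Kᵀ / √d_k) V. The following is a minimal sketch in plain Python, assuming that form; the function names and the toy matrices are illustrative and do not come from the patent.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])  # dimension of each keyword (key) vector
    out = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # each weight lies in (0, 1), summing to 1
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# Toy input: 3 tokens with d_k = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
head = attention(Q, K, V)
```

Because the softmax weights are positive and sum to one, every output row is a convex combination of the value vectors, so each output stays within the range spanned by the rows of V.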

さらに、前記第１マルチヘッドアテンションネットワークＭＳＡ、第２マルチヘッドアテンションネットワークＭＳＡの入力ベクトルは、
$$x_l = \mathrm{MLP}(\mathrm{LN}(x'_{l-1})) + x'_{l-1}$$
であり、
ＬＮは正規化方法を表し、ｘ_ｌは第１マルチヘッドアテンションネットワークＭＳＡ又は第２マルチヘッドアテンションネットワークＭＳＡの入力ベクトルを表し、ＭＬＰ（）は対応する第１多層パーセプトロン又は第２多層パーセプトロンの出力を表し、ｘ’_ｌ－１はｌ－１層目の第１多層パーセプトロン又は第２多層パーセプトロンの入力ベクトルを表す。 Furthermore, the input vectors of the first multi-head attention network MSA and the second multi-head attention network MSA are:
$$x_l = \mathrm{MLP}(\mathrm{LN}(x'_{l-1})) + x'_{l-1},$$
where LN denotes the normalization method, x_l denotes the input vector of the first or second multi-head attention network MSA, MLP() denotes the output of the corresponding first or second multilayer perceptron, and x'_{l-1} denotes the input vector of the first or second multilayer perceptron of layer l-1.

さらに、前記第１多層パーセプトロン、第２多層パーセプトロンの入力ベクトルは、
$$x'_l = \mathrm{MSA}(\mathrm{LN}(x_l)) + x_l$$
であり、
ＬＮは正規化方法を表し、ｘ’_ｌは第１多層パーセプトロン又は第２多層パーセプトロンの入力ベクトルを表し、ＭＳＡ（）は対応する第１マルチヘッドアテンションネットワークＭＳＡ又は第２マルチヘッドアテンションネットワークＭＳＡの出力を表し、ｘ_ｌはｌ層目の第１マルチヘッドアテンションネットワークＭＳＡ又は第２マルチヘッドアテンションネットワークＭＳＡの入力ベクトルを表す。 Furthermore, the input vectors of the first multilayer perceptron and the second multilayer perceptron are as follows:
$$x'_l = \mathrm{MSA}(\mathrm{LN}(x_l)) + x_l,$$
where LN denotes the normalization method, x'_l denotes the input vector of the first or second multilayer perceptron, MSA() denotes the output of the corresponding first or second multi-head attention network MSA, and x_l denotes the input vector of the first or second multi-head attention network MSA of layer l.
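The two residual updates described above, x'_l = MSA(LN(x_l)) + x_l followed by x_{l+1} = MLP(LN(x'_l)) + x'_l, can be sketched as one encoder layer. This is a hedged illustration in plain Python: the attention and perceptron sublayers are passed in as hypothetical callables, and the layer normalization omits the learned scale and shift that a real network would normally include.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def encoder_layer(tokens, msa, mlp):
    """One layer: x' = MSA(LN(x)) + x, then x_next = MLP(LN(x')) + x'."""
    attn_out = msa([layer_norm(t) for t in tokens])
    x_prime = [[a + b for a, b in zip(ao, t)]
               for ao, t in zip(attn_out, tokens)]
    mlp_out = [mlp(layer_norm(t)) for t in x_prime]
    return [[a + b for a, b in zip(mo, t)] for mo, t in zip(mlp_out, x_prime)]


def encoder(tokens, layers):
    """Stack the layers, then normalize the first (class label) token."""
    for msa, mlp in layers:
        tokens = encoder_layer(tokens, msa, mlp)
    return layer_norm(tokens[0])


# Hypothetical stand-in sublayers, used only to make the sketch runnable.
def mix_tokens(tokens):
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def halve(vec):
    return [0.5 * v for v in vec]


feature = encoder([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                  [(mix_tokens, halve)] * 2)
```

Returning only the normalized first token mirrors the statement that the final normalization layer acts on the first-dimension vector, which then serves as the feature passed on to the next unit.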

時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムの構築方法であって、
サンプルを収集してデータセットを構築するステップであって、前記データセットの各サンプルは１人の患者のｓ個の時相のＣＴ画像を含むステップと、
上記時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムを構築し、データセット中の各サンプルをシステムの入力として、システムの出力する分類結果と分類ラベルとの誤差を最小にすることを目標としてトレーニングし、前記時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムを得るステップと、を含む。 A method for constructing a multi-temporal CT image classification system based on a spatio-temporal attention model, the method comprising:
collecting samples to construct a dataset, each sample of the dataset including CT images of s phases of one patient; and
constructing the above multi-temporal CT image classification system based on the spatio-temporal attention model, feeding each sample in the dataset to the system as input, and training with the goal of minimizing the error between the classification result output by the system and the classification label, thereby obtaining the multi-temporal CT image classification system based on the spatio-temporal attention model.
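The training objective is stated only as minimizing the error between the system's classification result and the classification label. As an illustration, the sketch below assumes a softmax cross-entropy loss, which is a common choice for classification but is not named in this section; the function names and logits are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, batch_labels):
    """Average loss over a batch; training adjusts weights to reduce it."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# The same outputs scored against matching and against mismatching labels.
outputs = [[4.0, 0.0], [0.0, 4.0]]
loss_when_right = mean_loss(outputs, [0, 1])
loss_when_wrong = mean_loss(outputs, [1, 0])
```

The loss is small when the largest output coincides with the label and large otherwise, which is the property an optimizer exploits when it repeatedly updates the network weights.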

本発明の有益な効果は次のとおりである。
（１）本発明は、空間的アテンションネットワーク及び時間的アテンションネットワークの２種類のアテンションネットワークを含む、時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムを提案する。空間的アテンションネットワークはＣＴ画像の空間的特徴を抽出することができ、時間的アテンションネットワークは異なる時相のＣＴ画像間のつながりを抽出することができ、各時相のＣＴ間のグローバルなアテンションを強化する。 The beneficial effects of the present invention are as follows.
(1) The present invention proposes a multi-temporal CT image classification system based on a spatio-temporal attention model, which includes two types of attention networks: a spatial attention network and a temporal attention network. The spatial attention network extracts the spatial features of CT images, and the temporal attention network extracts the relationships between CT images of different phases, strengthening the global attention across the CT images of all phases.

（２）本発明は、多時相のＣＴ画像に基づいて診断する必要のある様々な疾患に普遍性を有し、異なる時相の病巣の特徴をより効果的に利用して、時間的なつながりを強化し、従来の畳み込みニューラルネットワークをメインモデルとした設計を捨て、アテンションメカニズムにより、より多くの計算を重要な領域に投入し、注目すべき目標のより多くの詳細情報を得ることができ、これによって他の無用な情報を抑制し、計算の冗長性と遅延を低減し、ＣＴ画像の診断をより短時間で実現しやすく、診断の精度を高くし、診断の効果をより安定的にする。 (2) The present invention applies generally to the various diseases that must be diagnosed from multi-temporal CT images. It uses the lesion features of different phases more effectively and strengthens the temporal links between them. It abandons designs that use a conventional convolutional neural network as the main model; instead, the attention mechanism devotes more computation to important regions and obtains more detail about the target of interest while suppressing irrelevant information. This reduces computational redundancy and latency, makes it easier to diagnose from CT images in less time, raises diagnostic accuracy, and makes diagnostic performance more stable.

本発明の時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムの構造図である。 FIG. 1 is a structural diagram of the multi-temporal CT image classification system based on the spatio-temporal attention model of the present invention.
本発明の時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムによる分類のフローチャートである。 FIG. 2 is a flowchart of classification performed by the multi-temporal CT image classification system based on the spatio-temporal attention model of the present invention.
本発明の時空間的アテンションモデルに基づく肝がんの多時相ＣＴ画像分類システムの構築方法のフローチャートである。 FIG. 3 is a flowchart of the method for constructing a multi-temporal CT image classification system for liver cancer based on the spatio-temporal attention model of the present invention.

例示的な実施例をここで詳細に説明し、その例示は添付の図面に示される。以下の説明が図面に言及している場合、特に断りのない限り、異なる図面の同じ番号は、同じ又は類似の要素を示す。以下の例示的な実施例に記載の実施形態は、本発明と一致する全ての実施形態を表すわけではない。それどころか、それらは、添付の特許請求の範囲に詳述されているような、本発明のいくつかの態様と一致する装置及び方法の例に過ぎない。 Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to drawings, the same numbers in different drawings indicate the same or similar elements, unless stated otherwise. The embodiments described in the illustrative examples below do not represent all embodiments consistent with the present invention. On the contrary, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

本発明で使用される用語は、特定の実施例を説明するためのものに過ぎず、本発明を限定するものではない。 The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the invention.

本発明及び添付の特許請求の範囲で使用される単数形「一種」、「前記」及び「当該」は、文脈が明らかに他の意味を示さない限り、複数形も含むことを意図している。また、本明細書において使用される用語「及び／又は」は、１つ又は複数の関連する列挙された項目の任意の又は全ての可能な組み合わせを指し、包含することを理解されたい。 As used in the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

なお、本発明では「第１」、「第２」、「第３」などの用語で様々な情報を説明するかもしれないが、これらの情報はこれらの用語に限定されない。これらの用語は同じタイプの情報を互いに区別するためだけに用いられる。例えば、本発明の範囲から逸脱しない限り、第１情報は第２情報とも呼ばれてもよいし、同様に、第２情報は第１情報と呼ばれてもよい。文脈によって、ここで使用される言葉「もし」は、「…とき」又は「…と」又は「決定に応じて」と解釈されてもよい。 Although the present invention may describe various information using terms such as "first," "second," and "third," the information is not limited by these terms. The terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of the invention, first information may also be referred to as second information, and similarly, second information may be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining."

本発明の趣旨は、時空間的アテンションモデルに基づく多時相ＣＴ画像分類システム及び構築方法を提案することであり、従来の畳み込みニューラルネットワークに基づいて、多時相ＣＴ画像を組み合わせて処理できないという問題を解決する。なお、本発明の多時相ＣＴ画像は、臨床で通常どおりスキャンするＣＴ画像及び造影剤注射後にスキャンする造影ＣＴ画像を含み、通常どおりスキャンするＣＴ画像は単純スキャン相ＣＴ画像であり、造影剤注射後にスキャンする造影ＣＴ画像は動脈相、門脈相、遅発相のＣＴ画像を含む。 The purpose of the present invention is to propose a multi-temporal CT image classification system and construction method based on a spatio-temporal attention model, solving the problem that approaches based on conventional convolutional neural networks cannot jointly process multi-temporal CT images. The multi-temporal CT images of the present invention include CT images scanned in routine clinical practice and contrast-enhanced CT images scanned after contrast agent injection; the routinely scanned CT images are plain-scan-phase CT images, and the contrast-enhanced CT images include arterial-phase, portal-venous-phase, and delayed-phase CT images.

本発明の時空間的アテンションモデルに基づく多時相ＣＴ画像分類システムは、図１に示されるとおり、以下を含む。
データ取得ユニットであって、分類される患者のｓ個の時相のＣＴ画像を取得するために用いられ、
ｓ個の第１埋め込み層ネットワークを含む第１埋め込み層ネットワークユニットであって、第１埋め込み層ネットワークは、それぞれ、各時相のＣＴ画像を複数の画像ブロックに分割し、各画像ブロックをそれぞれ画像ブロックベクトルに展開し、全ての画像ブロックベクトルとクラスラベルベクトルを合わせた後に同次元の位置ベクトルと加算して対応する時相のＣＴ画像の埋め込みベクトルを得るために用いられ、各時相のＣＴ画像のサイズは
Ｈ×Ｗ×Ｃであり、Ｈ、Ｗは一枚のＣＴ画像の長さ、幅であり、ＣはＣＴ画像の層数である。分割後の画像ブロックのサイズはＰ×Ｐ×Ｃであり、Ｐは分割後の画像ブロックの長さ及び幅であり、各画像ブロックは畳み込み層に通じて画像ブロックベクトルに展開され、埋め込みベクトルＸ_０として線形投影され、埋め込みベクトルＸ_０は、
$$X_0 = [X_{class};\ X_p^1;\ X_p^2;\ \ldots;\ X_p^N] + X_{pos}, \quad X_p \in \mathbb{R}^{1 \times D},\ X_{pos} \in \mathbb{R}^{(1+N) \times D} \qquad (1)$$
であり、
ただし、Ｘ_{ｃｌａｓｓ}はクラスラベルベクトルを表し、Ｘ_ｐｏｓは位置ベクトルを表し、Ｘ_ｐは線形化後の画像ブロックベクトルを表し、Ｎは分割後の画像ブロックの数を表し、Ｎ＝ＨＷ／Ｐ^２である。Ｄは畳み込み層の畳み込みカーネルの数であり、畳み込み層を通過した画像ブロックベクトルと学習可能なクラスラベルベクトルを合わせることで、ラベルベクトル全体の表現情報を集めることができ、さらに学習可能な同次元の位置ベクトルと加算すると、データ情報を強調することができる。 The multi-temporal CT image classification system based on the spatio-temporal attention model of the present invention, as shown in FIG. 1, includes the following:
a data acquisition unit, which is used to acquire s time phase CT images of a patient to be classified;
a first embedding layer network unit including s first embedding layer networks, each of which divides the CT image of one phase into a plurality of image blocks, flattens each image block into an image block vector, concatenates all image block vectors with a class label vector, and adds a position vector of the same dimension to obtain the embedding vector of the CT image of the corresponding phase. The size of the CT image of each phase is H×W×C, where H and W are the length and width of one CT image and C is the number of layers of the CT image. The size of each image block after division is P×P×C, where P is the length and width of the image block. Each image block is flattened into an image block vector by a convolution layer and linearly projected, and the resulting embedding vector X_0 is
$$X_0 = [X_{class};\ X_p^1;\ X_p^2;\ \ldots;\ X_p^N] + X_{pos}, \quad X_p \in \mathbb{R}^{1 \times D},\ X_{pos} \in \mathbb{R}^{(1+N) \times D} \qquad (1)$$
where X_class denotes the class label vector, X_pos denotes the position vector, X_p denotes an image block vector after linearization, and N denotes the number of image blocks after division, N = HW/P^2. D is the number of convolution kernels in the convolution layer. Concatenating the image block vectors output by the convolution layer with the learnable class label vector makes it possible to gather the representation information of the entire label vector, and adding the learnable position vector of the same dimension enhances the data information.
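The embedding step in equation (1) can be illustrated with a small plain-Python sketch. It assumes that the convolution layer uses kernel size and stride P, in which case it is equivalent to a linear projection of each flattened P×P×C block onto D outputs; that configuration, the function names, and the toy weights are assumptions made for illustration.

```python
def split_into_blocks(image, P):
    """Split an H x W x C nested list into N = HW / P^2 flattened blocks."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "H and W must be multiples of P"
    blocks = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            blocks.append([value
                           for r in range(top, top + P)
                           for c in range(left, left + P)
                           for value in image[r][c]])
    return blocks


def embed(image, P, weight, x_class, x_pos):
    """X_0 = [x_class; x_p^1; ...; x_p^N] + x_pos with a linear projection."""
    projected = [[sum(w * v for w, v in zip(row, block)) for row in weight]
                 for block in split_into_blocks(image, P)]
    tokens = [x_class] + projected  # (1 + N) rows of length D
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, x_pos)]


# Toy example: H = W = 4, C = 1, P = 2, so N = 4 blocks; D = 3.
image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
weight = [[1.0, 0.0, 0.0, 0.0],        # D rows, each of length P * P * C
          [0.0, 1.0, 0.0, 0.0],
          [0.25, 0.25, 0.25, 0.25]]
x_class = [0.0, 0.0, 0.0]
x_pos = [[0.0, 0.0, 0.0] for _ in range(5)]  # shape (1 + N) x D
X0 = embed(image, 2, weight, x_class, x_pos)
```

In a trained network `weight`, `x_class`, and `x_pos` would all be learnable parameters; they are fixed here so the shapes (1 + N) × D are easy to follow.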

The multi-temporal CT image classification system also includes a spatial attention unit including s spatial attention networks. Each spatial attention network includes L1 layers of first multi-head attention networks MSA, L1 layers of first multilayer perceptrons, and one first normalization layer; the L1 first multi-head attention networks MSA and the L1 first multilayer perceptrons are connected alternately in sequence. Each first multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing (concatenation) layer. A self-attention module SA converts the normalized input vector into three different matrices: a query matrix $Q_{1i}$, a key matrix $K_{1i}$, and a value matrix $V_{1i}$. Specifically, the input vector is first converted into three different vectors: a query vector q, a key vector k, and a value vector v. The query vector q is used to match against other vectors, the key vector k is the vector being matched, and the value vector v represents the information to be extracted; q, k, and v are each obtained by multiplying the input vector by a learnable matrix. Because the embedding vector is multidimensional, the operation is expressed globally as
$$Q_{1i} = X W_{1i}^{Q}, \qquad K_{1i} = X W_{1i}^{K}, \qquad V_{1i} = X W_{1i}^{V} \qquad (2)$$
where $W_{1i}^{Q}$, $W_{1i}^{K}$, and $W_{1i}^{V}$ represent the trainable weight matrices of the i-th self-attention module, and X represents the input vector.

An attention function between the vectors of the input is generated from the three matrices $Q_{1i}$, $K_{1i}$, and $V_{1i}$. Specifically, the dot product of the query vector q with each key vector k is computed, the dot product is divided by the square root of the dimension of the key vector k, the result is passed through a softmax layer, and the resulting weights are multiplied by the value vectors v and summed. The softmax function maps its input values to the interval (0, 1). The attention function between the input vectors is
$$head_{1i} = \mathrm{Attention}(Q_{1i}, K_{1i}, V_{1i}) = \mathrm{softmax}\!\left(\frac{Q_{1i} K_{1i}^{T}}{\sqrt{d_k}}\right) V_{1i} \qquad (3)$$
where $d_k$ represents the dimension of each key vector k in the key matrix $K_{1i}$, softmax() is the softmax function, and $head_{1i}$ represents the output of the i-th self-attention module SA.
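A single self-attention module SA, as described by the query, key, and value projections and the scaled dot-product attention above, can be sketched as follows. The token count and dimensions are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """One self-attention module SA: projections, then scaled dot-product."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # Q = XW^Q, K = XW^K, V = XW^V
    d_k = k.shape[-1]                          # dimension of each key vector
    weights = softmax(q @ k.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ v, weights                # weighted sum of value vectors

rng = np.random.default_rng(0)
n_tokens, D, d_k = 65, 32, 8                   # illustrative sizes
x = rng.standard_normal((n_tokens, D))
w_q, w_k, w_v = (rng.standard_normal((D, d_k)) for _ in range(3))
head, weights = self_attention(x, w_q, w_k, w_v)
print(head.shape, weights.shape)               # (65, 8) (65, 65)
```

Dividing by the square root of d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a regime with very small gradients.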

The splicing layer concatenates the attention functions output by the self-attention modules SA to obtain the final spatial attention function, expressed as
$$\mathrm{MSA}() = \mathrm{Concat}(head_{11}, \dots, head_{1i}, \dots)\, W_{1}^{O} \qquad (4)$$
where MSA() is the output of the spatial attention network and $W_{1}^{O}$ is a trainable weight matrix.
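The splicing step can be illustrated with a small multi-head sketch: every head runs its own scaled dot-product attention, the head outputs are concatenated along the feature axis, and the result is multiplied by an output weight matrix. The number of heads and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) triples; w_o: (n_heads * d_k, D)."""
    outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        outputs.append(softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v)
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(outputs, axis=-1) @ w_o

rng = np.random.default_rng(0)
n_tokens, D, n_heads = 65, 32, 4               # illustrative sizes
d_k = D // n_heads
x = rng.standard_normal((n_tokens, D))
heads = [tuple(rng.standard_normal((D, d_k)) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.standard_normal((n_heads * d_k, D)) * 0.1
out = multi_head_attention(x, heads, w_o)
print(out.shape)                               # (65, 32)
```

Choosing d_k = D / n_heads is a common convention, not something the text specifies; it makes the concatenated heads have the same width as the input so that the residual addition in the following step is well defined.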

Through the multi-head attention module, the network can compare the relationships between different vectors and strengthen the important parts. A first multilayer perceptron MLP follows each first multi-head attention network MSA. MLP denotes a multilayer perceptron that uses the GELU function as its nonlinear layer; GELU is chosen because it is a high-performing neural network activation function whose nonlinear transformation corresponds to the expectation of a stochastic regularizer. Specifically, the sum of the final spatial attention function and the input vector is taken as the input vector of the first multilayer perceptron of the next layer:
$$x'_{l} = \mathrm{MSA}(\mathrm{LN}(x_{l})) + x_{l}, \qquad l = 1 \dots L1 \qquad (5)$$
where LN represents the normalization method, $x'_{l}$ represents the input vector of the first multilayer perceptron, MSA() represents the output of the first multi-head attention network, and $x_{l}$ represents the input vector of the l-th layer first multi-head attention network.

The first multilayer perceptron encodes the normalized input vector and adds the result to that input vector; the sum is the input vector of the first multi-head attention network MSA of the next layer:
$$x_{l} = \mathrm{MLP}(\mathrm{LN}(x'_{l-1})) + x'_{l-1}, \qquad l = 2 \dots L1 \qquad (6)$$
where MLP() represents the output of the first multilayer perceptron and $x'_{l-1}$ represents the input vector of the (l-1)-th layer first multilayer perceptron.

The input vector of the first-layer first multi-head attention network MSA is the embedding vector, that is, $x_1 = X_0$. The first normalization layer normalizes the first-dimension vector of the vector obtained by adding the output vector of the final-layer first multilayer perceptron to its input vector, and the normalized vector, $\mathrm{LN}(x_{L}^{0})$, is used as the spatial feature of the CT image of the corresponding time phase:
(7)
where $x_{L}^{0}$ represents the first-dimension data of $x_{L}$ after passing through all encoding layers, and L = 2L1.
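The alternating MSA and MLP layers with residual additions, followed by normalization of the class-token row, can be sketched as below. This is an illustration under several assumptions: the "first-dimension vector" is read as the first row of the token matrix, layer normalization is written without learnable scale and shift, the GELU uses the common tanh approximation, and all sizes and helper names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)       # no learnable scale/shift here

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msa(x, p):
    heads = []
    for w_q, w_k, w_v in p["heads"]:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        heads.append(softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v)
    return np.concatenate(heads, axis=-1) @ p["w_o"]

def mlp(x, p):
    return gelu(x @ p["w1"]) @ p["w2"]

def spatial_attention_network(x0, layers):
    x = x0                                          # x_1 = X_0
    for p in layers:                                # L1 pairs of MSA and MLP
        x_prime = msa(layer_norm(x), p) + x         # residual around MSA
        x = mlp(layer_norm(x_prime), p) + x_prime   # residual around MLP
    return layer_norm(x[0])                         # normalized class-token row

def init_layer(rng, D, n_heads, hidden, scale=0.02):
    d_k = D // n_heads
    return {
        "heads": [tuple(rng.standard_normal((D, d_k)) * scale for _ in range(3))
                  for _ in range(n_heads)],
        "w_o": rng.standard_normal((n_heads * d_k, D)) * scale,
        "w1": rng.standard_normal((D, hidden)) * scale,
        "w2": rng.standard_normal((hidden, D)) * scale,
    }

rng = np.random.default_rng(0)
n_tokens, D, L1 = 65, 32, 2                         # illustrative sizes
layers = [init_layer(rng, D, n_heads=4, hidden=64) for _ in range(L1)]
feature = spatial_attention_network(rng.standard_normal((n_tokens, D)), layers)
print(feature.shape)                                # (32,)
```

The output is one D-dimensional vector per time phase, which is the form the second embedding layer network expects.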

For the CT images of the plain scan phase, arterial phase, portal venous phase, and delayed phase, the spatial features of the corresponding plain scan phase, arterial phase, portal venous phase, and delayed phase CT images are obtained respectively.
A second embedding layer network unit including one second embedding layer network, which concatenates the s spatial features of the CT images of the corresponding time phases output by the s spatial attention networks and then concatenates them with a class label vector to obtain the embedding layer vector x:
$$x = [X_{class};\ x_{space}], \qquad x_{space} \in \mathbb{R}^{s \times D}, \qquad X_{class} \in \mathbb{R}^{1 \times D} \qquad (8)$$
where $x_{space}$ represents the concatenated spatial features.
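As a small illustration of the second embedding step, the sketch below stacks s per-phase spatial features into a matrix of shape s × D and places a class label vector on top. The function name and the sizes are hypothetical.

```python
import numpy as np

def second_embedding(spatial_features, x_class):
    """Stack s per-phase features (each of length D) under a class label vector."""
    x_space = np.stack(spatial_features, axis=0)        # (s, D)
    return np.concatenate([x_class, x_space], axis=0)   # (1 + s, D)

rng = np.random.default_rng(0)
s, D = 4, 32                       # four phases; D is illustrative
feats = [rng.standard_normal(D) for _ in range(s)]
x_class = np.zeros((1, D))         # learnable in practice; zeros here
x = second_embedding(feats, x_class)
print(x.shape)                     # (5, 32)
```

Each row after the first corresponds to one time phase, so attention over these rows relates the phases to one another.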

The multi-temporal CT image classification system also includes a temporal attention unit including one temporal attention network, whose structure and function are the same as those of the spatial attention network. Specifically, it includes L2 layers of second multi-head attention networks MSA, L2 layers of second multilayer perceptrons, and one second normalization layer; the L2 second multi-head attention networks MSA and the L2 second multilayer perceptrons are connected alternately in sequence. Each second multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing layer. According to equation (2), the self-attention module SA converts the normalized input vector into three different matrices, a query matrix $Q_{2j}$, a key matrix $K_{2j}$, and a value matrix $V_{2j}$, and according to equation (3) it generates the attention function between the vectors of the input from these three matrices, where j is the index of the self-attention module SA in the temporal attention unit. According to equation (4), the splicing layer concatenates the attention functions output by the self-attention modules SA to obtain the final temporal attention function. According to equation (5), the sum of the final temporal attention function and the input vector is taken as the input vector of the second multilayer perceptron of the next layer. According to equation (6), the second multilayer perceptron encodes the normalized input vector and adds the result to that input vector, and the sum is the input vector of the second multi-head attention network MSA of the next layer. The input vector of the first-layer second multi-head attention network MSA is the embedding layer vector output by the second embedding layer network unit. The second normalization layer normalizes the first-dimension vector of the vector obtained by adding the output vector of the final-layer second multilayer perceptron to its input vector, giving a vector $x_{time}$ that has both spatial features and temporal features.
A classification layer unit including a classification layer W, which obtains the classification result Prob from the vector having spatial features and temporal features:
$$Prob = W(x_{time}^{T}) \qquad (9)$$
where $Prob \in \mathbb{R}^{C}$ represents the probability distribution over the classes and C represents the total number of classes (here C denotes the class count, not the number of CT image layers).
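A minimal sketch of the classification layer follows. The text states only that W maps the spatio-temporal feature vector to a probability distribution over classes; the use of a linear layer followed by a softmax is an assumption made here for illustration, and the bias term `b` is likewise hypothetical.

```python
import numpy as np

def classify(x_time, w, b):
    """Linear classification layer followed by softmax (softmax is assumed)."""
    logits = w @ x_time + b
    e = np.exp(logits - logits.max())          # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
D, n_classes = 32, 2                           # two classes, e.g. ICC (0), HCC (1)
x_time = rng.standard_normal(D)                # stand-in for the real feature
w = rng.standard_normal((n_classes, D)) * 0.1
b = np.zeros(n_classes)
prob = classify(x_time, w, b)
predicted_class = int(np.argmax(prob))
print(prob.shape, predicted_class in (0, 1))   # (2,) True
```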

FIG. 2 is a flowchart of classification by the multi-temporal CT image classification system based on the spatio-temporal attention model of the present invention. The steps are as follows.
The CT images of the s time phases of the patient to be classified, acquired by the data acquisition unit, are input to the first embedding layer network unit. Each first embedding layer network divides the CT image of its single time phase into a plurality of image blocks, expands each image block into an image block vector, concatenates all image block vectors with the class label vector, and adds the result to a position vector of the same dimension to obtain the embedding vector of the CT image of that time phase.
The embedding vector of the CT image of each time phase is input to the corresponding spatial attention network in the spatial attention unit to obtain the spatial feature of the CT image of that time phase.
The s spatial features output by the s spatial attention networks are input to the second embedding layer network unit, which concatenates them and then stacks them with a class label vector to form the embedding layer vector.
The embedding layer vector is input to the temporal attention unit to obtain a vector having spatial features and temporal features. Finally, this vector is input to the classification layer unit, which outputs the final classification result.
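The data flow of the classification procedure can be summarized as a wiring sketch. The function `classify_patient` and its arguments are hypothetical names; the toy stand-ins below are deliberately trivial (simple projections and means, not attention networks) and serve only to exercise the order of the units and the shapes passed between them.

```python
import numpy as np

def classify_patient(phase_images, embedders, spatial_nets,
                     second_embed, temporal_net, classifier):
    """Wire the units together in the order given by the flowchart."""
    assert len(phase_images) == len(embedders) == len(spatial_nets)
    # One embedding network and one spatial attention network per time phase.
    spatial_features = [
        net(embed(image))
        for image, embed, net in zip(phase_images, embedders, spatial_nets)
    ]
    x = second_embed(spatial_features)   # (1 + s, D) embedding layer vector
    x_time = temporal_net(x)             # (D,) spatio-temporal feature
    return classifier(x_time)            # class probabilities

# Toy stand-ins (NOT the attention networks), used only to test the wiring.
s, D, n_classes = 4, 8, 2
rng = np.random.default_rng(0)
proj = rng.standard_normal((16, D))
embedders = [lambda img: img.reshape(-1, 16) @ proj] * s
spatial_nets = [lambda tokens: tokens.mean(axis=0)] * s
second_embed = lambda feats: np.vstack([np.zeros((1, D)), np.stack(feats)])
temporal_net = lambda x: x.mean(axis=0)
w = rng.standard_normal((n_classes, D))

def classifier(x_time):
    z = w @ x_time
    e = np.exp(z - z.max())
    return e / e.sum()

phases = [rng.standard_normal((8, 8, 4)) for _ in range(s)]
prob = classify_patient(phases, embedders, spatial_nets,
                        second_embed, temporal_net, classifier)
print(prob.shape)                        # (2,)
```

Keeping one embedder and one spatial network per phase mirrors the description, in which the s networks are separate and are not shared across phases.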

The system of the present invention classifies CT images based on the differences that different tumor types or subtypes show in CT images, and thereby supports diagnostic classification of tumor type and stage. The system can be used to classify two or more types of tumors; the specific classes are determined by how the system is constructed. For example, liver cancer is generally divided into two types, primary and secondary. Primary liver malignancies arise from the epithelial or mesenchymal tissue of the liver, whereas secondary (also called metastatic) liver cancer refers to malignant tumors originating in other organs of the body that invade the liver. Liver metastases commonly come from malignant tumors of organs such as the stomach, biliary tract, pancreas, colon and rectum, ovary, uterus, lung, and breast.

FIG. 3 is a flowchart of the method for constructing the liver cancer multi-temporal CT image classification system based on the spatio-temporal attention model of the present invention. The method specifically includes the following steps.
(1) Collect samples to construct a dataset; each sample of the dataset includes liver cancer CT images of s time phases of one patient.
Take the binary classification of liver cancer CT images into hepatocellular carcinoma and intrahepatic cholangiocarcinoma as an example. Hepatocellular carcinoma (HCC) is a primary liver cancer with a high mortality rate. Intrahepatic cholangiocarcinoma (ICC) refers to adenocarcinoma arising from the epithelium of the second-order bile ducts and their branches, and is the primary liver malignancy with the second-highest incidence after hepatocellular carcinoma. A total of 400 samples were collected, 200 HCC samples and 200 ICC samples, and all samples were labeled by professional medical imaging physicians, specifically as follows.
(1.1) First, plain scan phase liver CT images and contrast-enhanced CT images (arterial phase, portal venous phase, and delayed phase liver CT images) of liver cancer patients are collected from hospitals. Data screening selects the patient data with complete study information, and data masking removes the patients' private personal information, which protects patient privacy and improves data confidentiality. In total, 400 cases of liver CT images and the corresponding liver function test reports are collected from HCC and ICC patients, 200 from HCC patients and 200 from ICC patients. The cases are labeled according to class: HCC patients are labeled 1 and ICC patients are labeled 0.

(1.2) Professional medical imaging physicians annotate and segment the lesion regions in the liver CT images of the four time phases to construct the dataset.

Furthermore, because of differences between patients, the examining physician may set different numbers of scans for different patients, so the number of slices in the original CT images varies. For convenience of study, the size and number of the CT images of each time phase are defined uniformly. In this embodiment, the liver CT image of each sample has size 64 × 128 × 128 × 4, where 64 is the number of layers of the liver CT image of each time phase, 128 and 128 are the length and width of each liver CT image, and 4 denotes the four time phases.
In addition, data augmentation is applied so that more value can be obtained from the data when the data is insufficient: taking the preprocessed four-phase liver CT images as input, operations such as random rotation and random flipping are performed to supplement the samples of the dataset.
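The augmentation step can be sketched as follows. The text names only "random rotation" and "random flipping"; restricting rotation to multiples of 90 degrees in the image plane, flipping left to right with probability 0.5, and applying the same transform to all four phases are assumptions made here so that the example needs only NumPy. The axis order (slices, height, width, phases) follows the 64 × 128 × 128 × 4 sample size.

```python
import numpy as np

def augment(sample, rng):
    """sample: (64, 128, 128, 4) = (slices, height, width, phases)."""
    k = int(rng.integers(0, 4))                  # number of 90-degree turns
    out = np.rot90(sample, k=k, axes=(1, 2))     # rotate within the image plane
    if rng.random() < 0.5:
        out = np.flip(out, axis=2)               # left-right flip
    return np.ascontiguousarray(out)

rng = np.random.default_rng(0)
sample = rng.standard_normal((64, 128, 128, 4))
augmented = augment(sample, rng)
print(augmented.shape)                           # (64, 128, 128, 4)
```

Because height and width are equal, rotations by multiples of 90 degrees preserve the sample shape, and applying one transform to the whole array keeps the four phases spatially aligned with each other.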

(2) Construct the multi-temporal CT image classification system based on the spatio-temporal attention model described above, which includes the data acquisition unit, the first embedding layer network unit, the spatial attention unit, the second embedding layer network unit, the temporal attention unit, and the classification layer unit. Train the system using each sample in the dataset as its input, with the goal of minimizing the error between the classification result output by the system and the classification label. Taking the binary cross-entropy loss function as an example of computing this error:
$$Loss = -y \log(Prob) - (1 - y) \log(1 - Prob) \qquad (10)$$
where $y \in \{0, 1\}$, 0 represents an ICC patient, and 1 represents an HCC patient.
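The binary cross-entropy loss can be written directly in NumPy. In this sketch `prob` is taken to be the predicted probability of the class labeled 1 (HCC), which is what the formula implies; the clipping constant `eps` is an added numerical safeguard and is not part of the formula.

```python
import numpy as np

def bce_loss(prob, y, eps=1e-7):
    """Binary cross-entropy; prob is the predicted probability that y == 1."""
    prob = np.clip(prob, eps, 1.0 - eps)       # avoid log(0)
    return -(y * np.log(prob) + (1 - y) * np.log(1 - prob))

# A confident correct prediction gives a small loss, a confident wrong one a large loss.
loss_correct = bce_loss(0.9, 1)                # HCC case predicted as HCC
loss_wrong = bce_loss(0.9, 0)                  # ICC case predicted as HCC
print(loss_correct < loss_wrong)               # True
```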

The whole system is optimized with a stochastic gradient descent algorithm, the goal being to find the minimum error loss and finally obtain the optimal classification model. In this embodiment, the Adam stochastic optimization algorithm is used for gradient backpropagation and optimization, with the learning rate set to 0.0001, finally giving a multi-temporal CT image classification system based on the spatio-temporal attention model that performs binary classification of hepatocellular carcinoma and intrahepatic cholangiocarcinoma.
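For reference, a single Adam update can be sketched as below. Only the learning rate of 0.0001 comes from the embodiment; the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as are the function and variable names.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and the moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2    # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

theta = np.zeros(2)
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
theta = adam_step(theta, np.array([2.0, -3.0]), state)
print(theta)   # each entry moves by about lr, against the sign of its gradient
```

On the first step the bias-corrected update has magnitude close to the learning rate regardless of the gradient scale, which is one reason a small rate such as 0.0001 gives stable early training.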

The method of the present invention is generally applicable to the various diseases that must be diagnosed from multi-phase CT images. It uses the lesion features of the different time phases more effectively and strengthens the temporal relationships between them. It abandons the conventional design that uses a convolutional neural network as the main model; instead, the attention mechanism devotes more computation to the important regions and obtains more detailed information about the target of interest, which suppresses other useless information and reduces computational redundancy and delay. As a result, CT image diagnosis can be carried out in less time, with higher diagnostic accuracy and more stable diagnostic performance.

Note that the examples described herein do not limit the embodiments of the present invention; they are given only to explain the invention clearly. Those skilled in the art can make modifications or changes in various other forms based on the above description. It is neither necessary nor possible to list all embodiments here. All modifications, equivalent substitutions, and improvements within the spirit and principles of the present invention are included in the scope of its claims. The scope of protection claimed by the present invention is defined by the content of the claims, and the description of the embodiments and other parts of the specification may be used to interpret the content of the claims. Obvious changes or modifications derived therefrom remain within the scope of protection of the present invention.

Claims

A multi-temporal CT image classification system based on a spatio-temporal attention model, which classifies tumors by obtaining spatial features and temporal features from CT images, the system comprising:
a data acquisition unit for acquiring CT images of s time phases of a patient to be classified, where s is 2 or more, and the CT images of the s time phases include at least two of a plain scan phase CT image, an arterial phase CT image in contrast-enhanced CT, a portal venous phase CT image in contrast-enhanced CT, and a delayed phase CT image in contrast-enhanced CT;
a first embedding layer network unit including s first embedding layer networks, wherein each of the s first embedding layer networks divides the CT image of the corresponding time phase into a plurality of image blocks, expands each image block by a convolution layer to obtain a linearized image block vector, concatenates all the image block vectors with a class token vector, and adds the result to a position vector of the same dimension to obtain the embedding vector of the CT image of the corresponding time phase, where the size of the CT image of each time phase is $H \times W \times C$, H and W are the length and width of one CT image, C is the number of layers of the CT image, the size of each image block after division is $P \times P \times C$, P is the length and width of the image block after division, and the embedding vector is
$$X_0 = [X_{class};\ X_p^1;\ X_p^2;\ \dots;\ X_p^N] + X_{pos}, \qquad X_{pos} \in \mathbb{R}^{(1+N) \times D} \qquad (1)$$
where $X_{class}$ represents the class token vector, $X_{pos}$ represents the position vector, $X_p$ represents the image block vector after linearization, $\mathbb{R}$ represents the set of all real numbers, N represents the number of image blocks after division with $N = HW/P^2$, and D is the number of convolution kernels of the convolution layer;
a spatial attention unit including s spatial attention networks, wherein each of the s spatial attention networks includes L1 layers of first multi-head attention networks MSA, L1 layers of first multilayer perceptrons, and one first normalization layer, L1 is any positive integer, and the L1 first multi-head attention networks MSA and the L1 first multilayer perceptrons are connected alternately in sequence;
the first multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing layer; the self-attention module SA is used to convert the normalized input vector into three different matrices, a query matrix $Q_{1i}$, a key matrix $K_{1i}$, and a value matrix $V_{1i}$; the input vector is first converted into three different vectors, a query vector q, a key vector k, and a value vector v, among which the query vector q is used to match against other vectors, the key vector k is the vector being matched, and the value vector v represents the information to be extracted; the three vectors q, k, and v are obtained by multiplying the input vector by learnable matrices; and, because the embedding vector is multidimensional, the operation is expressed globally as
$$Q_{1i} = X W_{1i}^{Q}, \qquad K_{1i} = X W_{1i}^{K}, \qquad V_{1i} = X W_{1i}^{V} \qquad (2)$$
where $W_{1i}^{Q}$, $W_{1i}^{K}$, and $W_{1i}^{V}$ represent the trainable weight matrices of the i-th self-attention module, and X represents the input vector;
an attention function between the vectors of the input is generated from the three matrices $Q_{1i}$, $K_{1i}$, and $V_{1i}$: the dot product of the query vector q with each key vector k is computed, the dot product is divided by the square root of the dimension of the key vector k, the result is passed through a softmax layer, and the resulting weights are multiplied by the value vectors v and summed; the softmax function is used to map its input values to the interval (0, 1); and the attention function is given by equation (3),
$$head_{1i} = \mathrm{Attention}(Q_{1i}, K_{1i}, V_{1i}) = \mathrm{softmax}\!\left(\frac{Q_{1i} K_{1i}^{T}}{\sqrt{d_k}}\right) V_{1i} \qquad (3)$$
where $d_k$ represents the dimension of each key vector k in the key matrix $K_{1i}$, softmax() is the softmax function, and $head_{1i}$ represents the output of the i-th self-attention module SA;
the splicing layer is used to concatenate the attention functions output by the self-attention modules SA to obtain the final spatial attention function, and is expressed by equation (4),
$$\mathrm{MSA}() = \mathrm{Concat}(head_{11}, \dots, head_{1i}, \dots)\, W_{1}^{O} \qquad (4)$$
where MSA() is the output of the spatial attention network and $W_{1}^{O}$ is a trainable weight matrix;
the sum of the final spatial attention function and the input vector is taken as the input vector of the first multilayer perceptron of the next layer,
$$x'_{l} = \mathrm{MSA}(\mathrm{LN}(x_{l})) + x_{l}, \qquad l = 1 \dots L1 \qquad (5)$$
where LN represents the normalization method, $x'_{l}$ represents the input vector of the first multilayer perceptron, MSA() represents the output of the first multi-head attention network, and $x_{l}$ represents the input vector of the l-th layer first multi-head attention network;
the first multilayer perceptron encodes the normalized input vector and adds the result to that input vector, and the sum is taken as the input vector of the first multi-head attention network MSA of the next layer,
$$x_{l} = \mathrm{MLP}(\mathrm{LN}(x'_{l-1})) + x'_{l-1}, \qquad l = 2 \dots L1 \qquad (6)$$
where MLP() represents the output of the first multilayer perceptron and $x'_{l-1}$ represents the input vector of the (l-1)-th layer first multilayer perceptron; and
the input vector of the first-layer first multi-head attention network MSA is the embedding vector, and the first normalization layer is used to normalize the first-dimension vector of the vector obtained by adding the output vector of the final-layer first multilayer perceptron to its input vector, to obtain the spatial feature of the CT image of the corresponding time phase;
a second embedding layer network unit including one second embedding layer network, which concatenates the s spatial features of the CT images of the corresponding time phases output by the s spatial attention networks and then concatenates them with a class token vector to obtain the embedding layer vector x according to the following formula,
$$x = [X_{class};\ x_{space}], \qquad x_{space} \in \mathbb{R}^{s \times D}, \qquad X_{class} \in \mathbb{R}^{1 \times D} \qquad (7)$$
where $x_{space}$ represents the concatenated spatial features;
a temporal attention unit including one temporal attention network, wherein the temporal attention network includes L2 layers of second multi-head attention networks MSA, L2 layers of second multilayer perceptrons, and one second normalization layer, the L2 second multi-head attention networks MSA and the L2 second multilayer perceptrons are connected alternately in sequence, and L2 is any positive integer;
the second multi-head attention network MSA includes a plurality of self-attention modules SA and one splicing layer; the self-attention module SA converts the normalized input vector into three different matrices, a query matrix $Q_{2j}$, a key matrix $K_{2j}$, and a value matrix $V_{2j}$, according to equation (8),
$$Q_{2j} = X W_{2j}^{Q}, \qquad K_{2j} = X W_{2j}^{K}, \qquad V_{2j} = X W_{2j}^{V} \qquad (8)$$
and generates the attention function between the vectors of the input from the three matrices $Q_{2j}$, $K_{2j}$, and $V_{2j}$ according to equation (9),
$$head_{2j} = \mathrm{Attention}(Q_{2j}, K_{2j}, V_{2j}) = \mathrm{softmax}\!\left(\frac{Q_{2j} K_{2j}^{T}}{\sqrt{d_k}}\right) V_{2j} \qquad (9)$$
where j is the index of the self-attention module SA in the temporal attention unit;
the splicing layer concatenates the attention functions output by the self-attention modules SA according to equation (10) to obtain the final temporal attention function,
$$\mathrm{MSA}() = \mathrm{Concat}(head_{21}, \dots, head_{2j}, \dots)\, W_{2}^{O} \qquad (10)$$
the sum of the final temporal attention function and the input vector is taken as the input vector of the second multilayer perceptron of the next layer according to equation (11),
$$x'_{l} = \mathrm{MSA}(\mathrm{LN}(x_{l})) + x_{l}, \qquad l = 1 \dots L2 \qquad (11)$$
where LN represents the normalization method, $x'_{l}$ represents the input vector of the second multilayer perceptron, MSA() represents the output of the second multi-head attention network, and $x_{l}$ represents the input vector of the l-th layer second multi-head attention network;
the second multilayer perceptron encodes the normalized input vector and adds the result to that input vector according to equation (12), and the sum is taken as the input vector of the second multi-head attention network MSA of the next layer,
$$x_{l} = \mathrm{MLP}(\mathrm{LN}(x'_{l-1})) + x'_{l-1}, \qquad l = 2 \dots L2 \qquad (12)$$
where MLP() represents the output of the second multilayer perceptron and $x'_{l-1}$ represents the input vector of the (l-1)-th layer second multilayer perceptron; and
the input vector of the first-layer second multi-head attention network MSA is the embedding layer vector output by the second embedding layer network unit, and the second normalization layer is used to normalize the first-dimension vector of the vector obtained by adding the output vector of the final-layer second multilayer perceptron to its input vector, to obtain a vector $x_{time}$ having spatial features and temporal features; and
a classification layer unit including a classification layer W, which is used to obtain a classification result Prob from the vector having spatial features and temporal features according to equation (13),
$$Prob = W(x_{time}^{T}) \qquad (13)$$
where $x_{time}^{T}$ is the vector having the spatial features and temporal features, $Prob \in \mathbb{R}^{C}$ represents the probability distribution over the classes, $\mathbb{R}$ represents the set of all real numbers, and C represents the total number of classes.

A method for constructing a multi-temporal CT image classification system based on a spatio-temporal attention model, the method comprising:
collecting samples to construct a dataset, wherein each sample of the dataset includes CT images of s time phases of one patient, s is 2 or more, and the CT images of the s time phases include at least two of a plain scan phase CT image, an arterial phase CT image in contrast-enhanced CT, a portal venous phase CT image in contrast-enhanced CT, and a delayed phase CT image in contrast-enhanced CT; and
constructing the multi-temporal CT image classification system based on the spatio-temporal attention model according to claim 1, and training it using each sample in the dataset as the input of the system, with the goal of minimizing the error between the classification result output by the system and the classification label, to obtain the multi-temporal CT image classification system based on the spatio-temporal attention model.