JP7268530B2

JP7268530B2 - Mass spectrometry data processing method, mass spectrometry data processing system, and mass spectrometry data processing program

Info

Publication number: JP7268530B2
Application number: JP2019145984A
Authority: JP
Inventors: 達樹大久保; 賢志山田
Original assignee: Shimadzu Corp
Current assignee: Shimadzu Corp
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2023-05-08
Anticipated expiration: 2039-08-08
Also published as: JP2021025953A

Description

The present invention relates to a mass spectrometry data processing method, a mass spectrometry data processing system, and a mass spectrometry data processing program.

Matrix-assisted laser desorption/ionization (MALDI) is a well-known ionization method for mass spectrometers. To analyze samples that absorb laser light poorly, or samples such as proteins that are easily damaged by laser light, MALDI premixes the sample with a matrix, that is, a substance that readily absorbs laser light and is easily ionized, and then ionizes the sample by irradiating the mixture with laser light. A mass spectrometer equipped with a MALDI ion source (hereinafter, MALDI-MS) can analyze high-molecular-weight compounds with little fragmentation and is also well suited to trace analysis, so it is widely used in fields such as the life sciences.

In recent years, attempts have been made to discriminate unknown samples by applying machine learning to mass spectra obtained by MALDI-MS (see, for example, Patent Document 1). Machine learning is a useful technique for finding regularities in large amounts of diverse data and using them to predict, discriminate, or regress data; it is broadly divided into supervised learning and unsupervised learning. For example, to discriminate the type of a microorganism (e.g., its species, subspecies, strain, or type) from the result of analyzing it by MALDI-MS, a large number of mass spectrometry data are collected in advance for various microorganisms, and supervised learning is performed with those data as learning data (also called teacher data or training data) to construct a discriminant model for discriminating the type of an unknown microorganism.

JP 2018-155522 A; JP 2010-205460 A

However, constructing a highly accurate discriminant model requires collecting a large amount of learning data. This in turn requires performing mass spectrometry many times, which takes considerable labor and cost.

The present invention has been made in view of the above. Its object is to provide a mass spectrometry data processing method, a mass spectrometry data processing system, and a mass spectrometry data processing program that can obtain the large amount of learning data needed to construct a highly accurate discriminant model from a small number of mass spectrometry runs.

A mass spectrometry data processing method according to the present invention, made to solve the above problem, comprises:
irradiating a known sample with laser light multiple times in a mass spectrometer that ionizes samples by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between m/z and intensity of the ions generated from the known sample in the respective laser light irradiation;
sorting the plurality of profile data into a plurality of groups such that each group contains one or more profile data;
for each of the plurality of groups, generating a peak list that lists the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
generating a discriminant model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

A mass spectrometry data processing system according to the present invention, made to solve the above problem, comprises:
a profile data acquisition unit that acquires a plurality of profile data obtained by irradiating a known sample with laser light multiple times in a mass spectrometer that ionizes samples by laser ionization, each profile data being a spectrum showing the relationship between m/z and intensity of the ions generated from the known sample in the respective laser light irradiation;
a grouping unit that sorts the plurality of profile data into a plurality of groups such that each group contains one or more profile data;
a peak list generation unit that, for each of the plurality of groups, generates a peak list that lists the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
a discriminant model generation unit that generates a discriminant model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

A mass spectrometry data processing program according to the present invention, made to solve the above problem, causes a computer to function as each unit of the mass spectrometry data processing system.

In the mass spectrometry data processing method, system, and program according to the present invention, the profile data obtained from many laser light irradiations of one sample are divided into a plurality of groups, and one peak list is generated for each group. This increases the number of peak lists obtained from mass spectrometry of a single sample. As a result, the large amount of learning data needed to construct a highly accurate discriminant model can be obtained from a small number of mass spectrometry runs.

FIG. 1 is a block diagram showing the main configuration of a mass spectrometry data processing system according to one embodiment of the present invention.
FIG. 2 is a flowchart showing the procedure for processing mass spectrometry data in the same embodiment.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the main configuration of a mass spectrometry data processing system 10 according to one embodiment of the present invention.

The system 10 processes mass spectrometry data obtained by analyzing a sample with a MALDI-MS (not shown). It includes a learning data generation unit 20, a discriminant model generation unit 30, a discrimination unit 40, a data storage unit 50, an input unit 60 including a keyboard and a pointing device such as a mouse, and a display unit 70 including a display device such as a liquid crystal display.

The learning data generation unit 20 generates learning data for machine learning by applying predetermined processing to mass spectrometry data obtained by analyzing a known sample (for example, a microorganism whose strain is known) by MALDI-MS. The learning data generation unit 20 includes a profile data acquisition unit 21, a grouping unit 22, and a peak list generation unit 23.

The discriminant model generation unit 30 uses the plurality of learning data generated by the learning data generation unit 20 to generate a discriminant model for discriminating an unknown sample (for example, a microorganism whose strain is unknown).

The discrimination unit 40 discriminates the type of an unknown sample (for example, the strain to which the microorganism belongs) by applying mass spectrometry data obtained by analyzing the unknown sample by MALDI-MS to the discriminant model. The discrimination unit 40 includes an unknown sample data acquisition unit 41 and a discrimination execution unit 42.

The learning data generation unit 20, the discriminant model generation unit 30, and the discrimination unit 40 are implemented on a computer (a personal computer or a higher-performance computer); their functions are realized by running dedicated data processing software preinstalled on that computer. The data storage unit 50 may be a storage device built into or directly connected to the computer, or it may reside on another computer system that the computer can access via the Internet or the like, that is, storage provided through cloud computing.

The system 10 according to this embodiment may also distribute the functions of the learning data generation unit 20, the discriminant model generation unit 30, and the discrimination unit 40 across multiple computers. For example, the functions of the learning data generation unit 20 and the discriminant model generation unit 30 may be assigned to one computer and the function of the discrimination unit 40 to another.

Next, the characteristic features of the processing in the system 10 according to this embodiment are described.

In MALDI-MS, the process of generating ions by laser light irradiation and then separating and detecting the generated ions is generally repeated many times (for example, 120 times) for one sample, producing many profile data (see, for example, Patent Document 2). Profile data is a data form corresponding to the raw data of the mass spectrometer: the waveform of the detection signal output continuously by the ion detector of the mass spectrometer, with time (or m/z) on the horizontal axis and ion intensity on the vertical axis.

In the conventional data processing method, all of the profile data obtained from the many laser light irradiations of one sample are integrated (summed). Then, for convenience in subsequent data processing, the peaks in the waveform of the integrated profile data are detected (peak detection processing), and the data are converted into a list (peak list) giving, for each detected peak, the m/z value of its centroid position (or center position) and its area. In other words, the conventional method generates one peak list as the result of one mass spectrometry run on one sample.

In contrast, the mass spectrometry data processing method according to this embodiment divides the profile data obtained from the many laser light irradiations of one sample into a plurality of groups and generates one peak list per group. That is, multiple peak lists are generated as the result of one mass spectrometry run on one sample. This makes it possible to obtain more learning data without increasing the number of mass spectrometry runs.

This processing is described in detail below with reference to the flowchart of FIG. 2. It is assumed here that mass spectrometry by MALDI-MS has already been performed on a plurality of known samples (for example, microorganisms whose strains are known) and that, as the mass spectrometry result for each known sample, N profile data (N is an integer of 2 or more) are stored in the data storage unit 50 in association with information on the type of that known sample (for example, the strain of the known microorganism). Hereinafter, the information on the type of the known sample is called the "correct label".

First, the user performs a predetermined operation on the input unit 60 to specify the mass spectrometry results of the plurality of known samples stored in the data storage unit 50 and to instruct generation of learning data based on them; the learning data generation unit 20 then generates the learning data. Specifically, the profile data acquisition unit 21 of the learning data generation unit 20 first acquires from the data storage unit 50 the mass spectrometry result for one of the known samples specified by the user, that is, the N profile data for that sample (step S11).

Next, the grouping unit 22 allocates the N profile data to a predetermined number M of groups (M is an integer equal to or less than N) according to a predetermined criterion, for example in the order in which the profile data were generated (step S12). Each of the M groups must contain at least one profile data, and the number of profile data allocated to each group should be as even as possible. The number of groups M may be a value stored in the system 10 in advance, may be freely set by the user, or may be determined automatically by the system 10 based on, for example, the number of profile data N or the required discrimination accuracy.
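The in-order allocation of step S12 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_in_order` and the choice to give the first groups the extra items are assumptions made here for illustration.

```python
def split_in_order(profiles, m):
    """Split `profiles` into `m` contiguous groups, in generation order.

    Group sizes differ by at most one, and every group is non-empty.
    (Hypothetical helper; the name and tie-breaking rule are assumptions.)
    """
    n = len(profiles)
    if not 1 <= m <= n:
        raise ValueError("M must satisfy 1 <= M <= N")
    base, extra = divmod(n, m)  # the first `extra` groups get one more item
    groups, start = [], 0
    for g in range(m):
        size = base + (1 if g < extra else 0)
        groups.append(profiles[start:start + size])
        start += size
    return groups


# Ten profile indices split into three groups of sizes 4, 3, 3.
groups = split_in_order(list(range(10)), 3)
```

Using `divmod` keeps the group sizes within one of each other, which matches the requirement that the allocation be as even as possible.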

In MALDI ionization, repeatedly irradiating the same position on the sample with laser light gradually stops producing ions. The sample or the laser beam is therefore usually moved so that the laser light strikes several different, closely spaced positions within the measurement region on the sample, and profile data are acquired at each of these positions (measurement points). Because the concentration of sample components varies across the measurement region, the amount of ions generated differs from one measurement point to another. In step S12 it is therefore desirable to allocate the N profile data to the M groups at random. This allows appropriate learning data to be generated without being affected by the uneven concentration of sample components within the measurement region.
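A random allocation of this kind might look like the sketch below. The function name `split_randomly`, the `seed` parameter, and the shuffle-then-stride scheme are illustrative assumptions; the source only states that the allocation is random.

```python
import random


def split_randomly(profiles, m, seed=None):
    """Randomly allocate `profiles` to `m` non-overlapping groups.

    The list is shuffled and then dealt out round-robin, so group sizes
    differ by at most one. (Hypothetical helper; not from the source.)
    """
    n = len(profiles)
    if not 1 <= m <= n:
        raise ValueError("M must satisfy 1 <= M <= N")
    rng = random.Random(seed)      # a fixed seed makes the split reproducible
    shuffled = list(profiles)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return [shuffled[g::m] for g in range(m)]


# 120 profile indices dealt at random into 4 groups of 30.
random_groups = split_randomly(list(range(120)), 4, seed=0)
```

Because neighbouring measurement points end up in different groups, local variations in sample concentration are spread across all groups rather than concentrated in one.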

In step S12, some or all of the N profile data may also be allocated to more than one group. In this way, even when the number of profile data N is small or the number of groups M is large, the number of profile data allocated to each group can be kept high, which prevents the S/N ratio from falling.
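One possible way to realize such an overlapping allocation is sketched below. The source does not specify how the overlap is chosen; the `copies` parameter and the cyclic assignment scheme are assumptions introduced here for illustration.

```python
import random


def split_with_overlap(profiles, m, copies, seed=None):
    """Assign every profile to `copies` distinct groups out of `m`.

    Profiles are shuffled, then the profile at position i goes to groups
    i, i+1, ..., i+copies-1 (mod m). (Hypothetical scheme, not from the source.)
    """
    if not 1 <= m <= len(profiles):
        raise ValueError("M must satisfy 1 <= M <= N")
    if not 1 <= copies <= m:
        raise ValueError("copies must satisfy 1 <= copies <= M")
    rng = random.Random(seed)
    order = list(profiles)
    rng.shuffle(order)
    groups = [[] for _ in range(m)]
    for i, profile in enumerate(order):
        for c in range(copies):
            groups[(i + c) % m].append(profile)
    return groups


# 12 profiles, 6 groups, each profile reused in 3 groups:
# every group holds 6 profiles instead of 2.
overlap_groups = split_with_overlap(list(range(12)), 6, copies=3, seed=0)
```

Note that peak lists built from overlapping groups share raw data, so they are not statistically independent; this matters when such peak lists are later split between training and validation sets.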

Next, the peak list generation unit 23 generates a peak list for each of the M groups created in step S12 (step S13). Specifically, the peak list generation unit 23 checks the number of profile data in each group and, for a group containing multiple profile data, integrates (sums) them to produce integrated profile data. It then applies noise removal (background removal and smoothing) to the integrated profile data and detects peaks with a predetermined peak detection algorithm. For each detected peak, it determines the centroid position (or center position) and the area, and generates a peak list giving the m/z of the centroid position (or center position) and the area (corresponding to intensity) of each peak. For a group containing only one profile data, the integration is skipped, and noise removal (background removal and smoothing) and peak detection are applied directly to that single profile data to generate the peak list. The M peak lists obtained in this way (as many as there are groups) are stored in the data storage unit 50 in association with the correct label.
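The flow of step S13 can be illustrated with the simplified sketch below. The source leaves the noise removal and peak detection algorithms unspecified ("a predetermined peak detection algorithm"), so everything here is a stand-in chosen for brevity: background removal subtracts the global minimum, smoothing is a moving average, and a peak is any contiguous run of points above a threshold. All function names and the `threshold` parameter are assumptions.

```python
def integrate(profiles):
    """Sum intensity profiles point by point (all share one m/z axis)."""
    return [sum(column) for column in zip(*profiles)]


def remove_background(y):
    """Crude background removal: subtract the global minimum."""
    floor = min(y)
    return [v - floor for v in y]


def smooth(y, half_width=1):
    """Moving-average smoothing with a window of 2*half_width + 1 points."""
    out = []
    for i in range(len(y)):
        window = y[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def detect_peaks(mz, y, threshold):
    """Return (centroid m/z, area) for each contiguous run above threshold.

    The area is the plain sum of intensities, which is proportional to the
    true area when the m/z axis is evenly spaced.
    """
    peaks, i, n = [], 0, len(y)
    while i < n:
        if y[i] > threshold:
            j = i
            while j < n and y[j] > threshold:
                j += 1
            area = sum(y[i:j])
            centroid = sum(m * v for m, v in zip(mz[i:j], y[i:j])) / area
            peaks.append((centroid, area))
            i = j
        else:
            i += 1
    return peaks


def make_peak_list(mz, profiles, threshold):
    """Integrate a group (if it has more than one profile), denoise, pick peaks."""
    y = integrate(profiles) if len(profiles) > 1 else list(profiles[0])
    y = smooth(remove_background(y))
    return detect_peaks(mz, y, threshold)


# Two identical toy profiles with one triangular peak centred at m/z 105.
mz_axis = [100.0 + i for i in range(11)]
profile = [1, 1, 1, 1, 3, 5, 3, 1, 1, 1, 1]
peak_list = make_peak_list(mz_axis, [profile, profile], threshold=2.0)
```

A production pipeline would replace each stand-in with a method suited to the instrument, but the order of operations (integrate, denoise, detect, then report centroid and area) follows the description above.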

Steps S11 to S13 are then carried out for all of the known samples specified by the user, generating M peak lists for each known sample. For simplicity, the description here assumes that N profile data are acquired for every known sample, divided into M groups, and converted into one peak list per group; however, the number of profile data N and the number of groups (and peak lists) M may differ from sample to sample.

Next, when the user operates the input unit 60 to instruct generation of a discriminant model using the peak lists generated for the known samples as learning data, the discriminant model generation unit 30 generates the discriminant model (step S14). Specifically, the discriminant model generation unit 30 reads from the data storage unit 50 the M peak lists generated for each known sample together with the correct label associated with each peak list, and uses them as learning data to generate a discriminant model by a predetermined machine learning method. The generated discriminant model is stored in the data storage unit 50. In this embodiment, a peak list is multidimensional data in which the m/z of each peak forms one dimension, and the discriminant model is, for example, a discriminant analysis function expressing the relationship between the multidimensional input and the output.
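Because different peak lists generally contain different numbers of peaks, they need a common set of dimensions before a learner can use them. The source does not say how this alignment is done; the sketch below assumes a simple fixed-width m/z binning, and the function name `to_feature_vector` and its parameters are hypothetical.

```python
def to_feature_vector(peak_list, mz_min, mz_max, bin_width):
    """Map a peak list of (m/z, intensity) pairs onto fixed m/z bins.

    Each bin is one dimension of the feature vector; intensities of peaks
    falling in the same bin are added. Assumes (mz_max - mz_min) is a
    multiple of bin_width. Peaks outside [mz_min, mz_max) are ignored.
    """
    n_bins = round((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peak_list:
        if mz_min <= mz < mz_max:
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[index] += intensity
    return vector


# Four peaks, two bins covering m/z 2000-2005 and 2005-2010.
peaks = [(2001.0, 10.0), (2007.5, 4.0), (2008.0, 1.0), (2020.0, 9.0)]
features = to_feature_vector(peaks, 2000.0, 2010.0, 5.0)
```

The bin width is a trade-off: bins that are too wide merge distinct peaks, while bins that are too narrow split the same peak across neighbouring dimensions when its measured m/z shifts slightly between groups.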

The machine learning method used to generate the discriminant model in step S14 is not particularly limited as long as it performs supervised learning; examples include support vector machines, random forests, neural networks, linear discriminant methods, and nonlinear discriminant methods. The method is preferably chosen to suit the type and nature of the data being analyzed.

After that, the unknown sample to be discriminated (for example, a microorganism whose strain is unknown) is analyzed by MALDI-MS, the resulting peak list is stored in the data storage unit 50, and the user instructs, via the input unit 60, discrimination of the unknown sample by the discriminant model. The peak list of the unknown sample is generated in advance by integrating all of the profile data obtained by analyzing the unknown sample by MALDI-MS and applying background removal, smoothing, and peak detection to the integrated profile data. In the discrimination unit 40, on receiving the user's instruction, the unknown sample data acquisition unit 41 reads the peak list of the unknown sample from the data storage unit 50 (step S15), and the discrimination execution unit 42 determines the type of the unknown sample (for example, the strain to which the unknown microorganism belongs) from the output value obtained by inputting that peak list into the discriminant model (step S16).

The discrimination result from the discrimination unit 40 is stored in the data storage unit 50 and is also displayed on the screen of the display unit 70 to present it to the user (step S17).

The mass spectrometry data processing system and mass spectrometry data processing method according to this embodiment are not limited to generating discriminant models for discriminating microorganisms (determining the species, subspecies, strain, type, or the like to which an unknown microorganism belongs). They can also be applied to generating discriminant models for discriminating various other samples, for example discriminating oil types or discriminating diseases (distinguishing biological samples from people with a given disease, such as cancer, from biological samples from people without that disease). In addition, the profile data used to generate learning data and the peak list of the unknown sample to be discriminated are not limited to those obtained by MALDI-MS analysis; they may be obtained with a mass spectrometer that ionizes samples by another laser ionization method, such as surface-assisted laser desorption/ionization (SALDI).

The effect of the present invention was verified through the performance of discriminating between two kinds of microorganisms (group A and group B). Group A is Escherichia coli, and group B is a microorganism of the genus Achromobacter (Achromobacter sp.).

First, the group A sample and the group B sample were each measured four times by MALDI-MS. In each measurement, the sample was irradiated with the laser 120 times to acquire 120 profile data. As the example, peak lists were generated by processing these profile data with the method of the present invention, and a discriminant model was generated from those peak lists. As the comparative example, peak lists were generated by processing the same profile data with the conventional method, and a discriminant model was generated from those peak lists.

Specifically, in the example, the 120 profile data obtained in one measurement were randomly divided into four groups when generating the discriminant model. The 30 profile data in each group were integrated, and noise removal and peak detection were applied to the integrated profile data to generate a single peak list per group. The resulting 32 peak lists (2 groups of microorganisms × 4 measurements × 4 groups) were used as learning data to generate a discriminant model for distinguishing group A from group B.

In the comparative example, by contrast, all 120 profile data obtained in one measurement were integrated when generating the discriminant model, and noise removal and peak detection were applied to the integrated profile data to generate a single peak list per measurement. The resulting 8 peak lists (2 groups of microorganisms × 4 measurements) were used as learning data to generate a discriminant model for distinguishing group A from group B.

In both the example and the comparative example, the discriminant model was generated with the statistical analysis software eMSTAT Solution (registered trademark), using an SVM (support vector machine) as the machine learning algorithm (the same applies below).

When the discrimination performance of the discriminant models of the example and the comparative example was verified, both gave 100% correct outputs (group A or group B) for the test data, but the cross-validation error (estimated error) was 13% for the comparative model versus 0% for the model of the example. The leave-one-out method was used for the cross-validation (the same applies to Examples 2 and 3 described later). That is, one data item was taken out of the learning data of each group as test data, and machine learning was performed on the remaining data. This was repeated until every data item had served as test data once, and the results were averaged to obtain the estimated error. This confirmed that the present invention yields a more accurate discriminant model than the conventional method without increasing the number of measurements.
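The leave-one-out estimate can be sketched as follows. This is an illustration only: the source used an SVM in eMSTAT Solution, whereas this sketch swaps in a small nearest-centroid classifier so that it runs without external libraries, and it holds out a single sample per round (the standard form of leave-one-out). The feature vectors are made-up toy values, and all function names are assumptions.

```python
def nearest_centroid_fit(X, y):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for vector, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for k, value in enumerate(vector):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(model, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def leave_one_out_error(X, y):
    """Hold out each sample once, train on the rest, and average the misses."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        if nearest_centroid_predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


# Toy, well-separated feature vectors for two classes (not measured data).
X = [[1.0, 0.0], [0.9, 0.1], [1.1, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.1]]
y = ["A", "A", "A", "B", "B", "B"]
error = leave_one_out_error(X, y)
```

As a point of arithmetic, with only 8 held-out items a single miss already corresponds to an error of 12.5%, so small training sets make the estimate coarse.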

As a further example (Example 2), the 120 profile data from one of the four measurements of each of the group A and group B samples were divided into 120 groups. Noise removal and peak detection were applied to the single profile data in each group to generate a peak list. The resulting 240 peak lists (2 groups of microorganisms × 1 measurement × 120 groups) were used as learning data to generate a discriminant model for distinguishing group A from group B. Profile data from only one measurement per group of microorganisms were used here to keep the amount of data, and hence the processing time, from becoming excessive.

As a further example (Example 3), for each of the four measurements of the group A and group B samples, the 120 profile data obtained in that measurement were randomly divided into two groups. The 60 profile data in each group were then integrated, and noise removal processing and peak detection processing were applied to the resulting integrated profile data to generate a peak list. The resulting 16 peak lists (2 sample groups × 4 measurements × 2 groups) were used as learning data to generate a discriminant model for distinguishing group A from group B.
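The random grouping and integration used in Examples 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed, and the representation of a profile as a list of intensities on a common m/z axis are hypothetical and are not taken from the text.

```python
import random


def split_into_groups(n_profiles, n_groups, seed=0):
    """Randomly distribute profile indices over n_groups disjoint groups."""
    indices = list(range(n_profiles))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices out round-robin so group sizes differ by at most 1.
    return [indices[g::n_groups] for g in range(n_groups)]


def integrate(profiles, group):
    """Sum the intensities of the profiles in one group, point by point."""
    n_points = len(profiles[0])
    return [sum(profiles[i][k] for i in group) for k in range(n_points)]


# 120 synthetic profiles with two intensity points each (placeholder data).
profiles = [[float(i), 1.0] for i in range(120)]

# Example 3 style: two groups of 60 profiles, each integrated into one profile.
groups = split_into_groups(len(profiles), n_groups=2)
integrated = [integrate(profiles, group) for group in groups]

# Example 2 style: 120 groups holding one profile each.
singletons = split_into_groups(len(profiles), n_groups=120)
```

With `n_groups=2` each group holds 60 profiles, and with `n_groups=120` each group holds a single profile, matching the group sizes of Examples 3 and 2 respectively.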

Verification of the discrimination performance of the discriminant models obtained in Examples 2 and 3 confirmed that, in both cases, a model with an estimated error of 0% could be generated and the test data were classified with 100% accuracy.

[Various aspects]
Those skilled in the art will understand that the exemplary embodiments described above are specific examples of the following aspects.

(Item 1) A mass spectrometry data processing method according to one aspect comprises:
irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample at one of the plurality of laser light irradiations;
allocating the plurality of profile data to a plurality of groups such that each group contains one or more profile data;
generating, for each of the plurality of groups, a peak list describing the m/z and the intensity of peaks derived from the known sample, based on the one or more profile data contained in the group; and
generating a discriminant model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.
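The peak list generation and labeling steps of Item 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the simple thresholded local-maximum peak picker, the function names, and the threshold value are hypothetical, noise removal is omitted, and the text does not prescribe any particular peak detection algorithm here.

```python
def pick_peaks(mz, intensity, threshold):
    """Return (m/z, intensity) pairs for local maxima at or above a threshold.

    A deliberately simple peak picker used only for illustration.
    """
    peaks = []
    for k in range(1, len(intensity) - 1):
        y = intensity[k]
        if y >= threshold and y > intensity[k - 1] and y >= intensity[k + 1]:
            peaks.append((mz[k], y))
    return peaks


def build_learning_data(mz, group_profiles, label, threshold):
    """Pair the peak list of each group's profile with the known sample type."""
    return [(pick_peaks(mz, profile, threshold), label) for profile in group_profiles]


# Placeholder data: one m/z axis and two per-group (e.g. integrated) profiles.
mz_axis = [2000.0 + k for k in range(9)]
group_profiles = [
    [0, 1, 5, 1, 0, 2, 9, 2, 0],
    [0, 2, 6, 1, 0, 1, 8, 3, 0],
]
learning_data = build_learning_data(mz_axis, group_profiles, label="A", threshold=3)
```

Each element of `learning_data` is one peak list together with the type of the known sample, so a single measurement contributes as many learning data items as there are groups.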

(Item 2) In the mass spectrometry data processing method according to Item 1,
the plurality of profile data may be allocated to the plurality of groups at random.

(Item 3) In the mass spectrometry data processing method according to Item 1 or 2,
when the plurality of profile data are allocated to the plurality of groups, at least one of the plurality of profile data may be allocated to two or more of the plurality of groups in duplicate.
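The overlapping allocation of Item 3 can be sketched as follows. This is a minimal illustration under stated assumptions: drawing each group independently, without replacement within a group, is one hypothetical way to let a profile belong to several groups, and the function name, group size, and seed are not taken from the text.

```python
import random


def overlapping_groups(n_profiles, n_groups, group_size, seed=0):
    """Draw each group independently so that profiles may recur across groups.

    Within one group every profile index appears at most once.
    """
    rng = random.Random(seed)
    return [rng.sample(range(n_profiles), group_size) for _ in range(n_groups)]


# 10 profiles spread over 8 groups of 6: 48 assignments in total,
# so at least one profile necessarily belongs to two or more groups.
groups = overlapping_groups(n_profiles=10, n_groups=8, group_size=6)
```

Because the group size is chosen independently of the number of groups, each group can hold many profiles even when few profiles are available or many groups are wanted.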

(Item 4) In the mass spectrometry data processing method according to any one of Items 1 to 3,
an unknown sample may further be discriminated by applying, to the discriminant model, a peak list generated from profile data obtained by mass spectrometry of the unknown sample.
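Applying the peak list of an unknown sample to a discriminant model, as in Item 4, can be sketched as follows. This is a minimal illustration under stated assumptions: binning the peak list into a fixed-length vector and classifying by nearest centroid are hypothetical choices, as are the function names, the m/z range, the bin width, and the centroid values. The text does not specify the model type at this point.

```python
import math


def peaks_to_vector(peaks, mz_min, mz_max, bin_width):
    """Convert a peak list into a fixed-length vector of summed bin intensities."""
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[index] += intensity
    return vector


def discriminate(vector, centroids):
    """Return the label whose centroid is closest to the vector."""
    def squared_distance(label):
        return sum((x - y) ** 2 for x, y in zip(vector, centroids[label]))
    return min(centroids, key=squared_distance)


# Hypothetical model: one centroid per sample type, learned beforehand.
model = {"A": [10.0, 0.0, 0.0], "B": [0.0, 0.0, 10.0]}

# Peak list of an unknown sample (placeholder values).
unknown_peaks = [(2100.0, 8.0), (4050.0, 1.0)]
unknown_vector = peaks_to_vector(unknown_peaks, mz_min=2000.0, mz_max=5000.0, bin_width=1000.0)
result = discriminate(unknown_vector, model)
```

The same binning must be applied to the learning data and to the unknown sample so that the vectors are comparable.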

(Item 5) A mass spectrometry data processing system according to one aspect comprises:
a profile data acquisition unit that acquires a plurality of profile data obtained by irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, each profile data being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample at one of the plurality of laser light irradiations;
a grouping unit that allocates the plurality of profile data to a plurality of groups such that each group contains one or more profile data;
a peak list generation unit that generates, for each of the plurality of groups, a peak list describing the m/z and the intensity of peaks derived from the known sample, based on the one or more profile data contained in the group; and
a discriminant model generation unit that generates a discriminant model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

(Item 6) In the mass spectrometry data processing system according to Item 5,
the grouping unit may allocate the plurality of profile data to the plurality of groups at random.

(Item 7) In the mass spectrometry data processing system according to Item 5 or 6,
the grouping unit may allocate at least one of the plurality of profile data to two or more of the plurality of groups in duplicate.

(Item 8) The mass spectrometry data processing system according to any one of Items 5 to 7 may further comprise
a discrimination unit that discriminates an unknown sample by applying, to the discriminant model, a peak list generated from profile data obtained by mass spectrometry of the unknown sample.

(Item 9) A mass spectrometry data processing program according to one aspect causes a computer to function as each unit of the mass spectrometry data processing system according to any one of Items 5 to 8.

According to the mass spectrometry data processing method of Item 1, the mass spectrometry data processing system of Item 5, or the mass spectrometry data processing program of Item 9, the large amount of learning data needed to construct a highly accurate discriminant model can be obtained with a small number of mass spectrometry measurements.

According to the mass spectrometry data processing method of Item 2 or the mass spectrometry data processing system of Item 6, appropriate learning data can be generated without being affected by unevenness in the concentration of sample components within the measurement region on the sample.

According to the mass spectrometry data processing method of Item 3 or the mass spectrometry data processing system of Item 7, the number of profile data allocated to each group can be kept large even when the number of profile data is small or the number of groups is large, which prevents a decrease in S/N.

10 ... Mass spectrometry data processing system
20 ... Learning data generation unit
21 ... Profile data acquisition unit
22 ... Grouping unit
23 ... Peak list generation unit
30 ... Discriminant model generation unit
40 ... Discrimination unit
41 ... Unknown sample data acquisition unit
42 ... Discrimination execution unit
50 ... Data storage unit
60 ... Input unit
70 ... Display unit

Claims

1. A mass spectrometry data processing method comprising:
a profile data acquisition step of irradiating one known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between the m/z and the intensity of ions generated from the one known sample at one of the plurality of laser light irradiations;
a grouping step of allocating the plurality of profile data to a plurality of groups such that each group includes one or more profile data;
a peak list generation step of generating, for each of the plurality of groups, a peak list describing the m/z and the intensity of peaks derived from the one known sample, based on the one or more profile data included in the group;
a learning data generation step of generating learning data by associating information about the type of the one known sample with each of the plurality of peak lists related to the one known sample generated in the peak list generation step; and
a discriminant model generation step of generating a discriminant model for discriminating an unknown sample, using a plurality of learning data obtained by executing the profile data acquisition step, the grouping step, the peak list generation step, and the learning data generation step for each of a plurality of known samples.

2. The mass spectrometry data processing method according to claim 1, wherein the plurality of profile data are randomly allocated to the plurality of groups.

3. The mass spectrometry data processing method according to claim 1, wherein, when the plurality of profile data are allocated to the plurality of groups, at least one of the plurality of profile data is allocated to two or more of the plurality of groups in duplicate.

4. The mass spectrometry data processing method, wherein, further, an unknown sample is discriminated by applying, to the discriminant model, a peak list generated from profile data obtained by mass spectrometry of the unknown sample.

5. A mass spectrometry data processing system comprising:
a profile data acquisition unit that acquires a plurality of profile data obtained by irradiating one known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, each profile data being a spectrum showing the relationship between the m/z and the intensity of ions generated from the one known sample at one of the plurality of laser light irradiations;
a grouping unit that allocates the plurality of profile data to a plurality of groups such that each group includes one or more profile data;
a peak list generation unit that generates, for each of the plurality of groups, a peak list describing the m/z and the intensity of peaks derived from the one known sample, based on the one or more profile data included in the group;
a learning data generation unit that generates learning data by associating information about the type of the one known sample with each of the plurality of peak lists related to the one known sample generated by the peak list generation unit; and
a discriminant model generation unit that generates a discriminant model for discriminating an unknown sample, using a plurality of learning data obtained by executing the processing of the profile data acquisition unit, the grouping unit, the peak list generation unit, and the learning data generation unit for each of a plurality of known samples.

6. The mass spectrometry data processing system according to claim 5, wherein the grouping unit randomly allocates the plurality of profile data to the plurality of groups.

7. The mass spectrometry data processing system according to claim 5 or 6, wherein the grouping unit allocates at least one of the plurality of profile data to two or more of the plurality of groups in duplicate.

8. The mass spectrometry data processing system according to any one of claims 5 to 7, further comprising a discrimination unit that discriminates an unknown sample by applying, to the discriminant model, a peak list generated from profile data obtained by mass spectrometry of the unknown sample.

9. A mass spectrometry data processing program that causes a computer to function as each unit of the mass spectrometry data processing system according to any one of claims 5 to 8.