JP2021025953A - Mass analysis data processing method, mass analysis data processing system, and mass analysis data processing program

Info

Publication number: JP2021025953A
Application number: JP2019145984A
Authority: JP
Inventors: Tatsuki Okubo (大久保達樹); Kenji Yamada (山田賢志)
Original assignee: Shimadzu Corp
Current assignee: Shimadzu Corp
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2021-02-22
Anticipated expiration: 2039-08-08
Also published as: JP7268530B2

Abstract

To acquire the large amount of learning data required to construct a highly accurate discrimination model with a small number of mass analyses. SOLUTION: A mass analysis data processing method comprises: performing laser light irradiation of a known sample multiple times in a mass analysis device that ionizes a sample by laser ionization, and acquiring a plurality of pieces of profile data, each being a spectrum showing the relation between the m/z and the intensity of ions generated from the known sample in one laser light irradiation (Step S11); distributing the plurality of pieces of profile data into a plurality of groups such that each group includes one or more pieces of profile data (Step S12); generating, for each group, a peak list describing the m/z and the intensity of each peak derived from the known sample on the basis of the one or more pieces of profile data included in the group (Step S13); and generating a discrimination model for discriminating an unknown sample, using the peak lists and information on the kind of the known sample as learning data. SELECTED DRAWING: Figure 2

Description

The present invention relates to a mass spectrometry data processing method, a mass spectrometry data processing system, and a mass spectrometry data processing program.

Matrix-assisted laser desorption/ionization (MALDI) is a well-known ionization method for mass spectrometers. In MALDI, in order to analyze a sample that does not readily absorb laser light, or a sample such as a protein that is easily damaged by laser light, a substance that readily absorbs laser light and is easily ionized is mixed with the sample in advance as a matrix, and the mixture is irradiated with laser light to ionize the sample. A mass spectrometer using a MALDI ion source (hereinafter, MALDI-MS) can analyze high-molecular-weight compounds with little fragmentation and is also suitable for trace analysis, so it is widely used in fields such as the life sciences.

In recent years, attempts have been made to discriminate unknown samples by applying machine learning to mass spectra obtained by MALDI-MS (see, for example, Patent Document 1). Machine learning is a useful technique for finding regularities in a large amount of diverse data and using them to predict, discriminate, or regress data; it is broadly divided into supervised learning and unsupervised learning. For example, to determine the type of a microorganism (such as its species, subspecies, strain, or type) from the result of analyzing it by MALDI-MS, a large amount of mass spectrometry data is collected in advance for various microorganisms, and supervised learning is performed with these data as learning data (also called teacher data or training data) to construct a discrimination model for determining the type of an unknown microorganism.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-155522
Patent Document 2: Japanese Unexamined Patent Publication No. 2010-205460

However, constructing a highly accurate discrimination model requires collecting a large amount of learning data. This in turn requires performing mass spectrometry many times, which takes considerable labor and cost.

The present invention has been made in view of the above, and its object is to provide a mass spectrometry data processing method, a mass spectrometry data processing system, and a mass spectrometry data processing program that can obtain the large amount of learning data needed to construct a highly accurate discrimination model from a small number of mass spectrometry runs.

The mass spectrometry data processing method according to the present invention, made to solve the above problem, comprises:
performing laser light irradiation multiple times on a known sample in a mass spectrometer that ionizes a sample by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in one of the multiple laser light irradiations;
distributing the plurality of profile data into a plurality of groups such that each group contains one or more profile data;
generating, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data contained in the group; and
generating a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

The mass spectrometry data processing system according to the present invention, made to solve the above problem, comprises:
a profile data acquisition unit that acquires a plurality of profile data obtained by performing laser light irradiation multiple times on a known sample in a mass spectrometer that ionizes a sample by laser ionization, each profile datum being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in one of the multiple laser light irradiations;
a grouping unit that distributes the plurality of profile data into a plurality of groups such that each group contains one or more profile data;
a peak list generation unit that generates, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data contained in the group; and
a discrimination model generation unit that generates a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

The mass spectrometry data processing program according to the present invention, made to solve the above problem, causes a computer to function as each unit of the mass spectrometry data processing system.

In the mass spectrometry data processing method, system, and program according to the present invention, the profile data obtained from many laser light irradiations of one sample are divided into a plurality of groups, and one peak list is generated per group. This increases the number of peak lists obtained from mass spectrometry of one sample. As a result, the large amount of learning data needed to construct a highly accurate discrimination model can be obtained from a small number of mass spectrometry runs.

FIG. 1 is a block diagram showing the main configuration of a mass spectrometry data processing system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the procedure for processing mass spectrometry data in the same embodiment.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the main configuration of a mass spectrometry data processing system 10 according to an embodiment of the present invention.

The system 10 processes mass spectrometry data obtained by analyzing a sample with a MALDI-MS (not shown). It includes a learning data generation unit 20, a discrimination model generation unit 30, a discrimination unit 40, a data storage unit 50, an input unit 60 including a keyboard and a pointing device such as a mouse, and a display unit 70 including a display device such as a liquid crystal display.

The learning data generation unit 20 generates learning data for machine learning by applying predetermined processing to mass spectrometry data obtained by analyzing a known sample (for example, a microorganism whose strain is known) by MALDI-MS. The learning data generation unit 20 includes a profile data acquisition unit 21, a grouping unit 22, and a peak list generation unit 23.

The discrimination model generation unit 30 uses the plurality of learning data generated by the learning data generation unit 20 to generate a discrimination model for discriminating an unknown sample (for example, a microorganism whose strain is unknown).

The discrimination unit 40 determines the type of an unknown sample (for example, the strain to which the microorganism belongs) by applying mass spectrometry data obtained by analyzing the unknown sample by MALDI-MS to the discrimination model. The discrimination unit 40 includes an unknown sample data acquisition unit 41 and a discrimination execution unit 42.

The learning data generation unit 20, the discrimination model generation unit 30, and the discrimination unit 40 are implemented on a computer (a personal computer or a higher-performance computer); the function of each unit is realized by running dedicated data processing software, installed on the computer in advance, on that computer. The data storage unit 50 may be a storage device built into or directly connected to the computer, or it may be a storage device on another computer system accessible from the computer via the Internet or the like, that is, a storage device in cloud computing.

In the system 10 according to the present embodiment, the functions of the learning data generation unit 20, the discrimination model generation unit 30, and the discrimination unit 40 may also be shared among a plurality of computers. For example, the functions of the learning data generation unit 20 and the discrimination model generation unit 30 may be assigned to one computer and the function of the discrimination unit 40 to another.

Next, the characteristic features of the processing in the system 10 according to the present embodiment are described.

In MALDI-MS, the process of generating ions by laser light irradiation and then separating and detecting the generated ions is generally repeated many times (for example, 120 times) for one sample, producing many profile data (see, for example, Patent Document 2). Profile data is a form of data corresponding to the raw data of the mass spectrometer: it represents the waveform of the detection signal continuously output by the ion detector of the mass spectrometer, with time (or m/z) on the horizontal axis and ion intensity on the vertical axis.

In the conventional data processing method, all the profile data obtained from the many laser light irradiations of one sample are integrated. Then, for convenience in subsequent data processing, the peaks contained in the waveform of the integrated profile data are detected (that is, peak detection is performed), and the data are converted into a list (a peak list) giving, for each detected peak, the m/z value of its centroid (or center) position and its area value. In other words, the conventional method generates one peak list as the result of one mass spectrometry run on one sample.

In contrast, the mass spectrometry data processing method according to the present embodiment divides the profile data obtained from the many laser light irradiations of one sample into a plurality of groups and generates one peak list per group. That is, it generates a plurality of peak lists as the result of one mass spectrometry run on one sample. This makes it possible to obtain more learning data without increasing the number of mass spectrometry runs.

The details of this processing are described below with reference to the flowchart of FIG. 2. It is assumed here that mass spectrometry by MALDI-MS has been performed in advance on a plurality of known samples (for example, microorganisms whose strains are known), and that, as the mass spectrometry result for each known sample, N profile data (N is an integer of 2 or more) are stored in the data storage unit 50 in association with information on the type of that known sample (for example, information on the strain of the known microorganism). Hereinafter, the information on the type of the known sample is called the "correct label".

First, when the user performs a predetermined operation on the input unit 60 to designate the mass spectrometry results of the plurality of known samples stored in the data storage unit 50 and instructs generation of learning data based on them, the learning data generation unit 20 generates the learning data. Specifically, the profile data acquisition unit 21 of the learning data generation unit 20 first acquires from the data storage unit 50 the mass spectrometry result for one of the known samples designated by the user, that is, the N profile data for that sample (step S11).

Next, the grouping unit 22 allocates the N profile data to a predetermined number M of groups (M is an integer not greater than N) according to a predetermined criterion (for example, in the order in which the profile data were generated) (step S12). Each of the M groups is made to contain at least one profile datum, and the number of profile data allocated to each group is made as even as possible. The number of groups M may be a value stored in advance in the system 10 or may be freely set by the user. It may also be determined automatically by the system 10 based on the number N of profile data, the required discrimination accuracy, or the like.

In ionization of a sample by MALDI, ions gradually cease to be generated if laser light is repeatedly applied to the same position on the sample. Therefore, the sample or the laser light is normally moved so that the laser light strikes a plurality of different positions close to one another within a measurement region on the sample, and profile data are acquired at each of those positions (measurement points). Because the concentration of sample components varies within the measurement region, the amount of ions generated varies from one measurement point to another. It is therefore desirable in step S12 to allocate the N profile data to the M groups at random. This allows appropriate learning data to be generated without being affected by the unevenness of the sample components within the measurement region.
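The allocation in step S12 could be sketched as follows. This is only an illustrative sketch: the function name `split_into_groups`, the round-robin dealing, and the `seed` parameter are assumptions made for the example and are not specified in the text. It shuffles the profile indices (or keeps acquisition order when `shuffle=False`) and deals them out so that every group is non-empty and group sizes differ by at most one.

```python
import random


def split_into_groups(profiles, m, shuffle=True, seed=None):
    """Distribute profiles into m non-empty groups of near-equal size.

    Hypothetical helper: the dealing scheme is an assumption, not the
    method prescribed by the text.
    """
    n = len(profiles)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= len(profiles)")
    order = list(range(n))
    if shuffle:
        # Random allocation, as recommended to average out uneven
        # sample concentration across measurement points.
        random.Random(seed).shuffle(order)
    groups = [[] for _ in range(m)]
    for position, index in enumerate(order):
        # Round-robin dealing keeps group sizes within one of each other.
        groups[position % m].append(profiles[index])
    return groups


# 120 profile data split into 4 groups, as in the worked example of the text.
groups = split_into_groups(list(range(120)), 4, seed=0)
print([len(g) for g in groups])  # [30, 30, 30, 30]
```

Here the profile data are represented by plain integers; in practice each element would be one intensity array.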

In step S12, some or all of the N profile data may also be allocated to more than one group. In this way, even when the number N of profile data is small or the number M of groups is large, the number of profile data allocated to each group can be kept large, which prevents a drop in the signal-to-noise ratio (S/N).
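One possible way to realize such overlapping allocation is to let each group draw its own random subset of the N profile data, so that the same profile datum may appear in several groups. The text does not say how the overlap is chosen; the function `split_with_overlap` and its `per_group` parameter below are hypothetical names introduced for this sketch.

```python
import random


def split_with_overlap(profiles, m, per_group, seed=None):
    """Build m groups of per_group profiles each, allowing reuse across groups.

    Assumed scheme: each group is an independent random sample without
    replacement, so a profile is never repeated inside one group but may
    be shared between groups.
    """
    if not 1 <= per_group <= len(profiles):
        raise ValueError("per_group must satisfy 1 <= per_group <= len(profiles)")
    rng = random.Random(seed)
    return [rng.sample(profiles, per_group) for _ in range(m)]


# 20 profile data, 10 groups of 8: 80 slots in total, so profiles are shared.
groups = split_with_overlap(list(range(20)), m=10, per_group=8, seed=1)
print(len(groups), len(groups[0]))  # 10 8
```

Note that peak lists built from overlapping groups are not statistically independent, which is worth keeping in mind when estimating model accuracy.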

Next, the peak list generation unit 23 generates a peak list for each of the M groups produced in step S12 (step S13). Specifically, the peak list generation unit 23 checks the number of profile data in each group and, for a group containing a plurality of profile data, integrates them to generate integrated profile data. It then applies noise removal (background removal and smoothing) to the integrated profile data and performs peak detection with a predetermined peak detection algorithm. For each detected peak, it determines the centroid (or center) position and the area value, and generates a peak list giving the m/z of the centroid (or center) position of each peak and the area value of the peak (corresponding to its intensity). For a group containing only one profile datum, the integration is skipped, and noise removal (background removal and smoothing) and peak detection are applied to that single profile datum to generate the peak list. The M peak lists thus obtained (that is, as many as there are groups) are stored in the data storage unit 50 in association with the correct label.
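The flow of step S13 could be sketched as below. The text names the stages (integration, background removal, smoothing, peak detection, centroid and area) but not the algorithms, so every concrete choice here is an assumption made for illustration: a rolling-minimum baseline, a moving-average smoother, threshold-based peak regions, and an area computed as a plain sum over equally spaced points. All function names are hypothetical.

```python
def integrate(profiles):
    """Sum intensity arrays point by point (all share one m/z axis)."""
    return [sum(values) for values in zip(*profiles)]


def remove_background(intensity, half_window=5):
    """Subtract a rolling-minimum baseline (assumed stand-in algorithm)."""
    n = len(intensity)
    return [
        intensity[i] - min(intensity[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def smooth(intensity, half_window=1):
    """Moving-average smoothing (assumed stand-in algorithm)."""
    n = len(intensity)
    out = []
    for i in range(n):
        window = intensity[max(0, i - half_window):min(n, i + half_window + 1)]
        out.append(sum(window) / len(window))
    return out


def detect_peaks(mz, intensity, threshold):
    """Return (centroid m/z, area) per contiguous region above threshold >= 0."""
    peaks, i, n = [], 0, len(intensity)
    while i < n:
        if intensity[i] > threshold:
            j = i
            while j < n and intensity[j] > threshold:
                j += 1
            area = sum(intensity[i:j])  # unit point spacing assumed
            centroid = sum(m * y for m, y in zip(mz[i:j], intensity[i:j])) / area
            peaks.append((centroid, area))
            i = j
        else:
            i += 1
    return peaks


def make_peak_list(mz, group, threshold):
    """Integrate only when the group holds more than one profile datum."""
    signal = integrate(group) if len(group) > 1 else list(group[0])
    return detect_peaks(mz, smooth(remove_background(signal)), threshold)


# Toy data: flat baseline of 1 with one symmetric peak centered at m/z 110.
mz = [100.0 + i for i in range(20)]
profile = [1.0] * 20
profile[9:12] = [3.0, 5.0, 3.0]
print(make_peak_list(mz, [profile, profile], threshold=1.0))
```

A production implementation would more likely rely on an established signal-processing library; the pure-Python version is kept only so the sketch is self-contained.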

The processing of steps S11 to S13 is then performed for all of the known samples designated by the user, generating M peak lists for each known sample. For simplicity, the description here assumes that N profile data are acquired for every known sample and divided into M groups, with a peak list generated for each group; however, the number N of profile data and the number M of groups (and peak lists) may differ from sample to sample.

Next, when the user operates the input unit 60 to instruct generation of a discrimination model using the peak lists generated for each known sample as learning data, the discrimination model generation unit 30 generates the discrimination model (step S14). Specifically, the discrimination model generation unit 30 reads from the data storage unit 50 the M peak lists generated for each known sample and the correct label associated with each peak list, and uses them as learning data to generate a discrimination model by a predetermined machine learning method. The generated discrimination model is stored in the data storage unit 50. The peak list in the present embodiment is multidimensional data in which the m/z of each peak is one dimension, and the discrimination model is, for example, a discriminant analysis function representing the relationship between a multidimensional input and an output.
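To feed peak lists to a learning algorithm, each list has to be mapped onto a vector with a fixed set of dimensions. The text only states that each peak m/z is one dimension; the sketch below assumes one common way of achieving this, fixed-width m/z binning, and the function name, m/z range, and bin width are all hypothetical.

```python
import math


def peak_list_to_vector(peak_list, mz_min, mz_max, bin_width):
    """Map (m/z, intensity) pairs onto a fixed-length vector.

    Assumed scheme: each bin of width bin_width is one dimension, and
    intensities of peaks falling into the same bin are added together.
    Peaks outside [mz_min, mz_max) are ignored.
    """
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peak_list:
        if mz_min <= mz < mz_max:
            vector[int((mz - mz_min) / bin_width)] += intensity
    return vector


peaks = [(2000.4, 10.0), (2000.9, 5.0), (2003.2, 7.0)]
print(peak_list_to_vector(peaks, 2000.0, 2005.0, 1.0))
# [15.0, 0.0, 0.0, 7.0, 0.0]
```

Every peak list, whether for a known or an unknown sample, would need to be converted with the same range and bin width so that the dimensions line up.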

The machine learning method used to generate the discrimination model in step S14 is not particularly limited as long as it performs supervised learning; examples include a support vector machine, a random forest, a neural network, linear discriminant analysis, and nonlinear discriminant analysis. The method is preferably chosen according to the type and nature of the data to be analyzed.

An unknown sample to be discriminated (for example, a microorganism whose strain is unknown) is then analyzed by MALDI-MS, the resulting peak list is stored in the data storage unit 50, and the user instructs, via the input unit 60, discrimination of the unknown sample by the discrimination model. The peak list of the unknown sample is generated in advance by integrating all the profile data obtained by analyzing the unknown sample by MALDI-MS and applying background removal, smoothing, and peak detection to the integrated profile data. In the discrimination unit 40, which receives the user's instruction, the unknown sample data acquisition unit 41 reads the peak list of the unknown sample from the data storage unit 50 (step S15), and the discrimination execution unit 42 determines the type of the unknown sample (for example, the strain to which the unknown microorganism belongs) from the output value obtained by inputting the peak list of the unknown sample into the discrimination model (step S16).

The discrimination result from the discrimination unit 40 is stored in the data storage unit 50 and is also displayed on the screen of the display unit 70 and presented to the user (step S17).

The mass spectrometry data discrimination system and mass spectrometry data processing method according to the present embodiment are not limited to generating a discrimination model for discriminating microorganisms (determining the species, subspecies, strain, type, or the like to which an unknown microorganism belongs). They can also be applied to generating discrimination models for various other samples, for example for discriminating oil types or for discriminating disease (distinguishing a biological sample from a person suffering from a given disease, such as cancer, from a biological sample from a person not suffering from it). Furthermore, the profile data used to generate learning data and the peak list of the unknown sample to be discriminated are not limited to those obtained by MALDI-MS analysis; they may be obtained with a mass spectrometer that ionizes the sample by another laser ionization method, such as surface-assisted laser desorption/ionization.

The effect of the present invention was verified through the performance in discriminating between two kinds of microorganisms (group A and group B). Group A is Escherichia coli, and group B is a microorganism of the genus Achromobacter (Achromobacter sp.).

First, a sample of group A and a sample of group B were each measured four times by MALDI-MS. In each measurement, the sample was irradiated with the laser 120 times to acquire 120 profile data. As the example, the profile data were processed by the method of the present invention to generate peak lists, and a discrimination model was generated from those peak lists. As the comparative example, the profile data were processed by the conventional method to generate peak lists, and a discrimination model was generated from those peak lists.

Specifically, in the example, the 120 profile data obtained in one measurement were randomly divided into four groups when generating the discrimination model. The 30 profile data in each group were integrated, and noise removal and peak detection were applied to the resulting integrated profile data to generate a single peak list per group. The 32 peak lists thus obtained (2 groups of microorganisms × 4 measurements × 4 profile groups) were used as learning data to generate a discrimination model for discriminating between group A and group B.

In the comparative example, on the other hand, all 120 profile data obtained in one measurement were integrated when generating the discrimination model, and noise removal and peak detection were applied to the resulting integrated profile data to generate a single peak list. The 8 peak lists thus obtained (2 groups of microorganisms × 4 measurements) were used as learning data to generate a discrimination model for discriminating between group A and group B.

In both the example and the comparative example, the statistical analysis software eMSTAT Solution (registered trademark) was used to generate the discrimination model, with a support vector machine (SVM) as the machine learning algorithm (the same applies below).

When the discrimination performance of the models of the example and the comparative example was verified, both methods gave 100% correct output for the test data (whether each datum belongs to group A or group B). However, the cross-validation error (estimation error) was 13% for the comparative-example model versus 0% for the example model. The leave-one-out method was used for cross-validation (the same applies to Examples 2 and 3 below): one datum was taken from the learning data of each group as test data, and machine learning was performed on the remaining data. This was repeated until every datum had served as test data once, and the results were averaged to obtain the estimation error. These results confirm that the present invention yields a more accurate discrimination model than the conventional method without increasing the number of measurements.
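The estimation error by leave-one-out cross-validation could be computed along the following lines. This is a sketch under stated assumptions, not the procedure of the software used in the text: it holds out one item at a time (the standard leave-one-out scheme, whereas the text describes taking one item from each group per round), and it substitutes a simple nearest-centroid classifier for the SVM so that the example needs no external library. The function names and the toy feature vectors are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x (stand-in for SVM)."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)


def leave_one_out_error(xs, ys):
    """Fraction of items misclassified when each is held out once."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Toy data: 8 items, one of which (the last "A") sits inside the "B" cluster.
xs = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] + [[0.0, 1.0]] * 4
ys = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(leave_one_out_error(xs, ys))  # 0.125
```

With only 8 learning items, a single misclassification already amounts to 1/8 = 12.5%, which illustrates how coarse the error estimate is for a small learning set.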

As a further example (Example 2), the 120 profile data from one of the four measurements performed on each of the group A and group B samples were divided into 120 groups. Noise removal and peak detection were then performed on the single profile data included in each group to generate a peak list. A discrimination model for discriminating between group A and group B was generated using the resulting 240 peak lists (2 groups × 1 measurement × 120 groups) as learning data. Profile data from only one measurement per sample group was used for the discrimination model here in order to keep the amount of data, and hence the processing time, from becoming excessive.

As a further example (Example 3), for each of the four measurements performed on the group A and group B samples, the 120 profile data obtained in the measurement were randomly divided into two groups. The 60 profile data included in each group were then integrated, and noise removal and peak detection were performed on the resulting integrated profile data to generate a peak list. A discrimination model for discriminating between group A and group B was generated using the resulting 16 peak lists (2 groups × 4 measurements × 2 groups) as learning data.
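The random grouping and integration used in this example can be sketched as follows. The function names, the fixed seed, and the synthetic five-point profiles are assumptions made for illustration; the source does not specify how the random split is implemented.

```python
import random

def split_into_groups(profiles, n_groups, seed=None):
    """Randomly distribute profiles into n_groups groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(profiles)
    rng.shuffle(shuffled)
    # Deal the shuffled profiles out round-robin, one group per stride offset.
    return [shuffled[i::n_groups] for i in range(n_groups)]

def integrate(profiles):
    """Sum intensities point by point; all profiles must share one m/z axis."""
    return [sum(column) for column in zip(*profiles)]

# Hypothetical stand-in for one measurement: 120 profiles of 5 intensity points.
profiles = [[float(shot + k) for k in range(5)] for shot in range(120)]

groups = split_into_groups(profiles, n_groups=2, seed=0)
integrated = [integrate(group) for group in groups]
print([len(group) for group in groups])  # [60, 60]
```

With `n_groups=120` the same helper gives one profile per group, as in Example 2, and with `n_groups=1` it reduces to the comparative example, in which all 120 profile data are integrated into a single profile.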

When the discrimination performance of the discrimination models obtained in Examples 2 and 3 was verified, it was confirmed that in both cases a model with an estimation error of 0% could be generated and that the test data were classified with 100% accuracy.

[Various Aspects]
It will be understood by those skilled in the art that the exemplary embodiments described above are specific examples of the following aspects.

(Clause 1) A mass spectrometric data processing method according to one aspect comprises:
irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in a respective one of the laser light irradiations;
distributing the plurality of profile data into a plurality of groups such that each group includes one or more profile data;
generating, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
generating a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.
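The peak list generation step described above, and the pairing of each peak list with the sample type, can be sketched as follows. The source does not specify the peak detection algorithm, so the local-maximum rule, the `threshold` parameter, and the sample values here are assumptions made purely for illustration.

```python
def make_peak_list(mz_axis, intensities, threshold):
    """Return [(m/z, intensity)] for local maxima at or above a threshold.

    Hypothetical peak picker: a point counts as a peak if it is higher than
    its left neighbor, not lower than its right neighbor, and >= threshold.
    """
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append((mz_axis[i], y))
    return peaks

# Hypothetical integrated profile of one group from a known sample of type "A".
mz_axis = [1000.0, 1001.0, 1002.0, 1003.0, 1004.0, 1005.0, 1006.0]
integrated = [0.0, 2.0, 9.0, 1.0, 0.5, 6.0, 0.0]

# One learning-data item: the peak list plus the known sample type.
learning_item = (make_peak_list(mz_axis, integrated, threshold=3.0), "A")
print(learning_item)  # ([(1002.0, 9.0), (1005.0, 6.0)], 'A')
```

Each group yields one such item, so the number of learning-data items grows with the number of groups rather than with the number of measurements.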

(Clause 2) In the mass spectrometric data processing method according to Clause 1,
the plurality of profile data may be randomly distributed into the plurality of groups.

(Clause 3) In the mass spectrometric data processing method according to Clause 1 or 2,
when the plurality of profile data are distributed into the plurality of groups, at least one of the plurality of profile data may be distributed redundantly to two or more of the plurality of groups.
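The overlapping distribution described above can be sketched as follows. Drawing each group independently at random is only one possible way to let a profile belong to several groups; the function name and parameters are hypothetical.

```python
import random
from collections import Counter

def split_with_overlap(profiles, n_groups, group_size, seed=None):
    """Draw each group independently, so a profile may appear in several groups.

    Within a single group every profile appears at most once.
    """
    rng = random.Random(seed)
    return [rng.sample(profiles, group_size) for _ in range(n_groups)]

# Hypothetical case: only 10 profiles, but 4 groups of 5 profiles are wanted.
profile_ids = list(range(10))
groups = split_with_overlap(profile_ids, n_groups=4, group_size=5, seed=0)

# 4 x 5 = 20 slots are filled from 10 profiles, so some profile is reused.
usage = Counter(pid for group in groups for pid in group)
print(max(usage.values()) >= 2)  # True
```

Without overlap, the same 10 profiles could fill only two groups of five; with overlap, the group size stays fixed as the number of groups increases.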

(Clause 4) In the mass spectrometric data processing method according to any one of Clauses 1 to 3,
the unknown sample may further be discriminated by applying, to the discrimination model, a peak list generated based on profile data obtained by mass spectrometry of the unknown sample.

(Clause 5) A mass spectrometric data processing system according to one aspect comprises:
a profile data acquisition unit that acquires a plurality of profile data obtained by irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, each profile data being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in a respective one of the laser light irradiations;
a grouping unit that distributes the plurality of profile data into a plurality of groups such that each group includes one or more profile data;
a peak list generation unit that generates, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
a discrimination model generation unit that generates a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

(Clause 6) In the mass spectrometric data processing system according to Clause 5,
the grouping unit may randomly distribute the plurality of profile data into the plurality of groups.

(Clause 7) In the mass spectrometric data processing system according to Clause 5 or 6,
the grouping unit may distribute at least one of the plurality of profile data redundantly to two or more of the plurality of groups.

(Clause 8) The mass spectrometric data processing system according to any one of Clauses 5 to 7 may further comprise
a discrimination unit that discriminates the unknown sample by applying, to the discrimination model, a peak list generated based on profile data obtained by mass spectrometry of the unknown sample.

(Clause 9) A mass spectrometric data processing program according to one aspect causes a computer to function as each unit of the mass spectrometric data processing system according to any one of Clauses 5 to 8.

According to the mass spectrometric data processing method of Clause 1, the mass spectrometric data processing system of Clause 5, or the mass spectrometric data processing program of Clause 9, the large amount of learning data needed to construct a highly accurate discrimination model can be obtained with a small number of mass spectrometry runs.

Further, according to the mass spectrometric data processing method of Clause 2 or the mass spectrometric data processing system of Clause 6, appropriate learning data can be generated without being affected by unevenness in the concentration of sample components within the measurement region on the sample.

Further, according to the mass spectrometric data processing method of Clause 3 or the mass spectrometric data processing system of Clause 7, the number of profile data allocated to each group can be kept large even when the number of profile data is small or the number of groups is large, so that a decrease in S/N can be prevented.
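As a rough illustration of why allocating more profile data to each group helps S/N, assume (this model is not stated in the source) that every profile contributes the same signal $s$ at a given peak and independent zero-mean noise with standard deviation $\sigma$. Integrating $n$ profiles then gives:

```latex
\left.\frac{S}{N}\right|_{n}
  = \frac{n\,s}{\sqrt{n}\,\sigma}
  = \sqrt{n}\,\frac{s}{\sigma}
```

Under these assumptions, halving the number of profiles per group lowers S/N by a factor of $\sqrt{2}$, whereas overlapping assignment allows $n$ to stay fixed as the number of groups grows.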

10 ... Mass spectrometric data processing system
20 ... Learning data generation unit
21 ... Profile data acquisition unit
22 ... Grouping unit
23 ... Peak list generation unit
30 ... Discrimination model generation unit
40 ... Discrimination unit
41 ... Unknown sample data acquisition unit
42 ... Discrimination execution unit
50 ... Data storage unit
60 ... Input unit
70 ... Display unit

Claims

1. A mass spectrometric data processing method comprising:
irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, and acquiring a plurality of profile data, each being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in a respective one of the laser light irradiations;
distributing the plurality of profile data into a plurality of groups such that each group includes one or more profile data;
generating, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
generating a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

2. The mass spectrometric data processing method according to claim 1, wherein the plurality of profile data are randomly distributed into the plurality of groups.

3. The mass spectrometric data processing method according to claim 1 or 2, wherein, when the plurality of profile data are distributed into the plurality of groups, at least one of the plurality of profile data is distributed redundantly to two or more of the plurality of groups.

4. The mass spectrometric data processing method according to any one of claims 1 to 3, further comprising discriminating the unknown sample by applying, to the discrimination model, a peak list generated based on profile data obtained by mass spectrometry of the unknown sample.

5. A mass spectrometric data processing system comprising:
a profile data acquisition unit that acquires a plurality of profile data obtained by irradiating a known sample with laser light a plurality of times in a mass spectrometer that ionizes a sample by laser ionization, each profile data being a spectrum showing the relationship between the m/z and the intensity of ions generated from the known sample in a respective one of the laser light irradiations;
a grouping unit that distributes the plurality of profile data into a plurality of groups such that each group includes one or more profile data;
a peak list generation unit that generates, for each of the plurality of groups, a peak list describing the m/z and the intensity of each peak derived from the known sample, based on the one or more profile data included in the group; and
a discrimination model generation unit that generates a discrimination model for discriminating an unknown sample, using the peak lists and information on the type of the known sample as learning data.

6. The mass spectrometric data processing system according to claim 5, wherein the grouping unit randomly distributes the plurality of profile data into the plurality of groups.

7. The mass spectrometric data processing system according to claim 5 or 6, wherein the grouping unit distributes at least one of the plurality of profile data redundantly to two or more of the plurality of groups.

8. The mass spectrometric data processing system according to any one of claims 5 to 7, further comprising a discrimination unit that discriminates the unknown sample by applying, to the discrimination model, a peak list generated based on profile data obtained by mass spectrometry of the unknown sample.

9. A mass spectrometric data processing program that causes a computer to function as each unit of the mass spectrometric data processing system according to any one of claims 5 to 8.