JPWO2020044435A1

JPWO2020044435A1 - Data analysis method, data analysis device, and learning model creation method for data analysis

Info

Publication number: JPWO2020044435A1
Application number: JP2020539899A
Authority: JP
Inventors: 藤田 雄一郎; 野田 陽
Original assignee: Shimadzu Corp
Current assignee: Shimadzu Corp
Priority date: 2018-08-28
Filing date: 2018-08-28
Publication date: 2021-08-10
Anticipated expiration: 2038-08-28
Also published as: JP7255597B2; WO2020044435A1; US20210319364A1

Abstract

A method for analyzing data to be analyzed with an analysis program 33 using analysis parameters, the method including: step S2 of creating a plurality of learning parameter sets; step S4 of executing analysis by the analysis program on a plurality of reference data using the plurality of learning parameter sets and determining the learning parameter sets suitable for the analysis; step S7 of associating, with each of the plurality of learning parameter sets, a reference data group, which is a group of the reference data for which that learning parameter set was found suitable for the analysis; step S9 of inputting unanalyzed data; step S11 of determining an actual-analysis parameter set by obtaining a value for each of one or more analysis parameters from the commonality between the unanalyzed data and each reference data group and from the learning parameter set associated with each reference data group; and step S12 of executing analysis of the unanalyzed data by the analysis program using the actual-analysis parameter set.

Description

The present invention relates to a technique used when various kinds of data, including measurement data obtained by measuring a sample with an analytical instrument, are analyzed by an analysis program.

Chromatograph mass spectrometers, which combine a chromatograph with a mass spectrometer, are widely used to identify and quantify target compounds contained in a sample. In a chromatograph mass spectrometer, a sample is introduced into the column of the chromatograph, the substances contained in the sample are separated by their differences in retention time (RT), and the separated substances are introduced into the mass spectrometer (MS). The time interval at which the substances separated by the chromatograph are introduced into the mass spectrometer is determined according to the speed of scan measurement in the mass spectrometer (the mass scanning speed). A substance introduced into the mass spectrometer is ionized, and the resulting ions are separated according to their mass-to-charge ratio (m/z) and detected. This yields three-dimensional data in which the detected ion intensity is plotted against two axes, retention time (RT) and mass-to-charge ratio (m/z). In this three-dimensional data, the detected intensity (signal intensity) of ions at each mass-to-charge ratio reflects the amount, in the sample, of the substance that produces ions having that mass-to-charge ratio.

Integrating the signal intensity along the mass-to-charge ratio (m/z) axis at each point on the retention time (RT) axis of this three-dimensional data gives the total ion current (TIC). Plotting the total ion current along the retention time axis gives the total ion current chromatogram (TICC).
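The relationship between the three-dimensional data, the TIC, and the TICC can be sketched in a few lines of Python. This is only an illustrative sketch: the nested-list layout (`intensity[i][j]` for retention-time index `i` and m/z index `j`), the function name `total_ion_current`, and the toy numbers are all assumptions made for this example, not part of the patent.

```python
from typing import List


def total_ion_current(intensity: List[List[float]]) -> List[float]:
    """Sum the signal intensity along the m/z axis at each retention-time point.

    intensity[i][j] is the detected intensity at retention-time index i
    and m/z index j (a hypothetical layout chosen for this sketch).
    """
    return [sum(scan) for scan in intensity]


# Toy data: 4 scans (retention-time points) x 3 m/z channels.
scans = [
    [0.0, 1.0, 0.0],
    [2.0, 5.0, 1.0],
    [4.0, 9.0, 2.0],
    [1.0, 3.0, 0.0],
]
retention_times = [1.0, 1.1, 1.2, 1.3]  # minutes, illustrative values

tic = total_ion_current(scans)
# The TICC is the TIC plotted against retention time; here it is kept
# as a list of (retention time, TIC) pairs.
ticc = list(zip(retention_times, tic))
```

In this toy data the largest TIC value falls on the third scan, which would correspond to the peak top of the TICC.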

If the substances contained in the sample are sufficiently separated from one another by the column of the chromatograph, a unimodal, bell-shaped peak appears in the TICC waveform at the retention time of each substance. By identifying the substance from the mass spectrum at that retention time, the substance that eluted at that retention time can be determined. A substance is identified by comparing the mass spectrum to be identified with measured or theoretical mass spectra of known substances stored in a database (DB). The items compared include the mass-to-charge ratio (m/z) values at which mass peaks exist and the intensities of those mass peaks. The degree of agreement between the mass spectra (the score) gives a quantitative measure of how reliable the identification result is. In addition, the amount of each component separated by the chromatograph can be estimated from the area or height of the corresponding peak on the TICC waveform.
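As a rough illustration of how the height and area of a well-separated TICC peak might be computed, the sketch below uses the trapezoidal rule over the sampled points. The function name `peak_height_and_area`, the constant `baseline` argument, and the toy values are assumptions for this example; the patent does not specify an integration method.

```python
def peak_height_and_area(times, tic, baseline=0.0):
    """Return (height, area) of a single peak sampled at the given times.

    The baseline is treated as a constant and subtracted first; the area
    is then approximated with the trapezoidal rule.
    """
    heights = [max(v - baseline, 0.0) for v in tic]
    area = 0.0
    for k in range(1, len(times)):
        # Trapezoid between consecutive sample points.
        area += 0.5 * (heights[k - 1] + heights[k]) * (times[k] - times[k - 1])
    return max(heights), area


# Toy triangular peak sampled at unit intervals.
height, area = peak_height_and_area([0, 1, 2, 3, 4], [0, 2, 4, 2, 0])
```

Such a simple integration is only meaningful when the peak is not superposed with others and the baseline is known, which is why overlapping peaks call for deconvolution first.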

However, if the sample contains a plurality of substances whose retention times are the same or close to one another, those substances are mixed in the eluate leaving the chromatograph at and around their retention times. The mass spectra at and around those retention times then contain mass peaks derived from a plurality of substances, and the peak of the TICC waveform obtained by integrating those mass peaks is a superposition of peaks derived from a plurality of substances. Normally, some substance is considered to have eluted at the retention time of the peak top of a unimodal peak appearing in the TICC waveform. With superposed peaks, however, the peak shape may be distorted, a small unimodal peak may be buried in a large one, or the peak may become multimodal. In such cases, the retention time corresponding to the peak top of the unimodal peak that would be obtained if the eluate contained only a single substance cannot be determined correctly. The situation becomes more complicated when the measurement data contains noise or the signal intensity includes a baseline component, making it even harder to determine the retention time of a small TICC peak derived from a substance present in the sample only in a small amount.

For this reason, peak deconvolution is performed: the superposed peaks are separated by signal processing, statistical processing, or the like, and the TICC peaks are purified so that each mass spectrum contains only the group of mass peaks derived from a single substance. Purifying the TICC peaks in this way makes it possible to estimate which TICC peaks were superposed in the TICC waveform of the measurement data. In many cases, a dedicated analysis program is used to perform peak deconvolution. A representative analysis program used to purify the peaks of measurement data obtained by gas chromatography/mass spectrometry (GC/MS), referred to as GCMS data, is AMDIS (Automated Mass Spectral Deconvolution and Identification System), provided by the National Institute of Standards and Technology (NIST) of the United States (see Non-Patent Document 1). AMDIS uses six analysis parameters to purify peaks: peak width, excluded mass-to-charge ratio, number of neighboring peaks, peak spacing, peak detection sensitivity, and model goodness of fit. An initial value is provided for each of these analysis parameters, and in many cases the initial values are used unchanged for the analysis.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2007-41234

Non-Patent Document 1: "Automated Mass Spectral Deconvolution & Identification System", [online], The National Institute of Standards and Technology (NIST), [retrieved June 22, 2018], Internet <URL: https://chemdata.nist.gov/mass-spc/amdis/explanation.html>
Non-Patent Document 2: "Mass++", [online], Shimadzu Corporation, [retrieved June 22, 2018], Internet <URL: https://www.shimadzu.co.jp/aboutus/ms_r/masspp.html>
Non-Patent Document 3: Takayuki Okatani, "Deep Learning (Machine Learning Professional Series)", Kodansha, April 2015

The initial values of the analysis parameters in AMDIS were chosen on the assumption that they would be used generically for a wide variety of GCMS data, and they are not necessarily appropriate for a given state of the GCMS data (the shape of the superposed peaks, the mass scanning speed, the noise level, and so on). In other words, using the initial values unchanged for peak deconvolution of the GCMS data to be analyzed may fail to separate the peaks sufficiently. In such cases, the user changes the value of each analysis parameter from its initial value, separates the peaks, and repeats the parameter adjustment until a result that the analyst considers reasonable is obtained, for example until a substance is identified with sufficient reliability, that is, until a sufficiently high score is obtained. Because the analyst adjusts the parameters on the basis of personal intuition and experience, the analysis result depends on the ability and judgment of the user. The problem is therefore that the results vary with the skill level of the analyst. A further problem is that the analysis takes time and effort because the parameter adjustment has to be repeated.

The case of analyzing GCMS data with AMDIS has been described here as one example of the prior art, but the same problems arise whenever various kinds of data, such as measurement data obtained by measuring a sample with another analytical instrument, are analyzed using some set of analysis parameters.

The problem to be solved by the present invention is to provide a technique with which appropriate analysis results can be obtained easily when various kinds of data, such as measurement data obtained by measuring a sample with an analytical instrument, are analyzed using analysis parameters.

A first aspect of the present invention, made to solve the above problem, is a method for analyzing data to be analyzed by a predetermined analysis program with a value set for each of one or more analysis parameters, the method comprising:
a learning parameter set creation step of creating a plurality of learning parameter sets that differ from one another in the value of at least one of the one or more analysis parameters;
a learning parameter set determination step of executing, for each of a plurality of reference data, analysis by the analysis program using each of the plurality of learning parameter sets, and determining, according to a predetermined criterion, the learning parameter set suitable for the analysis;
a reference data group creation step of associating, with each of the plurality of learning parameter sets, a reference data group, which is a group of the reference data for which that learning parameter set was found suitable for the analysis in the learning parameter set determination step;
an unanalyzed data input step of inputting unanalyzed data;
an actual-analysis parameter set determination step of obtaining, according to a predetermined criterion, the commonality between the unanalyzed data and each reference data group and, on the basis of the commonality, obtaining from the learning parameter sets associated with the reference data groups a value suitable for analysis of the unanalyzed data for each of the one or more analysis parameters, thereby determining an actual-analysis parameter set; and
an actual analysis step of executing analysis of the unanalyzed data by the analysis program using the actual-analysis parameter set.
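The learning parameter set creation step can be pictured as enumerating combinations of candidate values, so that any two sets differ in at least one parameter. The following is a minimal sketch under that assumption; the parameter names, the candidate values, and the function `make_learning_parameter_sets` are hypothetical and are not taken from AMDIS or from the patent.

```python
from itertools import product

# Hypothetical candidate values for three analysis parameters.
candidate_values = {
    "peak_width": [8, 12, 16],
    "detection_sensitivity": ["low", "medium", "high"],
    "model_fit": [1, 2],
}


def make_learning_parameter_sets(candidates):
    """Return every combination of candidate values as one parameter set.

    Each returned dict is one learning parameter set; any two sets
    differ in the value of at least one parameter.
    """
    names = list(candidates)
    value_lists = [candidates[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


parameter_sets = make_learning_parameter_sets(candidate_values)
```

A full grid grows quickly with the number of parameters, so in practice one might restrict it to a smaller, hand-picked list of sets; the method only requires that the sets differ from one another.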

In the data analysis method according to the present invention, a plurality of learning parameter sets that differ from one another in the value of at least one of the one or more analysis parameters used for the data analysis are created first. Then, for each of a plurality of reference data, analysis by the analysis program is executed using each of the plurality of learning parameter sets, and the learning parameter set suitable for the analysis is determined according to a predetermined criterion. This can be done, for example, by obtaining, for each reference data, an evaluation value representing the validity of the analysis result obtained with each of the created learning parameter sets, and taking the set with the highest evaluation value as the optimal learning parameter set. Alternatively, every set whose evaluation value is equal to or greater than a predetermined reference value may be regarded as a learning parameter set suitable for the analysis. In the former case, one suitable learning parameter set is determined for each reference data; in the latter case, one or more learning parameter sets are determined.
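The two selection rules described above (highest evaluation value, or every set at or above a reference value) can be sketched as follows. The score table stands in for the evaluation values that the analysis program would return; the numbers, the reference names, and the function names are invented for illustration.

```python
def best_parameter_set(scores):
    """Index of the learning parameter set with the highest evaluation value."""
    return max(range(len(scores)), key=lambda k: scores[k])


def suitable_parameter_sets(scores, threshold):
    """Indices of all learning parameter sets whose evaluation value
    is at or above the reference value."""
    return [k for k, s in enumerate(scores) if s >= threshold]


# Hypothetical evaluation values: one row per reference data,
# one entry per learning parameter set (set 0, set 1, set 2).
score_table = {
    "ref_A": [62.0, 88.0, 75.0],
    "ref_B": [91.0, 90.0, 40.0],
}

best = {name: best_parameter_set(s) for name, s in score_table.items()}
suitable = {name: suitable_parameter_sets(s, 85.0) for name, s in score_table.items()}
```

With the threshold rule, `ref_B` is assigned two suitable sets, which is the situation in which one reference data may later belong to more than one reference data group.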

Next, for each of the plurality of learning parameter sets, a reference data group is created, which is the group of reference data for which that learning parameter set was found suitable for the analysis in the learning parameter set determination step. Reference data that share a suitable learning parameter set are thereby grouped together, providing the information from which the parameter set used for analyzing unanalyzed data is decided. If a plurality of suitable parameter sets were determined for one reference data in the preceding learning parameter set determination step, that reference data may be included in a plurality of reference data groups.
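The grouping itself amounts to inverting the mapping from reference data to suitable parameter sets. A minimal sketch, assuming the suitable sets are given as lists of set indices per reference data (the names `build_reference_groups`, `ref_A`, and so on are hypothetical):

```python
from collections import defaultdict


def build_reference_groups(suitable):
    """Map each parameter set index to the reference data it suits.

    suitable maps a reference-data name to the list of learning
    parameter set indices judged suitable for it. A reference data
    with several suitable sets appears in several groups.
    """
    groups = defaultdict(list)
    for ref_name, set_indices in suitable.items():
        for k in set_indices:
            groups[k].append(ref_name)
    return dict(groups)


suitable = {"ref_A": [1], "ref_B": [0, 1], "ref_C": [2]}
groups = build_reference_groups(suitable)
```

Here `ref_B` ends up in the groups for both set 0 and set 1, matching the case in which a reference data is allowed to belong to several reference data groups.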

Next, unanalyzed data is input. Then, on the basis of the commonality between the unanalyzed data and the reference data groups, obtained according to a predetermined criterion, a value suitable for analysis of the unanalyzed data is obtained for each of the one or more analysis parameters, thereby determining the actual-analysis parameter set. The predetermined criterion depends on the type of data to be analyzed. For example, when the data to be analyzed is the TICC waveform of GCMS data, a reference data group made up of reference data whose peaks are close in shape to the peak of the unanalyzed data can be regarded as a reference data group with high commonality.

The actual-analysis parameter set determination step can be performed, for example, by determining the reference data group that has the highest commonality with the unanalyzed data and using the learning parameter set corresponding to that reference data group, unchanged, as the actual-analysis parameter set. In this way, when one of the parameter set numbers assigned to the plurality of learning parameter sets is to be predicted, the parameter set number associated with the reference data group with the highest commonality is selected. This treats the combination of the one or more parameters as a single "parameter set" and asks which parameter set should be used for the analysis. In machine learning terms, this is the "classification" approach.
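The classification approach can be illustrated with a deliberately simple commonality measure. In the sketch below, commonality is assumed to be the mean cosine similarity between the unanalyzed waveform and the waveforms in each reference data group; this measure, the function names, and the toy waveforms are assumptions for illustration, not the criterion used in the embodiment.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length waveforms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_parameter_set(unanalyzed, groups):
    """Return the parameter set number whose reference data group has
    the highest mean similarity to the unanalyzed waveform.

    groups maps a parameter set number to a list of reference waveforms.
    """
    def commonality(waveforms):
        return sum(cosine_similarity(unanalyzed, w) for w in waveforms) / len(waveforms)

    return max(groups, key=lambda k: commonality(groups[k]))


# Toy reference data groups keyed by parameter set number.
groups = {
    0: [[0, 1, 4, 1, 0], [0, 2, 5, 2, 0]],  # narrow single peaks
    1: [[1, 3, 4, 3, 1], [2, 3, 3, 3, 2]],  # broad peaks
}
chosen = select_parameter_set([0, 1, 5, 1, 0], groups)
```

The narrow test peak is closest to the narrow-peak group, so parameter set 0 is chosen and would be used unchanged as the actual-analysis parameter set.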

As another way of performing the actual-analysis parameter set determination step, the value of each analysis parameter may be associated directly with the reference data groups (each of which may consist of a single reference data), and the value of each analysis parameter to be used for analyzing the unanalyzed data may be estimated. In machine learning terms, this is the "regression" approach. With regression, even if the learning parameter sets contain only two values (for example, 5 and 10) for a given analysis parameter, a regression analysis based on the commonality between the unanalyzed data and each reference data group or each reference data (for example, the similarity of the TICC waveforms) can yield an intermediate value that is neither of the two (for example, 7) as the optimal value of that analysis parameter for the unanalyzed data. Such a regression analysis can be performed individually for each of the one or more analysis parameters, or collectively for all of them (that is, per parameter set). Finally, the analysis of the unanalyzed data is executed by the analysis program using the actual-analysis parameter set made up of the values of the one or more analysis parameters obtained by the regression analysis.
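To see how an intermediate value such as 7 can arise from learning values 5 and 10, consider a commonality-weighted average. This is a simple stand-in chosen for illustration, not the regression model of the patent; the function name `estimate_parameter_value` and the commonality values 0.6 and 0.4 are assumptions.

```python
def estimate_parameter_value(commonalities, values):
    """Commonality-weighted average of the parameter values associated
    with each reference data group.

    commonalities[k] is the commonality between the unanalyzed data and
    group k; values[k] is the parameter value associated with group k.
    """
    total = sum(commonalities)
    if total == 0:
        raise ValueError("at least one commonality must be positive")
    return sum(c * v for c, v in zip(commonalities, values)) / total


# Two groups whose learning parameter sets use the values 5 and 10.
# The unanalyzed data is assumed to resemble the first group slightly more.
estimated = estimate_parameter_value([0.6, 0.4], [5, 10])
```

With these weights the estimate is approximately 7, a value contained in neither learning parameter set. A trained regression model would play the same role with a learned, rather than fixed, mapping from waveform to parameter value.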

As described above, in the data analysis method according to the present invention, prior to the analysis of unanalyzed data, analysis using a plurality of reference data is performed and the one or more reference data that share a suitable learning parameter set are grouped into a reference data group. Then, on the basis of the commonality between the unanalyzed data and the reference data groups, a value suitable for analysis of the unanalyzed data is obtained for each of the one or more analysis parameters from the learning parameter sets associated with the reference data groups, and these values are determined as the actual-analysis parameter set. The user therefore does not need to set the analysis parameter values manually, and the parameter set suitable for analyzing the unanalyzed data is determined uniquely, so appropriate analysis results can be obtained easily. Nor do the results vary with the skill level of the user.

A second aspect of the present invention, made to solve the above problem, is a method for creating a learning model used to determine the values of one or more analysis parameters used when data to be analyzed is analyzed by a predetermined analysis program, the method comprising:
a learning parameter set creation step of creating a plurality of learning parameter sets that differ from one another in the value of at least one of the one or more analysis parameters;
a learning parameter set determination step of executing, for each of a plurality of reference data, analysis by the analysis program using each of the plurality of learning parameter sets, and determining, according to a predetermined criterion, the learning parameter set suitable for the analysis;
a reference data group creation step of associating, with each of the plurality of learning parameter sets, a reference data group, which is a group of the reference data for which that learning parameter set was found suitable for the analysis in the learning parameter set determination step; and
a learning model creation step of creating a learning model by machine learning that uses, as learning data, the plurality of learning parameter sets each associated with its reference data group.

In the method for creating a learning model for data analysis according to the second aspect of the present invention, the same learning parameter set creation step, learning parameter set determination step, and reference data group creation step as in the data analysis method of the first aspect are performed, and a learning model is created by machine learning that uses, as learning data, the resulting learning parameter sets each associated with its reference data group. Various machine learning techniques have been proposed in recent years (for example, Patent Document 1). The machine learning here can use, for example, deep learning, a convolutional neural network (CNN), which is one form of deep learning, a support vector machine (SVM), or AdaBoost. A learning model created in this way can be used suitably in the actual-analysis parameter set determination step of the data analysis method according to the first aspect of the present invention.

Furthermore, a third aspect of the present invention, made to solve the above problem, is a device for analyzing data to be analyzed by a predetermined analysis program with a value set for each of one or more analysis parameters, the device comprising:
a learning parameter set creation unit that creates a plurality of learning parameter sets that differ from one another in the value of at least one of the one or more analysis parameters;
a learning parameter set determination unit that executes, for each of a plurality of reference data, analysis by the analysis program using each of the plurality of learning parameter sets, and determines, according to a predetermined criterion, the learning parameter set suitable for the analysis;
a reference data group creation unit that associates, with each of the plurality of learning parameter sets, a reference data group, which is a group of the reference data for which that learning parameter set was found suitable for the analysis by the learning parameter set determination unit;
an unanalyzed data input unit that inputs unanalyzed data;
an actual-analysis parameter set determination unit that obtains, according to a predetermined criterion, the commonality between the unanalyzed data and each reference data group and, on the basis of the commonality, obtains from the learning parameter sets associated with the reference data groups a value suitable for analysis of the unanalyzed data for each of the one or more analysis parameters, thereby determining an actual-analysis parameter set; and
an actual analysis execution unit that executes analysis of the unanalyzed data by the analysis program using the actual-analysis parameter set.

By using the data analysis method, the data analysis device, or the method for creating a learning model for data analysis according to the present invention, appropriate analysis results can be obtained easily when various kinds of data, such as measurement data obtained by measuring a sample with an analytical instrument, are analyzed using analysis parameters.

Brief description of the drawings (in order of appearance):
A configuration diagram of the main part of an analysis system in which a control/processing device, which is one embodiment of the data analysis device according to the present invention, is combined with a gas chromatograph mass spectrometer.
A flowchart of one embodiment of the data analysis method according to the present invention.
An example of a heat map of three-dimensional data obtained by measuring a sample with a gas chromatograph mass spectrometer (a), and an example of a total ion current chromatogram (b).
An explanation of the analysis parameters used in AMDIS.
Part of the learning parameter sets used in this embodiment.
A histogram showing the results of analyzing the divided reference data with a plurality of learning parameter sets in this embodiment.
For each of the three learning parameter sets used in this embodiment, an overlay of the peaks for which that learning parameter set was optimal for the analysis.
The structure of the data used to evaluate the learning model created by machine learning.
A diagram explaining the structure of the convolutional neural network used for the machine learning in this embodiment.
The hyperparameters and network configuration of the convolutional neural network that gave the highest accuracy in this embodiment.
The accuracy obtained when the analysis process of selecting the optimal parameter set with the learning model of this embodiment was evaluated by 5-fold cross-validation.
A diagram explaining the process of extracting divided unanalyzed data from unanalyzed data.
A conceptual diagram of the data analysis method and analysis device according to the present invention.
A block diagram of a modification of the data analysis device according to the present invention.

Embodiments of the data analysis method, the data analysis device, and the method for creating a learning model for data analysis according to the present invention are described below with reference to the drawings.

The data to be analyzed in this embodiment is three-dimensional GCMS data acquired by measurement with a gas chromatograph mass spectrometer. In this embodiment, AMDIS is used as the analysis program. The mass spectra purified by separating the peaks of the total ion current chromatogram waveform (TICC waveform) obtained from the GCMS data are compared with the mass spectra of various known substances stored in advance in a substance database (substance DB), in order to identify the substances contained in the sample and to calculate an evaluation value (score) representing the degree of agreement. The higher the score, the more reliable the substance identification.

FIG. 1 is a configuration diagram of the main part of an analysis system that includes the data analysis device of this embodiment, and FIG. 2 is a flowchart of the data analysis method of this embodiment. The analysis system of this embodiment includes a gas chromatograph mass spectrometer 1 and a control/processing device 3.

The gas chromatograph mass spectrometer 1 consists of a gas chromatograph 10 and a mass spectrometer 20. In the gas chromatograph 10, liquid samples set in advance in an autosampler 14 are sent one after another to an injector 13 and injected from the injector 13 into a sample vaporization chamber 12. A carrier gas such as helium is also supplied to the sample vaporization chamber 12. The sample vaporization chamber 12 is heated, so the liquid sample injected from the injector 13 vaporizes, is carried by the flow of carrier gas, and is sent into a capillary column 15 housed in a column oven 11. The various compounds contained in the sample gas are separated in time while passing through the capillary column 15 and are introduced sequentially into the mass spectrometer 20.

The mass spectrometer 20 includes a vacuum chamber 23 that is evacuated by a vacuum pump (not shown). An ion source 21, a lens electrode 22, a quadrupole mass filter 24, and an ion detector 25 are arranged inside the vacuum chamber 23. The substances in the sample gas introduced from the gas chromatograph 10 are introduced sequentially into the ion source 21. The ion source 21 is, for example, an EI (electron ionization) source, in which ions are generated by irradiating the sample gas introduced into an ionization chamber 211 with thermal electrons generated by a filament 212. The ions generated in the ion source 21 are focused by the lens electrode 22, separated according to mass-to-charge ratio by the quadrupole mass filter 24, and then detected by the ion detector 25. The output signal from the ion detector 25 is stored in a storage unit 31 of the control/processing device 3.

The control/processing device 3 has two functions: it serves as an analysis control unit that controls each part of the gas chromatograph mass spectrometer 1, and it processes data obtained by measurement with the gas chromatograph mass spectrometer 1 or the like. The latter function corresponds to the data analysis device according to the present invention. The control/processing device 3 includes the storage unit 31 and a substance database (substance DB) 32, and a predetermined analysis program 33 (AMDIS in this embodiment) is installed in advance. The substance database 32 is used to identify the substances contained in a sample when data are analyzed by the analysis program 33. For each of a large number of known substances, it stores information such as the substance name, chemical formula, theoretical retention time, and mass spectrum in association with one another.

The control/processing device 3 further includes, as functional blocks, a reference data acquisition unit 41, a parameter set creation unit 42, a parameter set determination unit 43, a reference data division unit 44, a learning model creation unit 45, an unanalyzed data input reception unit 46, an unanalyzed data division unit 47, an actual analysis parameter determination unit 48, an actual analysis execution unit 49, an analysis result output unit 50, and a learning model update unit 51. The control/processing device 3 is in substance a computer, and these functional blocks are embodied by executing, on its processor, a data analysis program 40 installed in advance in the control/processing device 3. An input unit 6, such as a mouse and keyboard, and a display unit 7 are connected to the control/processing device 3.

Next, the procedure for analyzing GCMS data in this embodiment is described with reference to the flowchart of FIG. 2, together with an actual analysis example. Steps S1 to S8 in the flowchart of FIG. 2 constitute the procedure of one embodiment of the learning model creation method according to the present invention.

When the user instructs acquisition of reference data through the input unit 6, the reference data acquisition unit 41 operates each part of the gas chromatograph mass spectrometer 1, introduces the samples that the user has set in the autosampler 14 into the gas chromatograph mass spectrometer 1 one after another, and measures each sample. The GCMS data obtained by measuring each sample are stored sequentially in the storage unit 31 of the control/processing device 3. Although the case described here acquires the reference data by actually measuring samples, the reference data acquisition unit 41 may instead read previously acquired reference data from the storage unit 31 in accordance with the user's instruction. A plurality of reference data are acquired in this way (step S1).

In this embodiment, 32 types of biological samples, each containing some or all of 504 known substances, were prepared, and each was measured with the gas chromatograph mass spectrometer 1 to acquire 32 GCMS data sets. Retention time and mass spectrum information for all 504 known substances is stored in the substance database 32. The measurement period was from 4 to 24 minutes after sample injection, during which 24,000 scans were performed over the mass-to-charge ratio range of 80 to 500.

FIG. 3(a) shows an example of GCMS data. In a plot whose two axes are retention time (RT) and mass-to-charge ratio (m/z), the peak intensities are converted to a log₁₀ scale and the values are rendered as a gradation from cold to warm colors (shown in monochrome in FIG. 3(a)). FIG. 3(b) shows part of the TICC waveform (40 scans of data) created from this GCMS data. When a sample contains many substances, as in this embodiment, it often contains several substances with the same or similar retention times, so the eluate leaving the chromatograph at and around those retention times is a mixture of several substances. As a result, the mass spectra at and around those retention times contain mass peaks derived from several substances, and the peak of the TICC waveform obtained by summing those mass peaks is also a superposition of peaks derived from several substances (a superimposed peak), as shown in FIG. 3(b).
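The relationship between the three-dimensional GCMS data and the TICC waveform described above can be sketched as follows. This is only an illustrative sketch: the data layout (one dictionary of m/z to intensity per scan), the function names, and the clamping floor used before taking the logarithm are assumptions and are not taken from the embodiment.

```python
import math

def total_ion_current(scans):
    """Sum the intensities over all m/z values in each scan (one TICC point per scan)."""
    return [sum(scan.values()) for scan in scans]

def log10_intensity(scans, floor=1.0):
    """Convert intensities to a log10 scale for heat-map display.

    Values below `floor` are clamped so that the logarithm is defined
    (the clamping is an assumption made for this sketch).
    """
    return [{mz: math.log10(max(i, floor)) for mz, i in scan.items()}
            for scan in scans]

# Toy data: three scans, each mapping m/z -> intensity (invented values).
scans = [
    {80: 100.0, 81: 10.0},
    {80: 1000.0, 81: 500.0, 120: 50.0},
    {80: 200.0, 120: 20.0},
]
ticc = total_ion_current(scans)
heat = log10_intensity(scans)
```

Because each TICC point is a sum over all m/z values, co-eluting substances contribute to the same point, which is why the superimposed peaks described above arise.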

When the user instructs creation of learning parameter sets through the input unit 6, the learning parameter set creation unit 42 executes the analysis program 40 installed in advance in the control/processing device 3 and displays a parameter setting screen on the display unit 7. The analysis program in this embodiment is AMDIS. AMDIS uses six analysis parameters: Component width, Omit m/z, Adjacent peak subtraction, Resolution, Sensitivity, and Shape requirement. FIG. 4 shows the content of each parameter. In this embodiment, 45 parameter sets were created by starting from the initial values (Parameter Set Number: 0) and, for example, lowering Adjacent peak subtraction from Two to One or lowering Resolution from High to Medium. FIG. 5 shows some of them (the initial values and 10 parameter sets). Although the description here creates the learning parameter sets after acquiring the reference data, the two may be performed in the reverse order or in parallel. The learning parameter set creation unit 42 may also read previously created learning parameter sets from the storage unit 31 in accordance with the user's instruction. A plurality of learning parameter sets are created in this way (step S2).

When the reference data have been acquired and the learning parameter sets have been created, the learning parameter set determination unit 43 runs an AMDIS analysis on each of the 32 GCMS data sets using each of the 45 learning parameter sets individually (step S3). Specifically, for each combination of GCMS data and learning parameter set, the peaks of the TICC waveform contained in the GCMS data are purified, and the mass spectrum corresponding to each peak is compared with the mass spectra stored in the substance database 32 to identify the substance corresponding to that peak. An evaluation value (score) is then obtained from the degree of match between the mass spectra. AMDIS yields a score from 1 to 100 that represents the reliability of the substance identification. In this embodiment, a peak with a score of 60 or more was regarded as identified, and a peak with a score below 60 was regarded as unidentified. For each identified peak, the name of the identified substance, the retention time, the number of the parameter set used for the analysis, and the score are associated with one another and stored in the storage unit 31.

Next, the learning parameter set determination unit 43 determines, for each peak in the 32 GCMS data sets, the learning parameter set best suited to its identification (step S4). When a peak at the same retention time is identified by analyses using several learning parameter sets, the learning parameter set that gave the highest-scoring analysis result is taken as the optimum learning parameter set for identifying that peak. When several learning parameter sets share the highest score, the one with the smallest number is taken as the optimum learning parameter set. This processing is performed for all peaks (all peaks for which a substance was identified) to determine the optimum learning parameter sets. However, when the difference between the retention time of the peak and the theoretical retention time of the identified substance exceeded 0.25 minutes, the result was treated as a misidentification regardless of its score, and the learning parameter set with the next highest score and a retention time difference of 0.25 minutes or less was taken as the optimum learning parameter set for identifying that peak.
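The selection rule of step S4 can be illustrated with the following sketch. The `Result` record and the function name are hypothetical; only the score threshold of 60, the 0.25-minute retention time tolerance, and the tie-break toward the smaller parameter set number come from the description above.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 60      # peaks scoring below this are treated as unidentified
MAX_RT_DIFF_MIN = 0.25    # larger retention time gaps count as misidentification

@dataclass(frozen=True)
class Result:
    param_set: int         # learning parameter set number
    score: float           # match score reported by the analysis program
    peak_rt: float         # observed retention time of the peak (minutes)
    theoretical_rt: float  # theoretical retention time from the substance DB

def best_parameter_set(results):
    """Return the optimum parameter set number for one peak, or None."""
    valid = [r for r in results
             if r.score >= SCORE_THRESHOLD
             and abs(r.peak_rt - r.theoretical_rt) <= MAX_RT_DIFF_MIN]
    if not valid:
        return None
    # Highest score first; among equal scores the smaller set number wins.
    return min(valid, key=lambda r: (-r.score, r.param_set)).param_set

# Invented results for a single peak observed at 10.0 min.
results = [
    Result(param_set=12, score=95, peak_rt=10.0, theoretical_rt=10.9),  # RT too far
    Result(param_set=3, score=90, peak_rt=10.0, theoretical_rt=10.1),
    Result(param_set=1, score=90, peak_rt=10.0, theoretical_rt=10.05),
    Result(param_set=0, score=55, peak_rt=10.0, theoretical_rt=10.0),   # below 60
]
best = best_parameter_set(results)
```

Filtering out the misidentified results before ranking has the same effect as falling back to the next highest score, as described in the text.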

In this embodiment, when several learning parameter sets share the highest score, the one with the smallest number was taken as the optimum learning parameter set. Alternatively, the one with the largest number may be taken as the optimum learning parameter set, or all of the tied sets may be taken as optimum learning parameter sets.

Once the optimum learning parameter set has been determined for every peak, the reference data division unit 44 extracts, for each peak, 40 scans of data centered on its retention time (peak top), thereby dividing the reference data (step S5). In this embodiment, 1806 pieces of data (divided reference data) were obtained in this way. FIG. 6 shows the relationship between these 1806 pieces of data and the optimum learning parameter sets (a histogram of the number of pieces of divided reference data associated with each learning parameter set).
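The extraction in step S5 might look like the following sketch. The embodiment states only that 40 scans centered on the peak top are extracted; how an even-length window is centered and how peaks near the ends of the measurement are handled are assumptions made here, and the function name is hypothetical.

```python
WINDOW = 40  # number of scans per piece of divided reference data

def extract_window(ticc, peak_index, width=WINDOW):
    """Return `width` consecutive points of `ticc` around `peak_index`.

    Assumptions of this sketch: the window starts width // 2 scans before
    the peak top, and it is shifted inward when the peak lies too close to
    either end of the data.
    """
    if len(ticc) < width:
        raise ValueError("waveform is shorter than the window")
    start = peak_index - width // 2
    start = min(max(start, 0), len(ticc) - width)
    return ticc[start:start + width]

ticc = list(range(100))           # stand-in waveform, value == scan index
segment = extract_window(ticc, 50)
```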

In this embodiment, the initial values of the analysis parameters were the optimum learning parameter set for 667 pieces of divided reference data, whereas some other learning parameter set was optimum for the remaining 1,139 pieces. Thus, in many cases, a substantial proportion of the data is not best analyzed with the initial values of the analysis parameters.

Next, from among the 45 learning parameter sets, the learning model creation unit 45 selects the three learning parameter sets (0, 1, and 12) that are each associated with 200 or more pieces of divided reference data as candidates for the actual analysis parameter set to be used in actual analysis, and extracts them together with the divided reference data associated with each. A group of one or more pieces of divided reference data associated with one learning parameter set constitutes one reference data group. In other words, the learning model creation unit 45 of this embodiment also functions as the reference data group creation unit according to the present invention. The data created in this way serve as the training data used in the machine learning described later (step S6). It is possible to extract learning parameter sets associated with fewer than 200 pieces of divided reference data, but if too few pieces are associated, it becomes difficult for machine learning to identify the characteristic features they have in common (for example, the peak shape). The amount of data required for machine learning, in this embodiment or otherwise, depends on the type of data to be analyzed and the nature of the analysis, but it is generally desirable that each learning parameter set extracted at this stage be associated with at least several tens to about 100 pieces of (divided) reference data.
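The selection of candidate parameter sets and the formation of reference data groups in step S6 can be sketched as follows. The function names are hypothetical. The counts for parameter sets 0, 1, and 12 follow the embodiment, whereas the distribution of the remaining 714 pieces over other parameter sets is invented purely so that the example runs.

```python
from collections import Counter

MIN_GROUP_SIZE = 200  # threshold used in this embodiment

def select_candidate_sets(best_set_per_segment, min_size=MIN_GROUP_SIZE):
    """Return the parameter set numbers that are optimal for at least min_size pieces."""
    counts = Counter(best_set_per_segment)
    return sorted(number for number, n in counts.items() if n >= min_size)

def build_reference_groups(best_set_per_segment, selected):
    """Map each selected parameter set number to the indices of its pieces."""
    groups = {number: [] for number in selected}
    for index, number in enumerate(best_set_per_segment):
        if number in groups:
            groups[number].append(index)
    return groups

# One label (optimum parameter set number) per piece of divided reference data.
# Sets 5, 7, 9 and 20 and their counts are invented for illustration.
labels = ([0] * 667 + [1] * 212 + [12] * 213
          + [5] * 199 + [7] * 199 + [9] * 199 + [20] * 117)

selected = select_candidate_sets(labels)
groups = build_reference_groups(labels, selected)
```

With these labels, only parameter sets 0, 1, and 12 pass the threshold, and their groups together contain the 1,092 pieces used as training data.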

FIG. 7 shows, for each of the three learning parameter sets, an overlay of the TICC waveforms of the divided reference data associated with that learning parameter set (40 scans of data, normalized by the maximum intensity). FIG. 7(a) is for parameter set 0 (the initial values), FIG. 7(b) is for parameter set 1, and FIG. 7(c) is for parameter set 12. Merely comparing the peak shapes of the TICC waveforms in FIGS. 7(a) to 7(c) by eye makes it difficult to find a peak shape characteristic of each group (reference data group). In addition, although most of the waveforms show a peak top at the center, some of them do not appear at first glance to have a peak at the center at all. This shows that an analysis program such as AMDIS also extracts peaks that are difficult to pick out visually.
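The normalization applied to each 40-scan waveform before overlaying can be sketched as below. The function name is hypothetical, and rejecting waveforms whose maximum is not positive is an assumption of this sketch.

```python
def normalize_by_max(segment):
    """Scale a waveform so that its highest intensity becomes 1.0."""
    peak = max(segment)
    if peak <= 0:
        raise ValueError("segment has no positive intensity to normalize by")
    return [value / peak for value in segment]

normalized = normalize_by_max([2.0, 8.0, 4.0])
```

Normalizing in this way removes differences in absolute intensity between peaks, so that only the waveform shape is compared.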

Next, the learning model creation unit 45 creates a learning model by machine learning, using as training data the total of 1,092 pieces of divided reference data associated with the three learning parameter sets 0, 1, and 12 (parameter set 0: 667 pieces, parameter set 1: 212 pieces, parameter set 12: 213 pieces) (step S7). In this embodiment, the learning model was built with a convolutional neural network (CNN), and 5-fold cross-validation (CV) was used to evaluate it. In 5-fold CV, a learning model is first built from the data with CV numbers 1 to 4 and applied to the data with CV number 0 to calculate the correct answer rate for the CV number 0 data. A learning model is then built from the data with CV numbers 0 and 2 to 4 and applied to the data with CV number 1 to calculate the correct answer rate for the CV number 1 data. This is repeated in turn, and the average of the five correct answer rates is taken as the performance of the model. Because the data used to build the model differ from the data used to evaluate it, cross-validation can be regarded as a way of evaluating predictive performance on unknown data. In this embodiment, the divided reference data were split into five subsets (CV numbers 0 to 4), as shown in FIG. 8.
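The 5-fold CV procedure described above can be sketched as follows. To keep the example self-contained, a trivial stand-in "model" that always predicts the most common training label is used in place of the CNN; that stand-in, the function names, and the toy data are assumptions and do not reflect the model of the embodiment.

```python
from collections import Counter

def five_fold_cv(samples, labels, cv_ids, train_fn, n_folds=5):
    """Return (mean correct answer rate, per-fold rates) for n-fold CV."""
    rates = []
    for held_out in range(n_folds):
        train = [(x, y) for x, y, c in zip(samples, labels, cv_ids) if c != held_out]
        test = [(x, y) for x, y, c in zip(samples, labels, cv_ids) if c == held_out]
        predict = train_fn(train)              # build a model without the held-out fold
        correct = sum(predict(x) == y for x, y in test)
        rates.append(correct / len(test))      # correct answer rate on the held-out fold
    return sum(rates) / n_folds, rates

def train_majority(train):
    """Stand-in for model training: always predict the most common label."""
    most_common = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: most_common

# Toy data: ten samples labelled with parameter set numbers 0, 1 or 12.
samples = list(range(10))
labels = [0, 0, 0, 0, 0, 0, 1, 1, 12, 12]
cv_ids = [i % 5 for i in range(10)]            # CV numbers 0 to 4

mean_rate, fold_rates = five_fold_cv(samples, labels, cv_ids, train_majority)
```

Each sample is used for evaluation exactly once and never by a model that saw it during training, which is what makes the averaged rate an estimate of performance on unknown data.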

The learning model of this embodiment can be regarded as a kind of classifier that outputs a result according to the features of the input data. Although a CNN was used in this embodiment, the learning model can also be built with other techniques, such as deep learning methods other than CNN, a support vector machine (SVM), or AdaBoost.

FIG. 9 is a schematic configuration diagram of the CNN used to create the learning model in this embodiment. One-dimensional convolution was performed in this embodiment. Based on this learning model, the hyperparameters and network configuration (see, for example, Non-Patent Document 3) that gave the highest CV correct answer rate were determined (step S8). The result is shown in FIG. 10. As shown in FIG. 11, the average correct answer rate obtained with these hyperparameters and this network configuration was 88.1%. In other words, a prediction model was built that can predict the optimum parameter set with a probability of about 90% for unknown data (divided unanalyzed data obtained by extracting the portions containing peaks from the GCMS data of a sample whose constituents are unknown).
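As a reminder of what a one-dimensional convolution does to a 40-point TICC waveform, a minimal sketch is given below. The kernel values, the kernel size of 3, and the use of a ReLU activation are illustrative assumptions; the actual hyperparameters and network configuration of the embodiment are those shown in FIG. 10 and are not reproduced here.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid-mode 1-D convolution as used in CNN layers (no kernel flip)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

# A triangular stand-in peak sampled at 40 points (invented waveform).
waveform = [float(min(i, 39 - i)) for i in range(40)]
slope_kernel = [-1.0, 0.0, 1.0]        # responds to rising edges (illustrative)
features = relu(conv1d(waveform, slope_kernel))
```

A learned kernel plays the role of `slope_kernel`: it slides along the retention time axis and responds to local shape features of the peak wherever they occur.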

As described above, 1,092 pieces of divided reference data are used as training data in this embodiment. For 667 of these, that is, 61.1% of the training data, the initial values of the AMDIS analysis parameters (parameter set 0) were the optimum parameter set. In contrast, the learning model created in this embodiment achieved a correct answer rate of 88.1%. This comparison indicates that using the learning model of this embodiment makes it more likely than before that the parameter set best suited to the data is selected and that the substances contained in the sample are identified with the highest accuracy.

When the learning model has been created by the learning model creation unit 45, the unanalyzed data input reception unit 46 displays on the display unit 7 a screen for inputting the data to be analyzed. As when acquiring the reference data, the user measures a sample set in the autosampler 14 with the gas chromatograph mass spectrometer 1 and inputs the acquired GCMS data as unanalyzed data. Alternatively, when data that have already been measured are to be analyzed, unanalyzed data stored in advance in the storage unit 31 are read out and input. Like the reference data, the unanalyzed data in this embodiment are GCMS data obtained by performing 24,000 scans over the mass-to-charge ratio range of 80 to 500 between 4 and 24 minutes after sample injection. The data to be analyzed are input as unanalyzed data in this way (step S9). The measurement conditions for the unanalyzed data need not be the same as those for the reference data.

When the unanalyzed data have been input, the unanalyzed data division unit 47 extracts 40 scans of data at a time, starting from the short retention time side of the input unanalyzed data and shifting the extraction start position by, for example, 10 scans each time, as shown in FIG. 12. In this way, 2397 pieces of divided unanalyzed data are created from the unanalyzed data (step S10).
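The count of 2397 follows directly from the numbers given: with 24,000 scans, a window of 40 scans, and a shift of 10 scans, there are (24000 - 40) / 10 + 1 = 2397 start positions. A minimal sketch of the sliding-window division is shown below; the function names are hypothetical.

```python
def window_starts(n_scans, width=40, step=10):
    """Start indices of all full-width windows, shifted by `step` scans."""
    return list(range(0, n_scans - width + 1, step))

def split_unanalyzed(ticc, width=40, step=10):
    """Cut a waveform into overlapping windows of `width` points."""
    return [ticc[s:s + width] for s in window_starts(len(ticc), width, step)]

starts = window_starts(24000)
pieces = split_unanalyzed([0.0] * 24000)   # stand-in waveform of 24,000 scans
```

Because consecutive windows overlap by 30 scans, a peak that is cut off at the edge of one window appears closer to the center of a neighboring one.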

When the divided unanalyzed data have been obtained, the actual analysis parameter determination unit 48 inputs them one at a time into the learning model as unknown data and has the model output, from among parameter sets 0, 1, and 12, the parameter set best suited to analyzing that piece of divided unanalyzed data. The learning model determines the reference data group that has the highest commonality with the features of the peak contained in the divided unanalyzed data, and the parameter set corresponding to that reference data group is determined as the actual analysis parameter set (step S11).

Normally, not all of the divided unanalyzed data generated from one set of unanalyzed data contain a peak; peaks are present in only some of them. For divided unanalyzed data that contain no peak, no reference data group shares features with the data, so no optimum parameter set exists either. Such divided unanalyzed data are therefore judged to have no analysis target (peak), and an optimum parameter set is selected only for divided unanalyzed data in which an analysis target (peak) is present.

The actual analysis execution unit 49 performs an AMDIS analysis using the parameter set selected by the learning model to purify the peaks, identifies the substance corresponding to each peak, and obtains its score (step S12). When the actual analysis execution unit 49 has finished identifying the substances corresponding to the peaks, the analysis result output unit 50 displays (outputs) the name, retention time, and score of each identified substance on the display unit 7 (step S13). The number of the parameter set used to identify the peak may be output together with this information. In step S12, a configuration may also be added in which the identification result is discarded (or a warning indication is attached to it) when the retention time of the identified peak differs from the theoretical retention time of the identified substance by a predetermined time (for example, 0.25 minutes) or more. This eliminates the possibility of misidentification caused by a coincidental match of mass spectra and further improves identification accuracy.
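The optional retention time check in step S12 can be sketched as follows. The function name, the returned status strings, and the `discard` switch between discarding and flagging are hypothetical; only the example tolerance of 0.25 minutes and the "or more" comparison come from the description.

```python
def check_identification(peak_rt, theoretical_rt, tolerance=0.25, discard=True):
    """Classify an identification by its retention time deviation (minutes).

    Returns "accepted" when the deviation is below `tolerance`; otherwise
    "discarded" or, if `discard` is False, "flagged" for the user's attention.
    """
    if abs(peak_rt - theoretical_rt) >= tolerance:
        return "discarded" if discard else "flagged"
    return "accepted"

status_ok = check_identification(10.0, 10.1)
status_far = check_identification(10.0, 10.5)
status_warn = check_identification(10.0, 10.5, discard=False)
```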

The main purpose of the data analysis method and device of this embodiment is the analysis of unanalyzed data described above, but the data analysis device of this embodiment further includes a learning model update unit 51.

When a predetermined number (for example, 30) of unanalyzed data analyzed by the actual analysis execution unit 49 (hereinafter called "analyzed data") have accumulated, the learning model update unit 51 sets those analyzed data as the reference data described earlier. Once the reference data have been set in this way, the same processing as in steps S1 to S8 above is performed in order. The hyperparameters and network configuration of the learning model are then readjusted so that the correct answer rate in 5-fold CV is highest, and the learning model is updated. By using analyzed data as reference data in this way, the learning model can be updated so that it can handle a wider variety of data. Although the learning model update unit 51 here updates the learning model each time a predetermined number of analyzed data have accumulated, the learning model may instead be updated (rebuilt) each time analyzed data are generated. The description here takes as an example the case in which the learning model is updated by online learning (sequential learning), in which machine learning is performed using only the newly added analyzed data as reference data, but the learning model may instead be updated by batch learning using both the reference data already used for machine learning and the analyzed data.
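The trigger for updating the learning model can be sketched as below. The class name and the `retrain` callback are hypothetical placeholders for the processing of steps S1 to S8; the batch size of 30 is the example value given above.

```python
class LearningModelUpdater:
    """Collects analyzed data and triggers retraining once enough has accumulated."""

    def __init__(self, retrain, batch_size=30):
        self._retrain = retrain        # callback that rebuilds the model (hypothetical)
        self._batch_size = batch_size  # the "predetermined number" of analyzed data
        self._pending = []

    def add(self, analyzed_data):
        """Store one analyzed data item; return True if retraining was triggered."""
        self._pending.append(analyzed_data)
        if len(self._pending) < self._batch_size:
            return False
        batch, self._pending = self._pending, []
        self._retrain(batch)           # only the new batch is passed (online learning)
        return True

retrain_calls = []                     # records each batch handed to the callback
updater = LearningModelUpdater(retrain_calls.append, batch_size=30)
triggered = [updater.add(f"run-{i}") for i in range(65)]
```

Setting `batch_size=1` corresponds to updating the model each time analyzed data are generated; for batch learning, the callback would combine the new batch with the reference data used previously.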

FIG. 13 schematically illustrates the concept of the data analysis method and analysis device of this embodiment. As shown in FIG. 13, in this embodiment a learning model (prediction model f(x)) is created in advance by machine learning as a classifier that outputs a result f(x) according to the features of input data x. When the GCMS data to be analyzed are input, divided unanalyzed data are created from the GCMS data, the waveform data of the total ion current chromatogram created from the divided unanalyzed data are input to the learning model, and the optimum parameter set is output. Using this as the analysis parameters, AMDIS performs peak purification (peak separation and the like) and identifies the substance corresponding to each peak, and the results (name of the identified substance and identification score) are output.

In the data analysis method and data analysis device of this embodiment, the learning model created by machine learning selects one of the learning parameter sets as the parameter set best suited to the data to be analyzed (the actual analysis parameter set), and the AMDIS analysis is performed with that actual analysis parameter set. The user therefore does not need to change the analysis parameter values personally and can easily obtain the optimum analysis result with high probability. The analysis results also do not vary with the user's level of skill. Furthermore, because the learning model is updated each time a predetermined number of analyzed data have accumulated, a wide variety of data can always be analyzed with high accuracy.

Next, a modified example of the data analysis device according to the present invention will be described. The data analysis device of the above embodiment (control/processing device 3) both creates the learning model and analyzes data, whereas the data analysis device of the modified example analyzes data using a learning model created in advance.

FIG. 14 is a block diagram of a control/processing device 3a, which is a modified example of the data analysis device according to the present invention. Components common to the control/processing device 3 of the above embodiment are given the same reference numerals, and their description is omitted as appropriate. As in the above embodiment, the control/processing device 3a of the modified example is in fact a personal computer, and each functional block shown in FIG. 14 is embodied by executing the data analysis program 40a.

The control/processing device 3a differs from the control/processing device 3 of the above embodiment in that a learning model (CNN) 34 corresponding to the analysis program (AMDIS in the above embodiment) is installed in advance, and the learning parameter sets used to construct the learning model 34 are stored in the storage unit 31a. This learning model 34 is a ported copy of the learning model 34 created by executing steps S1 to S8 described in the above embodiment, and it is installed before the personal computer configured as the control/processing device 3a of the modified example is shipped.

Accordingly, the user of the control/processing device 3a of the modified example can analyze data with the learning model by executing only steps S9 to S13, without personally executing steps S1 to S8 of the above embodiment.

Like the above embodiment, the control/processing device 3a of the modified example includes a learning model update unit 51a, which updates the parameters and network configuration of the learning model 34 as appropriate each time a predetermined number of analyzed data have accumulated. Whereas the above embodiment updates the learning model by either batch learning or online learning, the control/processing device 3a of the modified example updates it by online learning.

The above embodiment is an example and can be modified as appropriate within the gist of the present invention.
In the above embodiment, the learning parameter set determination unit 43 determines a single optimum learning parameter set for each peak. Alternatively, every learning parameter set that yields a score (evaluation value) equal to or higher than a predetermined value may be treated as a parameter set suitable for analysis. As a further alternative, every learning parameter set whose score is at least a certain proportion (for example, 90%) of the highest score obtained for the peak at the same retention time may be treated as suitable. In these cases, the same peak data (divided reference data) are associated with a plurality of parameter sets.
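The two alternative criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `suitable_sets`, the dictionary layout, and the example scores are hypothetical and do not come from the embodiment.

```python
def suitable_sets(scores, min_score=None, ratio=None):
    """Return every parameter set number judged suitable for one peak.

    `scores` maps a parameter set number to the score obtained for the peak
    (hypothetical layout). Pass `min_score` for an absolute threshold, or
    `ratio` (e.g. 0.9) for a threshold relative to the best score.
    """
    if (min_score is None) == (ratio is None):
        raise ValueError("specify exactly one of min_score or ratio")
    if min_score is not None:
        threshold = min_score
    else:
        threshold = ratio * max(scores.values())
    return sorted(n for n, s in scores.items() if s >= threshold)


# Made-up scores for a single peak.
scores = {0: 84, 4: 92, 9: 85, 12: 60}
by_value = suitable_sets(scores, min_score=85)  # absolute threshold
by_ratio = suitable_sets(scores, ratio=0.9)     # at least 90% of the best score
```

With either criterion a single peak can end up in several reference data groups, which is the situation the last sentence of the paragraph describes.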

In the above embodiment, divided unanalyzed data are created from the whole of the unanalyzed data and input to the learning model. Alternatively, the portions of the unanalyzed data in which a peak (analysis target) exists may be extracted in advance, and divided unanalyzed data may be created only from those portions. For example, peaks may be extracted by analyzing the unanalyzed data with AMDIS using the initial values of the analysis parameters as they are, or separate peak detection software may be used to locate the portions of the unanalyzed data that contain peaks. The user may also specify the range in which a peak is considered to exist.

Furthermore, in the above embodiment, a combination of values of one or more analysis parameters is treated as one learning parameter set, and the learning parameter set best suited to analyzing the unanalyzed data is selected from the plurality of learning parameter sets as the actual analysis parameter set. In other words, the description took as an example the case of predicting one category, namely a parameter set number corresponding to one of the plurality of learning parameter sets prepared in advance. In machine learning terms, this is a "classification" approach.

By contrast, the actual analysis parameter set can also be determined by a "regression" approach. Specifically, the value of each analysis parameter is associated directly with a reference data group (each reference data group may consist of only one piece of reference data), and the value of each analysis parameter to be used in analyzing the unanalyzed data is estimated directly. With this approach, even if the learning parameter sets contain only two values (for example, 5 and 10) for a given analysis parameter, a regression analysis based on the commonality between the unanalyzed data and each reference data group (or each piece of reference data), for example the similarity of TICC waveforms, can yield an intermediate value that is neither of the two (for example, 7) as the optimum value of that analysis parameter for the unanalyzed data. Such regression analysis can be performed individually for each of the one or more analysis parameters, or collectively for the one or more analysis parameters (that is, per parameter set).
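The text does not specify which regression technique is used. As one possible illustration, the sketch below assumes a similarity-weighted mean: each reference data group contributes its parameter value in proportion to its commonality with the unanalyzed data. The function name `regress_parameter` and the similarity values 0.6 and 0.4 are hypothetical.

```python
def regress_parameter(similarities, values):
    """Similarity-weighted mean of one analysis parameter.

    similarities[i] is the commonality between the unanalyzed data and
    reference data group i; values[i] is the parameter value tied to group i.
    A weighted mean is an assumed stand-in for the regression in the text.
    """
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one similarity must be positive")
    return sum(s * v for s, v in zip(similarities, values)) / total


# Two groups whose parameter values are 5 and 10, as in the example above.
# The unanalyzed data resemble the first group somewhat more than the second.
estimate = regress_parameter([0.6, 0.4], [5, 10])
print(f"estimated parameter value: {estimate:.2f}")
```

The estimate falls between the two trained values, which a classifier choosing among fixed parameter set numbers cannot produce.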

The above embodiment described the case of identifying substances contained in a sample from three-dimensional data obtained by measuring the sample with a gas chromatograph mass spectrometer, but the data analysis method, data analysis device, and learning model creation method according to the present invention can be used widely for analyzing various kinds of data.

For example, Mass++ is one software package for analyzing mass spectrometry data of samples (see Non-Patent Document 2). Mass++ reads LCMS data obtained by measuring a sample containing peptides or proteins with a liquid chromatograph mass spectrometer (including MALDI), applies processing such as smoothing of chromatograms and mass spectra, baseline removal, and peak detection, creates a peak list of the mass spectrum, sends it to a database search server (Mascot server) to identify peptides, and identifies the proteins predicted from the identified peptides. As with AMDIS, Mass++ yields a score (confidence score) representing the reliability of the identification of each peptide or protein.

Various analysis parameters are used when Mass++ creates a list of mass spectra from LCMS data, and further analysis parameters are used to identify the substances corresponding to the resulting peak list. Conventionally, either the initial values of these analysis parameters were used as they are, or the user had to change them based on personal experience; applying the present invention makes it easy to obtain optimum identification results.

The present invention can also be applied to analyses such as identifying substances contained in a sample by analyzing, with a predetermined analysis program, spectral data obtained by measuring the sample with an analyzer other than a chromatograph or mass spectrometer, for example a spectrometer such as a Fourier transform infrared spectrophotometer. It can also be used to analyze data obtained with a nuclear magnetic resonance (NMR) apparatus, a near-infrared spectroscopy (NIRS) brain function imaging apparatus, or the like. It can further be used for analyses such as predicting future stock price movements from the most recent stock price movements on the basis of past stock price data. In short, the present invention can be applied to a wide variety of data analyses, provided it can be defined that performing the analysis with optimum analysis parameters raises an evaluation value (for example, a higher rate of correct answers to a given problem, higher purity of a target substance, lower power consumption, or greater profit).

In the above embodiment, an exhaustive analysis using the plurality of learning parameter sets is executed for all of the reference data (and analyzed data), and the learning model is created by so-called supervised learning, which uses only training data for which the optimum parameter set is known in advance for every peak. The learning model can instead be created by semi-supervised learning, using training data in which data for peaks whose optimum parameter set is unknown are added to such reference data.

The above embodiment and modified example described the use of batch learning and online learning as machine learning techniques, but various other techniques can be used, such as transfer learning (in which a learning model is additionally trained on training data belonging to a domain different from the one for which the model was created) and reinforcement learning (in which there is no teacher explicitly indicating the output for each input; instead a reward is given as an evaluation of how good the outcome of a series of actions is, and the actions that maximize the reward are learned by trial and error while the information on actions and outcomes is updated).

For example, the manufacturer of the personal computer serving as the control/processing device 3a of the modified example may have installed a CNN 34 for judging the quality of cloned cells cultured at that company, and the purchaser may use it to judge the quality of cells in undifferentiated maintenance culture in addition to the quality of cloned cells. That is, when the CNN 34 is used to analyze data acquired in an environment different from the one in which it was created, the learning model update unit 51a updates the CNN 34, created for judging the quality of cloned cells, with the data acquired in the other environment. The configurations described in the above embodiment and modified example can be used even when such transfer learning is performed. Transfer learning is likewise performed when a learning model created for an analysis that removes noise from image data of a sample captured with an optical microscope and detects characteristic structures of the sample is applied to an analysis that removes noise from data acquired by mass spectrometry of a sample with an imaging mass spectrometer and detects characteristic structures of the sample.

The method and device according to the present invention can also be used when, for the action of changing control parameters of a mass spectrometer such as voltage and temperature, the intensity of the peak obtained from the measurement is used as the reward and the action of adjusting the control parameters so as to maximize that reward is learned.

1 ... Gas chromatograph mass spectrometer
10 ... Gas chromatograph unit
11 ... Column oven
12 ... Sample vaporization chamber
13 ... Injector
14 ... Autosampler
15 ... Capillary column
20 ... Mass spectrometry unit
21 ... Ion source
211 ... Ionization chamber
212 ... Filament
22 ... Lens electrode
23 ... Quadrupole mass filter
23 ... Vacuum chamber
24 ... Ion detector
3, 3a ... Control/processing device
31, 31a ... Storage unit
32 ... Substance database
33 ... Analysis program
34 ... CNN
40, 40a ... Data analysis program
41 ... Reference data acquisition unit
42 ... Learning parameter set creation unit
43 ... Learning parameter set determination unit
44 ... Reference data division unit
45 ... Learning model creation unit
46 ... Unanalyzed data input reception unit
47 ... Unanalyzed data division unit
48 ... Actual analysis parameter set determination unit
49 ... Actual analysis execution unit
50 ... Analysis result output unit
51, 51a ... Learning model update unit
6 ... Input unit
7 ... Display unit

Chromatograph mass spectrometers, which combine a chromatograph with a mass spectrometer, are widely used to identify and quantify target compounds contained in a sample. In a chromatograph mass spectrometer, the sample is introduced into the column of the chromatograph, the substances contained in the sample are separated by their differences in retention time (RT), and they are introduced into the mass spectrometer (MS). The substances introduced into the mass spectrometer are ionized, then separated according to mass-to-charge ratio (m/z) and detected. This yields three-dimensional data in which the detected ion intensity is plotted against two axes, retention time (RT) and mass-to-charge ratio (m/z). In these three-dimensional data, the detection intensity (signal intensity) of the ions at each mass-to-charge ratio reflects the amount, in the sample, of the substance that produces ions having that mass-to-charge ratio.

Integrating the signal intensity along the mass-to-charge ratio (m/z) axis at each point on the retention time (RT) axis of these three-dimensional data gives the total ion current (TIC). Plotting the total ion current along the retention time axis gives the total ion current chromatogram (TICC).
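As a minimal sketch of this calculation, the Python snippet below sums a small intensity table along the m/z axis to obtain the TIC at each retention time. The nested-list layout, the function name `total_ion_current`, and the toy numbers are assumptions for illustration, not a real instrument data format.

```python
def total_ion_current(intensity):
    """Sum the signal along the m/z axis at every retention-time point.

    intensity[i][j] is the detected intensity at retention-time index i and
    m/z index j (a hypothetical in-memory layout).
    """
    return [sum(row) for row in intensity]


# Toy data: 4 retention-time points x 3 m/z channels.
rt_minutes = [1.00, 1.01, 1.02, 1.03]
intensity = [
    [0, 1, 0],
    [2, 5, 1],
    [4, 9, 3],
    [1, 2, 0],
]

tic = total_ion_current(intensity)

# The TICC is the TIC plotted against retention time.
ticc = list(zip(rt_minutes, tic))
apex_rt, apex_tic = max(ticc, key=lambda point: point[1])
```

Here `ticc` holds the (retention time, TIC) pairs that would be plotted, and `apex_rt` is the retention time of the highest point in this toy chromatogram.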

If the substances contained in the sample are sufficiently separated from one another in the chromatograph column, the TICC waveform shows a unimodal, bell-shaped peak at the retention time of each substance. Identifying the substance from the mass spectrum at that retention time reveals which substance eluted at that time. A substance is identified by comparing the mass spectrum in question with the measured or theoretical mass spectra of known substances stored in a database (DB). The items compared include the mass-to-charge ratio (m/z) values at which mass peaks are present and the intensities of the mass peaks. The degree of agreement (score) between mass spectra provides a quantitative measure of how reliable the identification is. The amount of each substance separated by the chromatograph can also be estimated from the area or height of its peak in the TICC waveform.

Overlapping peaks are therefore separated by signal processing, statistical processing, or the like, in a procedure called peak deconvolution, which purifies each TICC peak so that a single mass spectrum contains only the group of mass peaks derived from a single substance. Purifying the TICC peaks in this way makes it possible to estimate which TICC peaks were superimposed in the TICC waveform of the measurement data. In most cases, a dedicated analysis program is used to perform peak deconvolution. A representative analysis program for purifying peaks in measurement data obtained by gas chromatography/mass spectrometry (GC/MS), referred to as GCMS data, is AMDIS (Automated Mass Spectral Deconvolution and Identification System), provided by the National Institute of Standards and Technology (NIST) of the United States (see Non-Patent Document 1). AMDIS uses six analysis parameters to purify peaks: peak width, excluded mass-to-charge ratios, number of neighboring peaks, peak spacing, peak detection sensitivity, and model goodness of fit. An initial value is provided for each of these analysis parameters, and in most cases the initial values are used as they are.

The actual analysis parameter set determination step can be performed, for example, by determining the reference data group that has the highest commonality with the unanalyzed data and using the learning parameter set corresponding to that reference data group, as is, as the actual analysis parameter set. When one of the parameter set numbers assigned to the plurality of learning parameter sets is to be predicted, the parameter set number associated with the reference data group having the highest commonality is selected. Here, the combination of the one or more parameters is treated as a single "parameter set", and the question is which parameter set should be used for the analysis. In machine learning terms, this is a "classification" approach.
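A simple way to picture this selection is a nearest-group lookup. The sketch below assumes that each reference data group is summarized by one representative TICC waveform and that commonality is measured by cosine similarity; both are illustrative assumptions, as are all names and numbers. It is a stand-in for, not a reproduction of, the learning model used in the embodiment.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long waveforms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_parameter_set(waveform, group_waveforms):
    """Return the parameter set number whose reference group is most similar."""
    return max(
        group_waveforms,
        key=lambda set_no: cosine_similarity(waveform, group_waveforms[set_no]),
    )


# Hypothetical representative waveform per parameter set number.
group_waveforms = {
    0: [0, 2, 8, 2, 0],  # single sharp peak
    3: [0, 5, 6, 5, 0],  # broad, flattened peak
    7: [6, 1, 0, 1, 6],  # signal at the edges only
}
unanalyzed = [0, 3, 9, 3, 0]
selected = classify_parameter_set(unanalyzed, group_waveforms)
```

The unanalyzed waveform is a sharp single peak, so the lookup returns the number of the group with the sharp-peak representative.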

The control/processing device 3 further includes, as functional blocks, a reference data acquisition unit 41, a learning parameter set creation unit 42, a learning parameter set determination unit 43, a reference data division unit 44, a learning model creation unit 45, an unanalyzed data input reception unit 46, an unanalyzed data division unit 47, an actual analysis parameter determination unit 48, an actual analysis execution unit 49, an analysis result output unit 50, and a learning model update unit 51. The control/processing device 3 is in fact a computer, and these functional blocks are embodied by running, on its processor, the data analysis program 40 installed in advance in the control/processing device 3. An input unit 6, such as a mouse and keyboard, and a display unit 7 are connected to the control/processing device 3.

When the user instructs creation of learning parameter sets by operating the input unit 6, the learning parameter set creation unit 42 runs the analysis program 33 installed in advance in the control/processing device 3 and displays a parameter setting screen on the display unit 7. The analysis program of this embodiment is AMDIS, which uses six analysis parameters: Component width, Omit m/z, Adjacent peak subtraction, Resolution, Sensitivity, and Shape requirement. FIG. 4 shows the meaning of each parameter. In this embodiment, 45 parameter sets were created relative to the initial values (Parameter Set Number: 0), for example by lowering Adjacent peak subtraction from Two to One or lowering Resolution from High to Medium. FIG. 5 shows some of them (the initial values and 10 parameter sets). The description here creates the learning parameter sets after acquiring the reference data, but the two may be performed in the reverse order or in parallel. The learning parameter set creation unit 42 may also read previously created learning parameter sets from the storage unit 31 in accordance with the user's instruction. A plurality of learning parameter sets are created in this way (step S2).
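The text does not say how the 45 parameter sets were enumerated. Purely as an illustration, the sketch below numbers every combination of the two variations that the text names, with the initial values listed first so that they receive number 0. The function name and the restriction to two parameters are assumptions; the other four AMDIS parameters are left out rather than guessed.

```python
from itertools import product

# Only the two variations named in the text; the first entry of each list is
# the initial value, so combination number 0 is the initial parameter set.
VARIATIONS = {
    "Adjacent peak subtraction": ["Two", "One"],
    "Resolution": ["High", "Medium"],
}


def make_parameter_sets(variations):
    """Number every combination of the listed parameter values from 0."""
    names = list(variations)
    combos = product(*(variations[name] for name in names))
    return {number: dict(zip(names, combo)) for number, combo in enumerate(combos)}


parameter_sets = make_parameter_sets(VARIATIONS)
for number, values in parameter_sets.items():
    print(number, values)
```

A full grid over all six parameters would grow quickly, so a real set of 45 may well have been chosen by hand or by varying one parameter at a time; the grid here is just the simplest scheme to show.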

However, if the sample contains several substances whose retention times are the same or close, the eluate leaving the chromatograph at and around those retention times contains a mixture of substances. The mass spectra at and around those retention times then contain mass peaks derived from several substances, and the peak in the TICC waveform, obtained by summing those mass peaks, is likewise a superposition of peaks derived from several substances. Normally, some substance is assumed to have eluted at the retention time of the peak top of each unimodal peak in the TICC waveform. With overlapping peaks, however, the peak shape may be distorted, a small unimodal peak may be buried in a large one, or the peak may become multimodal. In such cases the retention time of the peak top cannot be determined correctly. If the measurement data also contain noise, or the signal intensity includes a baseline component, the situation becomes more complicated, and it becomes even harder to determine the retention time of a small TICC peak derived from a substance present in the sample only in a small amount.

In the method for creating a learning model for data analysis, which is the second aspect of the present invention, a learning model is created by machine learning that uses, as training data, the plurality of learning parameter sets each associated with its reference data group, obtained by performing the same learning parameter set creation step, learning parameter set determination step, and reference data group creation step as in the data analysis method of the first aspect. Various machine learning techniques have been proposed in recent years (for example, Patent Document 1), and the machine learning here can use, for example, deep learning, a convolutional neural network (CNN), which is one form of deep learning, a support vector machine (SVM), or AdaBoost. The learning model created in this way can be used to advantage in the actual analysis parameter set determination step of the data analysis method according to the first aspect of the present invention.

Next, the learning parameter set determination unit 43 determines, for each peak in the 32 GCMS data sets, the learning parameter set best suited to identification (step S4). When the peak at the same retention time is identified in analyses using several learning parameter sets, the learning parameter set that gave the analysis result with the highest score is taken as the optimum learning parameter set for identifying that peak. If several sets share the highest score, the one with the smallest learning parameter set number is taken as optimum. This processing is performed for every peak for which a substance was identified. However, if the difference between the retention time of the peak and the theoretical retention time of the identified substance exceeds 0.25 minutes, the result is treated as a misidentification regardless of its score, and the learning parameter set with the next-highest score whose retention time difference is 0.25 minutes or less is taken as the optimum learning parameter set for identifying that peak.
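The selection rule for one peak can be sketched as follows. The dictionary keys, the function name `best_parameter_set`, and the example results are hypothetical; only the three rules (highest score, smallest set number on ties, rejection when the retention time difference exceeds 0.25 minutes) come from the text.

```python
RT_TOLERANCE_MIN = 0.25  # from the text: larger differences count as misidentification


def best_parameter_set(results):
    """Pick the learning parameter set for one peak.

    `results` is a list of dicts with the hypothetical keys 'set_no', 'score',
    'peak_rt', and 'theoretical_rt' (retention times in minutes).
    Returns the chosen set number, or None if every result is rejected.
    """
    valid = [
        r for r in results
        if abs(r["peak_rt"] - r["theoretical_rt"]) <= RT_TOLERANCE_MIN
    ]
    if not valid:
        return None
    # Highest score first; ties go to the smaller parameter set number.
    chosen = min(valid, key=lambda r: (-r["score"], r["set_no"]))
    return chosen["set_no"]


# Made-up analysis results for a single peak.
results = [
    {"set_no": 0, "score": 80, "peak_rt": 5.10, "theoretical_rt": 5.12},
    {"set_no": 3, "score": 95, "peak_rt": 5.10, "theoretical_rt": 5.60},
    {"set_no": 7, "score": 88, "peak_rt": 5.10, "theoretical_rt": 5.15},
    {"set_no": 5, "score": 88, "peak_rt": 5.10, "theoretical_rt": 5.15},
]
chosen = best_parameter_set(results)
```

In this example set 3 has the highest score but is rejected because its retention time is off by 0.5 minutes; sets 5 and 7 then tie, and the smaller number is chosen.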

In this embodiment, learning parameter set 0, which holds the initial values of the analysis parameters, was the optimum learning parameter set for 667 pieces of divided reference data, while some other learning parameter set was optimum for 1,139 pieces. Thus, in many cases, a certain proportion of the data is not analyzed optimally with the initial values of the analysis parameters.

Claims

A data analysis method for analyzing data to be analyzed with a predetermined analysis program by setting a value for each of one or more analysis parameters, the method comprising:
a learning parameter set creation step of creating a plurality of learning parameter sets that differ from one another in the value of at least one of the one or more analysis parameters;
a learning parameter set determination step of executing, for each of a plurality of reference data, the analysis by the analysis program using each of the plurality of learning parameter sets, and determining a learning parameter set suitable for the analysis according to a predetermined criterion;
a reference data group creation step of associating, with each of the plurality of learning parameter sets, a reference data group, which is a group of the reference data for which that learning parameter set was determined to be suitable for the analysis in the learning parameter set determination step;
an unanalyzed data input step of inputting unanalyzed data, which is data that has not been analyzed;
an actual analysis parameter set determination step of obtaining the commonality between the unanalyzed data and each reference data group according to a predetermined criterion and, based on the commonality, determining an actual analysis parameter set by obtaining, from the learning parameter sets associated with the reference data groups, a value of each of the one or more analysis parameters suitable for analyzing the unanalyzed data; and
an actual analysis step of executing the analysis of the unanalyzed data by the analysis program using the actual analysis parameter set.

The data analysis method according to claim 1, further comprising
a learning model creation step of creating a learning model by machine learning that uses, as learning data, the reference data groups associated with the respective learning parameter sets,
wherein a parameter set is determined using the learning model in the parameter selection step.

The data analysis method according to claim 2, wherein the machine learning uses deep learning, a support vector machine, and AdaBoost.

The data analysis method according to claim 2, further comprising
a learning model update step of determining a learning parameter set suitable for the analysis by executing the learning parameter set determination step using the unanalyzed data as the reference data, and performing the machine learning using, as learning data, data in which the unanalyzed data is associated with the learning parameter set suitable for the analysis.

The data analysis method according to claim 1, wherein the reference data and the unanalyzed data are mass chromatograms, total ion current chromatograms, mass spectra, spectroscopic spectra, or image data.

The data analysis method according to claim 1, wherein, in the parameter set determination step, a parameter set suitable for the analysis is determined for some or all of divided reference data obtained by dividing the reference data, and
in the reference data group creation step, the reference data group is created by grouping the divided reference data.

The data analysis method according to claim 6, wherein the analysis program extracts data of one or more peaks included in the unanalyzed data and collates the data with a database of known substances to identify the substances corresponding to the one or more peaks.

The data analysis method according to claim 7, wherein, for each of the data of the one or more peaks included in the unanalyzed data, a degree of agreement between the identified substance and the data stored in the database is obtained.

The data analysis method according to claim 8, wherein the predetermined criterion in the optimum parameter determination step is that the learning parameter set yielding the highest degree of agreement is taken as the optimum learning parameter set.

The data analysis method according to claim 6, wherein a plurality of divided unanalyzed data are created by dividing, according to a predetermined criterion, data obtained by measuring a sample to be analyzed with an analyzer, and
some or all of the plurality of divided unanalyzed data are input as the unanalyzed data in the unanalyzed data input step.

The measurement data analysis method according to claim 10, wherein the divided unanalyzed data is data of one or a plurality of peaks.

The measurement data analysis method according to claim 1, wherein, in the actual analysis parameter set determination step, the actual analysis parameter set is determined only when there is a reference data group whose commonality is equal to or higher than a predetermined reference.

A measurement data analysis apparatus that analyzes data to be analyzed by setting values for one or more analysis parameters and using a predetermined analysis program, the apparatus comprising:
a learning parameter set creation unit that creates a plurality of learning parameter sets that differ from each other in the value of at least one of the one or more analysis parameters;
a learning parameter set determination unit that executes, for each of a plurality of reference data, analysis by the analysis program using each of the plurality of learning parameter sets, and determines a learning parameter set suitable for the analysis according to a predetermined criterion;
a reference data group creation unit that associates, with each of the plurality of learning parameter sets, a reference data group that is a group of the reference data for which that learning parameter set was determined to be suitable for analysis by the learning parameter set determination unit;
an unanalyzed data input unit that inputs unanalyzed data, which is measurement data that has not been analyzed;
an actual analysis parameter set determination unit that obtains the commonality between the unanalyzed data and each reference data group according to a predetermined criterion and, based on the commonality, determines an actual analysis parameter set by obtaining, from the learning parameter sets associated with the respective reference data groups, a value of each of the one or more analysis parameters that is suitable for analysis of the unanalyzed data; and
an actual analysis execution unit that executes analysis of the unanalyzed data by the analysis program using the actual analysis parameter set.

A method of creating a learning model for data analysis, the learning model being used to determine the values of one or more analysis parameters used when data to be analyzed is analyzed by a predetermined analysis program, the method comprising:
a learning parameter set creation step of creating a plurality of learning parameter sets that differ from each other in the value of at least one of the one or more analysis parameters;
a learning parameter set determination step of executing, for each of a plurality of reference data, analysis by the analysis program using each of the plurality of learning parameter sets, and determining a learning parameter set suitable for the analysis according to a predetermined criterion;
a reference data group creation step of associating, with each of the plurality of learning parameter sets, a reference data group that is a group of the reference data for which that learning parameter set was determined to be suitable for analysis in the learning parameter set determination step; and
a learning model creation step of creating a learning model by machine learning that uses, as learning data, the reference data groups associated with the respective learning parameter sets.