CN116755724B

CN116755724B - PCA software installation method

Info

Publication number: CN116755724B
Application number: CN202310763998.XA
Authority: CN
Inventors: 刘敏; 李晔; 杨静; 何天豪; 黄晔; 何尔凯
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2022-11-29
Filing date: 2023-06-27
Publication date: 2024-02-02
Anticipated expiration: 2043-06-27
Also published as: CN116755724A

Abstract

The invention discloses PCA software and an installation method, and relates in particular to the technical field of PCA software application. The software is named Advanced Spectra PCA Toolbox, and its download address and method of use are published in the application. The application provides the download address and detailed download location of the PCA software, and gives a complete and clear explanation of the installation, parameter modification and use of the downloaded software. This solves the problems that existing PCA software is not convenient or fast to download and use, and that installation errors readily arise when users install the software themselves after downloading it, which affects normal use of the PCA software.

Description

PCA software installation method

Technical Field

The invention relates to the technical field of PCA software use, in particular to a PCA software installation method.

Background

PCA is the abbreviation of Principal Component Analysis, a commonly used data analysis method. Through a linear transformation, PCA converts raw data into a representation whose dimensions are linearly uncorrelated; it can be used to extract the main characteristic components of data and is commonly used for dimension reduction of high-dimensional data. Data dimension reduction is a common problem in unsupervised learning.
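As an illustration of the transformation described above, the following is a minimal sketch of PCA computed from a singular value decomposition with NumPy. It is not the code of the software described in this document; the function name `pca_scores` and the synthetic data are assumptions made only for the example.

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Project rows of `data` onto the leading principal components."""
    centered = data - data.mean(axis=0)          # mean-center each feature
    # SVD of the centered matrix: rows of vt are the principal directions
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T      # coordinates in PC space
    explained = (s ** 2) / np.sum(s ** 2)        # fraction of variance per PC
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic 3-feature data whose third feature is nearly a copy of the first
x = rng.normal(size=(50, 2))
data = np.column_stack([x[:, 0], x[:, 1], x[:, 0] + 0.01 * rng.normal(size=50)])
scores, explained = pca_scores(data, n_components=2)
```

Because the third feature is almost redundant, two components retain nearly all of the variance, and the two score columns are uncorrelated with each other.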

Existing PCA software is not convenient or fast to download and use, so installation errors readily arise when users install the software themselves after downloading it, which affects normal use of the PCA software.

Disclosure of Invention

In order to achieve the above purpose, the present invention provides the following technical solutions:

A PCA software comprising a Python script and an editable software file, wherein the software packages used by the Python script comprise pandas, NumPy, scikit-learn and Matplotlib. The whole workflow comprises file input and output, interaction with the user, data preprocessing, principal component analysis, drawing of visualization images and calculation of metrics. The input is text data of the original spectra, containing all peak positions and peak intensities; the input files are divided into different groups as required, representing different sample groups. The output principal component analysis results comprise a bar chart of the proportion of variance explained by each principal component, score scatter plots using the principal components as extracted features, and a factor loading plot of the peaks for each principal component. The software automatically reads mass spectrum information from txt data files in a specified format and extracts the principal components; the first 5 most important principal components are combined in pairs, serving as the x-axis and y-axis respectively, to draw score scatter plots, and for these components the program calculates the mean center and variance of the values and then draws a 90% confidence interval.

The name of the PCA software is Advanced Spectra PCA Toolbox, and the download address of the Advanced Spectra PCA Toolbox software is https://docs.anaconda.com/anaconda/install/.

the installation method of the PCA software comprises the following steps:

Step S1: after the software is downloaded, open the installation package and decompress it;

Step S2: locate the unit mass spectrum txt data in the decompressed software files and export it;

Step S3: locate the PCA7.py file and its accompanying PCA subfolder, then change the path of the PCA7.py file and its accompanying PCA subfolder;

Step S4: group the names of the mass spectrum files, and after grouping, encode and replace the mass spectrum files;

Step S5: determine the output file, find the plots in the output file, and set the group names of the groups shown in the plots;

Step S6: find and open the Anaconda Powershell Prompt.exe program, then run it as administrator;

Step S7: enter the program and view the PCA output results.

Compared with the prior art, the invention has the beneficial effects that:

The application provides the download address and detailed download location of the PCA software, and gives a complete and clear explanation of the installation, parameter modification and use of the downloaded software. This solves the problems that existing PCA software is not convenient or fast to download and use, and that installation errors readily arise when users install the software themselves after downloading it, which affects normal use of the PCA software.

Detailed Description

In the following, the technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the embodiment of the invention, the following technical scheme is provided:

Batch processing of SIMS mass spectrometry data and automated PCA analysis are performed by custom-written Python scripts and editable software files (Editable Files of software.zip) in a free, open-source and portable Python-based scientific environment named WinPython (v3.6.7.0, https://winpython.github.io/). The software packages used in the script include pandas (v0.23.4), NumPy (v1.15), scikit-learn (v0.20.2) and Matplotlib (v3.0.2). The whole workflow comprises file input and output, interaction with the user, data preprocessing, principal component analysis, drawing of visualization images and calculation of metrics. The input is text data of the original spectra, containing all peak positions and peak intensities. The input files must be divided into different groups as required, representing different sample groups. The output principal component analysis results include a bar chart of the proportion of variance explained by each principal component, score scatter plots using the principal components as extracted features, and a factor loading plot of the peaks for each principal component. Details of downloading and installing the Python packages are given in our manual (Document S1, Manual.docx, in the Supporting Information).

The program automatically reads mass spectrum information from txt data files in a specified format (details of the txt data format are given in the operation manual) and extracts the principal components. The first 5 most significant principal components are combined in pairs, one serving as the x-axis and the other as the y-axis, and score scatter plots are drawn; for each individual data set, the program calculates the mean center and variance of the values and then draws a 90% confidence interval. Combining the score and loading plots obtained in this way reveals deeper information about compositional differences between samples.
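The pairing of components and the confidence region can be sketched as follows. This is an illustrative sketch only: the text does not say how the 90% region is computed, so the example assumes that the scores of one group are roughly bivariate normal and derives an ellipse from the group mean and covariance. The function names `pc_pairs` and `confidence_ellipse` are hypothetical.

```python
import math
from itertools import combinations

def pc_pairs(n_components=5):
    """All unordered (x-axis, y-axis) pairs of PC labels, e.g. ('PC1', 'PC2')."""
    labels = [f"PC{i}" for i in range(1, n_components + 1)]
    return list(combinations(labels, 2))

def confidence_ellipse(xs, ys, level=0.90):
    """Centre, semi-axes and rotation (radians) of a confidence ellipse.

    Assumption: the scores of one group are roughly bivariate normal.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the 2x2 covariance matrix
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # The chi-square quantile with 2 degrees of freedom has a closed form
    scale = -2.0 * math.log(1.0 - level)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axes = (math.sqrt(scale * lam1), math.sqrt(scale * max(lam2, 0.0)))
    return (mx, my), axes, angle
```

With 5 components, `pc_pairs()` yields 10 pairs, which matches the 10 two-dimensional score plots listed among the outputs.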

For verification, one embodiment of the software is named Advanced Spectra PCA Toolbox. The software was uploaded to the website on the application date, and the download address of the Advanced Spectra PCA Toolbox software is https://docs.anaconda.com/anaconda/install/. The program is downloaded according to the installation steps on the web page, and after the download is complete the option to install on Microsoft Windows is selected.

The method for installing the software comprises the following steps:

After the software is installed, the PCA file and the PCA folder are placed in one folder of the software (the application typically places both in the scripts subfolder of the Anaconda folder on the C drive).

The default working path is changed by finding the PCA7.py file and its accompanying PCA folder. For example, suppose the PCA7.py file is stored in the scripts subfolder of the Anaconda folder on the C drive, and the accompanying PCA folder is stored in the PCA subfolder of that scripts folder. Open the PCA7.py file with Notepad and find the default working path code: the value of pcaDir points to the PCA subfolder of the scripts folder in the Anaconda folder on the C drive. If necessary, the path in the PCA7.py file can be adjusted to match the actual storage path.
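A minimal sketch of checking a working path before running the analysis is given below. Only the variable name pcaDir comes from the text; the helper `resolve_pca_dir` is hypothetical and is not part of the PCA7.py file.

```python
from pathlib import Path

def resolve_pca_dir(pca_dir):
    """Return the working directory as a Path, or raise if it is missing."""
    path = Path(pca_dir).expanduser()
    if not path.is_dir():
        # Fail early with a clear message instead of a later read error
        raise FileNotFoundError(f"PCA working folder not found: {path}")
    return path
```

Calling such a check with the value assigned to pcaDir would report a mismatch between the configured path and the actual storage path before any data is read.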

To group the column names of the mass spectrum files, first open the txt data file with Excel, cut the relevant data columns, and paste them into a new Excel file. The mass spectrum data can be grouped by changing the column names; take care that the first column is named Mass (u). In this process, the Arabic number of each group is added in front of the original mass spectrum column name to indicate the group it belongs to.

Reference example one:

(1) The original column names are written in English letters; after regrouping and renaming, an Arabic numeral is placed in front of each name. The new file must then be saved as Text (Tab delimited), that is, as a TXT format file;

(2) The new file is then pasted into the corresponding path: the conventional data folder in the PCA subfolder of the scripts folder in the Anaconda folder on the C drive.
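The column-naming convention above can be sketched in code. This is an illustration under assumptions: the column names "1A", "1B" and "2C" and the helper names are hypothetical, and the actual parsing done by the software is not shown in the text.

```python
import csv
import io
import re
from collections import defaultdict

def read_header(text):
    """Read the header row of tab-delimited text and validate the first column."""
    header = next(csv.reader(io.StringIO(text), delimiter="\t"))
    if header[0] != "Mass (u)":
        raise ValueError("first column must be named 'Mass (u)'")
    return header

def group_columns(header):
    """Map each leading group number to the column names that carry it."""
    groups = defaultdict(list)
    for name in header[1:]:                     # skip the "Mass (u)" column
        match = re.match(r"(\d+)", name)
        if match is None:
            raise ValueError(f"column {name!r} has no leading group number")
        groups[int(match.group(1))].append(name)
    return dict(groups)

# Hypothetical tab-delimited content, for illustration only
sample = "Mass (u)\t1A\t1B\t2C\n12.0\t5\t7\t3\n"
groups = group_columns(read_header(sample))
```

Here the two columns prefixed with 1 fall into one sample group and the column prefixed with 2 into another.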

Reference example two:

(1) Determine each group name in the output plots: find the group name file in the PCA subfolder of the scripts folder in the Anaconda folder on the C drive;

(2) Then open the file holding the group names displayed in the final plots, and rename or enter each group name after its Arabic numeral, the numerals being written with a leading 0.

Step S6: find and open the Anaconda Powershell Prompt.exe or Anaconda.exe program, then run the program as administrator;

Reference example three:

(1) Find the program named Anaconda Powershell Prompt (Anaconda).exe in the start menu and run it as administrator;

(2) Enter the command "cd C:\Anacondas\scripts" and press the Enter key to change to that folder, then enter the command "python pca7.py" and press the Enter key to run it.
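The two commands above amount to running the script with its own folder as the working directory. A hedged Python equivalent is sketched below; the helper `run_script` is hypothetical, and it uses whichever interpreter runs the helper rather than the one found on the prompt's PATH.

```python
import subprocess
import sys
from pathlib import Path

def run_script(script_dir, script_name="pca7.py"):
    """Run the script with the current interpreter, using script_dir as cwd."""
    script_dir = Path(script_dir)
    result = subprocess.run(
        [sys.executable, script_name],
        cwd=script_dir,          # same effect as "cd" before "python pca7.py"
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

A non-zero return code indicates that the script stopped with an error, which is easier to detect this way than by reading the prompt window.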

Step S7: enter the program and view the output results of the PCA.

Reference example four:

Enter the output folder in the PCA subfolder of the scripts folder in the Anaconda folder on the C drive and view the PCA output results, which comprise the following parts:

a. 10 two-dimensional score plots of PC1-PC5 combined in pairs, with confidence intervals;

b. 10 two-dimensional score plots of PC1-PC5 combined in pairs, without confidence intervals;

c. 5 individual one-dimensional score plots of PC1-PC5;

d. a PC1-PC5 score table;

e. 5 PC1-PC5 loading plots;

f. 5 PC1-PC5 top-20 loading tables;

g. a PC1-PC10 "percent explained variance" bar graph;

h. a PC1-PC10 "percent explained variance" table;

i. a PC1-PC5 loading table.
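The "percent explained variance" values in the outputs listed above can be illustrated with a short sketch. The helper `percent_explained` and the variance values are hypothetical; the example only shows the arithmetic of dividing each component's variance by the total.

```python
def percent_explained(variances, n_components=10):
    """Percent of total variance carried by each of the first n components."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    ordered = sorted(variances, reverse=True)   # largest component first
    return [100.0 * v / total for v in ordered[:n_components]]

# Hypothetical per-component variances, for illustration only
percents = percent_explained([4.0, 3.0, 2.0, 1.0], n_components=3)
```

For these values the first three components carry 40%, 30% and 20% of the total variance.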

To change the size of the picture coordinate system or the picture resolution, open the pca7.py file, find the corresponding parameters in lines 16-32, modify the parameter values, then save the file and run it again. The values can be modified according to the confidence limits in the following table; see the table for details.

TABLE 1 confidence limits

Reference example five:

If the number of principal components (PCs) needs to be increased or decreased, open the pca7.py file, find the corresponding content in the file, modify the corresponding parameters, and finally save the pca7.py file and run it again.
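When changing the number of components, it helps to keep in mind how many components the data can support. The following sketch is an assumption-based illustration; the helper `usable_components` does not appear in pca7.py, and the parameter that the file actually exposes is not reproduced in the text.

```python
def usable_components(requested, n_samples, n_features):
    """Clamp a requested PC count to what the data can support."""
    if requested < 1:
        raise ValueError("at least one principal component is required")
    # After mean-centering, at most min(n_samples - 1, n_features)
    # principal components carry any variance
    return min(requested, n_samples - 1, n_features)
```

For example, with only 4 spectra, asking for 5 components leaves at most 3 that carry variance.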

Reference example six:

In the 10 two-dimensional score plots of PC1-PC5 combined in pairs, the 10 two-dimensional score plots of PC1-PC5 combined in pairs without confidence intervals, and the 5 individual one-dimensional score plots of PC1-PC5, if the size and color in the pictures are to be changed, open the pca7.py file, find the highlighted items and modify them, then save the file and run it again.

Reference example seven:

In the 10 two-dimensional score plots of PC1-PC5 combined in pairs with confidence intervals, if the picture proportions, the label font size, font family and font weight, or the line style, line width, color and transparency of the confidence interval are to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example eight:

In the 10 two-dimensional score plots of PC1-PC5 combined in pairs without confidence intervals, if the picture proportions or the label font size, font family and font weight are to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example nine:

In the 5 individual one-dimensional score plots of PC1-PC5, if the picture proportions or the label font size, font family and font weight are to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example ten:

In the 5 individual loading plots of PC1-PC5, if the picture proportions, the number of extracted loadings, the label font size, font family and font weight, or the size, color and text size of the bars in the bar chart are to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example eleven:

In the PC1-PC10 "percent explained variance" bar graph, if the picture proportions, the label font size, font family and font weight, or the size and color of the bars are to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example twelve:

In the 5 individual loading plots of PC1-PC5, if the number of labeled loading peaks needs to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.

Reference example thirteen:

In the 5 individual loading tables of PC1-PC5, if the number of positive and negative loadings in the tables is to be changed, open the pca7.py file, find the highlighted parameters and modify them, then save the file and run it again.
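Selecting a given number of positive and negative loadings for a table can be sketched as follows. This is an illustration only: the helper `top_loadings`, its default counts and the example values are assumptions, and the text does not state how pca7.py splits the entries between positive and negative loadings.

```python
def top_loadings(loadings, n_positive=10, n_negative=10):
    """Split (peak, loading) pairs into the strongest positive and negative ones."""
    ranked = sorted(loadings.items(), key=lambda item: item[1])
    # Most negative values come first in ascending order
    negative = [(m, v) for m, v in ranked if v < 0][:n_negative]
    # Most positive values come first when the order is reversed
    positive = [(m, v) for m, v in reversed(ranked) if v > 0][:n_positive]
    return positive, negative

# Hypothetical loadings keyed by peak mass, for illustration only
example = {12.0: 0.50, 27.0: -0.40, 41.0: 0.10, 55.0: -0.05, 69.0: 0.30}
positive, negative = top_loadings(example, n_positive=2, n_negative=1)
```

Changing `n_positive` and `n_negative` plays the role of the table-size parameters mentioned above.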

The foregoing description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution of the present invention and its inventive concept, shall be covered by the scope of protection of the present invention.

Claims

1. A method for installing PCA software, characterized in that:

the PCA software comprises a Python script and an editable software file, wherein the software packages used in the Python script comprise pandas, NumPy, scikit-learn and Matplotlib; the whole workflow comprises file input and output, interaction with the user, data preprocessing, principal component analysis, drawing of visualization images and calculation of metrics; the input is text data of the original spectra, the text data containing all peak positions and peak intensities; the input files are divided into different groups as required, representing different sample groups; the output principal component analysis results comprise a bar chart of the proportion of variance explained by each principal component, score scatter plots using the principal components as extracted features, and a factor loading plot of the peaks for each principal component; mass spectrum information is automatically read from txt data files in a specified format and the principal components are extracted; the extracted first 5 most important principal components are combined in pairs, serving as the x-axis and y-axis respectively, to draw score scatter plots; and for the extracted first 5 most important principal components, the program calculates the mean center and variance of the values and then draws a 90% confidence interval;

the installation method of the PCA software comprises the following steps:

Step S1: after the PCA software is downloaded, open the installation package and decompress it;

Step S2: locate the unit mass spectrum txt data in the decompressed files of the PCA software and export it;

Step S4: group the column names of the unit mass spectrum txt data files, and after grouping, encode and replace the unit mass spectrum txt data files;

Step S6: find and open the program named Anaconda Powershell Prompt.exe, then run the program as administrator;

Step S7: enter the program and view the output results of the PCA software.