CN114627979A - Method and system for determining biomass material characteristic probability distribution information - Google Patents
- Publication number
- CN114627979A (application CN202210305030.8A)
- Authority
- CN
- China
- Prior art keywords
- probability distribution
- material characteristic
- probability
- data
- biomass
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Crystallography & Structural Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Algebra (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Probability & Statistics with Applications (AREA)
- Operations Research (AREA)
- Evolutionary Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a method and a system for determining biomass material characteristic probability distribution information, and belongs to the field of biomass. The method combines a frequency histogram with the fitting of classical probability density functions to obtain and visualize qualitative and quantitative information on the probability distribution model of material characteristic data.
Description
Technical Field
The invention relates to the field of biomass, in particular to a method and a system for determining biomass material characteristic probability distribution information.
Background
Biomass such as crop straw and livestock and poultry manure is produced in large quantities and is widely distributed, making it one of the green renewable resources with the greatest development potential. However, owing to differences in variety, geographical environment, and other factors, the characteristics of biomass materials show large variability and specificity, which is one of the main obstacles to their high-value, high-efficiency, industrial-scale utilization. Because this heterogeneous material serves as a production raw material, and its characteristics determine the properties, quality, and application prospects of the product, the general pattern of variability and specificity of biomass material characteristics needs to be understood in depth, and their overall distribution trends and features explored.
A probability distribution model describes the distribution trend of the values of a random variable, quantitatively reveals the overall characteristics of the data, and describes the uncertainty of the data. It is one of the important methods of big data mining and analysis and has been applied in fields such as power system simulation, combustion reaction process modeling, hydrological observation, and meteorological data analysis. Differences between probability distribution models also affect the choice of statistical analysis method and the accuracy of the statistical results; an unsuitable model causes the statistical results to deviate from the actual results, so that they are difficult to use as guidance for actual production. Conventional statistical methods assume that the data as a whole are normally distributed, but research shows that the characteristics of biomass materials such as crop straw and livestock and poultry manure, being influenced by various external factors and individual differences, are non-normally distributed. Therefore, obtaining a probability distribution model of biomass material characteristics is of great significance for industrial conversion.
To obtain the probability distribution model of biomass material characteristics effectively, methods and means suited to mining the probability distribution information of biomass material characteristics are needed. Existing methods for mining probability distribution information still have the following problems: first, they mainly fit a few common probability distribution models, so the range of model types is narrow and it is difficult to explore and select the best probability distribution model for biomass material characteristics; second, they provide little statistical information about the probability distribution model, so model information is difficult to obtain on demand; and third, there is currently no method dedicated to exploring the probability distribution information of biomass material characteristics.
Disclosure of Invention
The invention aims to provide a method and a system for determining biomass material characteristic probability distribution information, so as to realize the acquisition and visualization of qualitative and quantitative information of a probability distribution model of characteristics, components and contents of biomass materials.
In order to achieve the purpose, the invention provides the following scheme:
a method for determining probability distribution information of characteristics of biomass materials comprises the following steps:
acquiring a material characteristic value sequence of biomass;
arranging the material characteristic values in the material characteristic value sequence from small to large, and determining a frequency distribution histogram of the ordered material characteristic value sequence;
selecting a plurality of probability density functions of the frequency distribution according with the frequency distribution histogram;
solving parameters of each probability density function by adopting maximum likelihood estimation according to the material characteristic value sequence, and fitting to obtain a plurality of probability distribution models;
performing goodness-of-fit inspection on each probability distribution model, and screening out probability distribution models meeting goodness-of-fit inspection conditions;
calculating the statistical analysis result of each screened probability distribution model; the statistical analysis result comprises expectation, variance, skewness, kurtosis, median and standard deviation;
and standardizing each screened probability distribution model to obtain a probability distribution model curve superposed with the frequency distribution histogram.
Optionally, the obtaining a material characteristic value sequence of the biomass further includes:
calculating the median and the interquartile range of all material characteristic values in the material characteristic value sequence;
removing the material characteristic values outside the range $[x_{50\%} - 3\,\mathrm{IQR},\ x_{50\%} + 3\,\mathrm{IQR}]$ from the material characteristic value sequence, and removing the material characteristic values that are empty;
wherein $x_{50\%}$ represents the median; IQR represents the interquartile range, $\mathrm{IQR} = x_{75\%} - x_{25\%}$; $c = (n-1) \times p + 1$; $x_p$ denotes the value of the quantile $p$, with $p$ being 75%, 50% or 25%; $x_c$ represents the material characteristic value at position $c$; $c$ represents a position; $n$ represents the number of material characteristic values in the material characteristic value sequence; $\mathbb{N}$ represents the set of positive integers; $\{c\}$ represents the fractional part of $c$; and the previous and subsequent material characteristic values at position $c$ are the values at the integer positions immediately below and above $c$.
Optionally, the determining a frequency distribution histogram of the sorted material characteristic value sequence specifically includes:
according to the sorted material characteristic value sequence, determining the initial value, the final value and the number of divisions of the frequency distribution histogram by using the following formulas:
wherein SV represents the start value, EV represents the end value, and $k$ represents the number of divisions; $x_{\min}$ and $x_{\max}$ respectively represent the minimum and maximum material characteristic values in the sorted material characteristic value sequence; IQR represents the interquartile range; $m$ represents the number of material characteristic values in the sorted material characteristic value sequence; $\lfloor\cdot\rfloor$ denotes rounding down and $\lceil\cdot\rceil$ denotes rounding up;
dividing the interval from the starting value to the ending value into k sub-intervals according to the number k of the divisions;
counting the number of the material characteristic values falling into each subinterval in the sequenced material characteristic value sequence, and determining the ratio of the number of the material characteristic values falling into each subinterval to the number m of the material characteristic values in the sequenced material characteristic value sequence as the frequency of each subinterval;
and drawing a frequency distribution histogram by taking the material characteristic value as an abscissa and the frequency as an ordinate.
Optionally, performing the goodness-of-fit test on each probability distribution model and screening out the probability distribution models that satisfy the goodness-of-fit test condition specifically includes:
performing a goodness-of-fit test on each probability distribution model using the K-S test, and screening out the probability distribution models whose goodness-of-fit test value is greater than or equal to a preset threshold.
Optionally, performing the goodness-of-fit test on each probability distribution model and screening out the probability distribution models that satisfy the goodness-of-fit test condition specifically includes:
for each probability distribution model, calculating the cumulative probability value within each subinterval of the frequency histogram using the formula
$$P_i = \int_{m_i}^{m_{i+1}} f(x;\theta)\,dx = F(m_{i+1};\theta) - F(m_i;\theta)$$
wherein $m_i$ and $m_{i+1}$ respectively represent the start value and the end value of the $i$-th interval; $P_i$ represents the cumulative probability value of the interval $[m_i, m_{i+1}]$; $f$ represents the probability density function; $F$ represents the cumulative probability distribution function; $\theta$ represents the parameters of the probability distribution model; $f(x;\theta)$ represents the probability density function with parameters $\theta$; and $F(m_i;\theta)$ and $F(m_{i+1};\theta)$ respectively represent the cumulative probability values of the probability distribution model at $m_i$ and $m_{i+1}$;
and performing a goodness-of-fit test, using the chi-square test, between the cumulative probability value within each subinterval of the frequency histogram and the frequency of that subinterval, and screening out the probability distribution models whose goodness-of-fit test value is greater than or equal to a preset threshold.
Optionally, standardizing each screened probability distribution model to obtain a probability distribution model curve that coincides with the frequency distribution histogram specifically includes:
standardizing each screened probability distribution model using a standardization formula to obtain a probability distribution model curve that coincides with the frequency distribution histogram;
wherein $f_G$ is the standardized probability distribution model curve, $f'$ is the screened probability distribution model curve, SV and EV are respectively the start value and the end value of the frequency distribution histogram, and $k$ is the number of divisions of the frequency distribution histogram.
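The standardization formula itself did not survive in this text. As a minimal sketch only, assuming that standardization means multiplying the fitted density by the histogram bin width (EV - SV)/k so that the curve is on the same scale as the bin frequencies, the step could look as follows. The function name `standardize`, the fitted parameters, and the bin settings are all hypothetical.

```python
from statistics import NormalDist

def standardize(pdf, sv, ev, k):
    """Scale a density so it is comparable with bin frequencies.

    Assumption (not confirmed by the source): f_G(x) = f'(x) * (EV - SV) / k.
    """
    width = (ev - sv) / k
    return lambda x: pdf(x) * width

model = NormalDist(mu=6.5, sigma=0.7)      # hypothetical fitted model
sv, ev, k = 3, 10, 70                      # hypothetical histogram settings
f_g = standardize(model.pdf, sv, ev, k)

# Evaluated at the bin centres, the scaled curve sums to about 1,
# just like the frequencies of the histogram bars.
centers = [sv + (i + 0.5) * (ev - sv) / k for i in range(k)]
total = sum(f_g(c) for c in centers)
```

Under this assumption the curve and the bars share one vertical axis, which is what allows them to be drawn on top of each other.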
A system for determining biomass material characteristic probability distribution information, the system comprising a data processing module and a visualization module;
the data processing module executes the above method for determining biomass material characteristic probability distribution information;
the data processing module is connected with the visualization module; the visualization module is used for importing the material characteristic value sequence of the biomass and displaying a statistical analysis result obtained by the data processing module according to the material characteristic value sequence of the imported biomass and a probability distribution model curve superposed with the frequency distribution histogram.
Optionally, the visualization module includes: a data import unit, a data display unit, a data selection unit, and a model result output unit;
the data import unit is connected with the data display unit; the data import unit is used for importing material characteristic data of biomass and displaying the material characteristic data on the data display unit;
the data selection unit is connected with the data display unit; the data selection unit is used for receiving a data selection condition; the data display unit is used for displaying material characteristic data of the biomass meeting the data selection condition;
the data display unit and the model result output unit are connected with the data processing module; the data processing module is used for obtaining a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram according to the material characteristic data of the biomass meeting the data selection condition, and outputting the statistical analysis result and the probability distribution model curve superposed with the frequency distribution histogram to the model result output unit for visualization.
Optionally, the visualization module further comprises: a function setting module;
the function setting module is connected with the data processing module; the function setting module is used for receiving the selection of the probability density function and transmitting the selected probability density function to the data processing module;
the data processing module is used for solving parameters of the selected probability density function by adopting maximum likelihood estimation according to material characteristic data of the biomass meeting the data selection condition, fitting to obtain a selected probability distribution model and obtaining a fitting goodness test result of the selected probability distribution model; when the goodness-of-fit test result meets the goodness-of-fit test condition, outputting a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram; and when the goodness-of-fit test result does not meet the goodness-of-fit test condition, outputting a fitting error prompt.
Optionally, the data processing module includes:
the data acquisition unit is used for acquiring a material characteristic value sequence of the biomass;
the frequency distribution histogram acquisition unit is used for arranging the material characteristic values in the material characteristic value sequence from small to large and determining a frequency distribution histogram of the ordered material characteristic value sequence;
a probability density function selecting unit for selecting a plurality of probability density functions according with the frequency distribution shown by the frequency distribution histogram;
the model parameter fitting unit is used for solving the parameters of each probability density function by adopting maximum likelihood estimation according to the material characteristic value sequence, and fitting to obtain a plurality of probability distribution models;
the fitting result checking unit is used for carrying out goodness-of-fit checking on each probability distribution model and screening out the probability distribution models meeting goodness-of-fit checking conditions;
a statistical result calculation unit for calculating a statistical analysis result of each screened probability distribution model; the statistical analysis result comprises expectation, variance, skewness, kurtosis, median and standard deviation;
and the curve standardization unit is used for standardizing each screened probability distribution model to obtain a probability distribution model curve superposed with the frequency distribution histogram.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a method for determining biomass material characteristic probability distribution information. The method first determines a frequency distribution histogram of a material characteristic value sequence, then selects a plurality of probability density functions that match the frequency distribution shown by the histogram, fits a plurality of probability distribution models, screens out the probability distribution models that pass the goodness-of-fit test, and finally calculates a statistical analysis result for each screened probability distribution model and obtains a probability distribution model curve. By combining a frequency histogram with the fitting of classical probability density functions, the method obtains qualitative and quantitative information on the probability distribution model of material characteristic data.
In the system for determining biomass material characteristic probability distribution information disclosed by the invention, the visualization module can import the material characteristic value sequence of biomass and display the statistical analysis result together with the probability distribution model curve that coincides with the frequency distribution histogram, thereby visualizing qualitative and quantitative information on the probability distribution model of the characteristics, components, and contents of biomass materials.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of the method for determining biomass material characteristic probability distribution information provided by the present invention;
FIG. 2 is a relational diagram of the method for determining biomass material characteristic probability distribution information provided by the present invention;
FIG. 3 is a schematic diagram showing the probability distribution information obtained for the ash content of wheat straw in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the interface of the visualization module provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for determining biomass material characteristic probability distribution information, so as to realize the acquisition and visualization of qualitative and quantitative information of a probability distribution model of characteristics, components and contents of biomass materials.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a method for determining biomass material characteristic probability distribution information, which comprises the following steps:
Step 1: acquiring a material characteristic value sequence of biomass. Characteristic data (such as chemical composition, proximate (industrial) analysis, elemental composition, mineral elements, and heating value) of biomass materials such as crop straw and livestock and poultry manure are obtained using standard test methods; a sample size of more than 30 is recommended.
Illustratively, step 1 is followed by data preprocessing: when material characteristic data are collected, sorted, and analyzed, some characteristic values turn out to be missing or abnormal, and these values need to be removed.
The median and the interquartile range of all material characteristic values in the material characteristic value sequence are calculated; material characteristic values outside the range $[x_{50\%} - 3\,\mathrm{IQR},\ x_{50\%} + 3\,\mathrm{IQR}]$ are removed from the sequence, as are material characteristic values that are empty.
Median and interquartile range calculation: the data are arranged from smallest to largest, the positions of the 25%, 50% and 75% quantiles in the data are calculated, and the data at the corresponding positions are found. Let $x$ be the data arranged from smallest to largest; the median (the 50% quantile) and the interquartile range can then be calculated by the following formulas:
$$c = (n-1) \times p + 1$$
$$\mathrm{IQR} = x_{75\%} - x_{25\%}$$
wherein $x_{50\%}$ represents the median; IQR represents the interquartile range; $x_p$ denotes the value of the quantile $p$, with $p$ being 75%, 50% or 25%; $x_c$ represents the material characteristic value at position $c$; $c$ represents a position; $n$ represents the number of material characteristic values in the material characteristic value sequence; $\mathbb{N}$ represents the set of positive integers; $\{c\}$ represents the fractional part of $c$; and the previous and subsequent material characteristic values at position $c$ are the values at the integer positions immediately below and above $c$.
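The preprocessing described above can be sketched as follows. The quantile position c = (n - 1) × p + 1 and the 3 × IQR rule come from the text; the interpolation formula for a non-integer c is not legible in the source, so linear interpolation between the two neighbouring values, weighted by the fractional part {c}, is an assumption here. The function names and the sample values are hypothetical.

```python
import math

def quantile(sorted_x, p):
    """Value at quantile p, using the 1-based position c = (n - 1) * p + 1."""
    n = len(sorted_x)
    c = (n - 1) * p + 1
    lo = math.floor(c)
    frac = c - lo                      # {c}, the fractional part of c
    if frac == 0:
        return sorted_x[lo - 1]        # c is an integer position
    # Assumption: linear interpolation between the neighbours of position c
    return sorted_x[lo - 1] + frac * (sorted_x[lo] - sorted_x[lo - 1])

def clean(values):
    """Drop empty values, then values outside [median - 3*IQR, median + 3*IQR]."""
    x = sorted(v for v in values if v is not None and not math.isnan(v))
    median = quantile(x, 0.50)
    iqr = quantile(x, 0.75) - quantile(x, 0.25)
    return [v for v in x if median - 3 * iqr <= v <= median + 3 * iqr]

# Made-up ash contents (%), with one missing and one abnormal value
ash = [5.1, 6.3, None, 7.0, 6.8, 5.9, 6.1, 25.0, 6.5, 7.2]
cleaned = clean(ash)                   # the None and the 25.0 are removed
```

The returned list is already sorted, which is the form the histogram step expects.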
Step 2: arranging the material characteristic values in the material characteristic value sequence from smallest to largest, and determining a frequency distribution histogram of the sorted material characteristic value sequence.
The cleaned data are sorted again from smallest to largest, a frequency distribution histogram of the data is obtained, and the parameters of the frequency histogram are determined: the start value, the end value, and the number of divisions. The start value and the end value are calculated in one of two ways: (1) when the maximum value of the data is greater than or equal to 3, the minimum value is rounded down to give the start value and the maximum value is rounded up to give the end value; (2) when the maximum value of the data is less than 3, the minimum value is rounded down to 2 significant digits to give the start value and the maximum value is rounded up to 2 significant digits to give the end value. The number of divisions is calculated from the start value, the end value, the interquartile range, and the number of data points, and must be at least 5.
Illustratively, the determining step of the frequency distribution histogram is:
2-1, determining the initial value, the final value and the number of divisions of the frequency distribution histogram by using the following formulas according to the sorted material characteristic value sequence:
wherein SV represents the start value, EV represents the end value, and $k$ represents the number of divisions; $x_{\min}$ and $x_{\max}$ respectively represent the minimum and maximum material characteristic values in the sorted material characteristic value sequence; IQR represents the interquartile range; $m$ represents the number of material characteristic values in the sorted material characteristic value sequence; $\lfloor\cdot\rfloor$ denotes rounding down and $\lceil\cdot\rceil$ denotes rounding up;
2-2, dividing the interval from the starting value to the ending value into k sub-intervals according to the number k of the divisions;
2-3, counting the number of the material characteristic values falling into each subinterval in the sequenced material characteristic value sequence, and determining the ratio of the number of the material characteristic values falling into each subinterval to the number m of the material characteristic values in the sequenced material characteristic value sequence as the frequency of each subinterval;
and 2-4, drawing a frequency distribution histogram by taking the material characteristic value as an abscissa and the frequency as an ordinate.
The obtained frequency histogram is used for qualitatively analyzing and displaying the measured data distribution, qualitatively obtaining the distribution characteristics and the distribution types of the data, and counting and analyzing the data. Furthermore, the probability distribution model can be examined by means of frequency histograms.
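Steps 2-1 to 2-3 can be sketched as below. The two rounding rules for the start and end values and the minimum of 5 divisions follow the description. The formula for k is not legible in the source; this sketch assumes a Freedman-Diaconis style bin width of 2 × IQR / m^(1/3), which uses the same four inputs the text lists, but that choice is an assumption. The names `round_sig` and `frequency_histogram` are hypothetical.

```python
import math

def round_sig(value, digits, rounder):
    """Round to `digits` significant digits with math.floor or math.ceil."""
    if value == 0:
        return 0.0
    scale = 10 ** (digits - 1 - math.floor(math.log10(abs(value))))
    return rounder(value * scale) / scale

def frequency_histogram(sorted_x, iqr):
    m = len(sorted_x)
    x_min, x_max = sorted_x[0], sorted_x[-1]
    if x_max >= 3:
        sv, ev = math.floor(x_min), math.ceil(x_max)
    else:
        sv = round_sig(x_min, 2, math.floor)
        ev = round_sig(x_max, 2, math.ceil)
    # Assumption: Freedman-Diaconis bin width; the text only requires k >= 5
    width = 2 * iqr / m ** (1 / 3)
    k = max(5, math.ceil((ev - sv) / width)) if width > 0 else 5
    step = (ev - sv) / k
    counts = [0] * k
    for v in sorted_x:
        i = min(int((v - sv) / step), k - 1)   # a value equal to EV goes in the last bin
        counts[i] += 1
    freqs = [c / m for c in counts]            # frequency = count / m
    edges = [sv + i * step for i in range(k + 1)]
    return sv, ev, k, edges, freqs

x = [5.1, 5.9, 6.1, 6.3, 6.5, 6.8, 7.0, 7.2]   # made-up, already cleaned and sorted
sv, ev, k, edges, freqs = frequency_histogram(x, iqr=0.9)
```

Plotting `freqs` against the bin edges (step 2-4) is left to whichever plotting tool is in use.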
Step 3: selecting a plurality of probability density functions that match the frequency distribution shown by the frequency distribution histogram.
Different probability density function forms are selected based on the frequency histogram results (the approximate shape formed by each rectangle in the frequency histogram), and the selected probability density function forms approximate the approximate shape of the frequency histogram.
Step 4: solving the parameters of each probability density function by maximum likelihood estimation according to the material characteristic value sequence, and fitting to obtain a plurality of probability distribution models.
Taking normal distribution as an example, the material characteristic value sequence is taken as an input value, and normal distribution is selected for fitting.
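The normal distribution has a location parameter $\mu$ and a scale parameter $\sigma$. The values of $\mu$ and $\sigma^2$ are obtained by maximum likelihood estimation (MLE). For $m$ material characteristic values $x_1, \dots, x_m$, the likelihood function is:
$$L(\mu, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Taking the logarithm of the likelihood function, differentiating it, and setting the derivatives to 0 gives:
$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{m} (x_i - \mu) = 0, \qquad \frac{\partial \ln L}{\partial \sigma^2} = -\frac{m}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{m} (x_i - \mu)^2 = 0$$
Solving these equations yields:
$$\hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu})^2$$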
When the number of selected probability distribution models is large, parallel computation can be used to improve efficiency.
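For the normal distribution, the maximum likelihood estimates have a closed form: the sample mean, and the mean squared deviation divided by m rather than m - 1. The following is a minimal sketch of that case only; other distribution families generally need numerical optimization. The function names and the data are hypothetical.

```python
import math

def fit_normal(x):
    """Closed-form maximum likelihood estimates of mu and sigma^2."""
    m = len(x)
    mu = sum(x) / m
    var = sum((v - mu) ** 2 for v in x) / m    # MLE divides by m, not m - 1
    return mu, var

def normal_loglik(x, mu, var):
    """Log-likelihood of a normal model, useful for comparing candidate fits."""
    m = len(x)
    return (-0.5 * m * math.log(2 * math.pi * var)
            - sum((v - mu) ** 2 for v in x) / (2 * var))

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up values
mu, var = fit_normal(x)                         # mu = 5.0, var = 4.0
best = normal_loglik(x, mu, var)
```

Any other parameter pair gives a lower log-likelihood than `best`, which is the defining property of the estimate.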
Step 5, performing a goodness-of-fit test on each probability distribution model, and screening out the probability distribution models that meet the goodness-of-fit test condition.
After the parameter fitting of the probability distribution models is completed, a goodness-of-fit test is performed on each fitting result in turn. Common test methods include the Kolmogorov-Smirnov test (K-S test), the chi-square test, and the like.
When the K-S test is adopted, the goodness-of-fit test is performed on each probability distribution model, and the probability distribution models whose goodness-of-fit test value is greater than or equal to a preset threshold are screened out.
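The K-S screening can be sketched as follows, using only the standard library. The K-S statistic is the largest gap between the empirical distribution of the data and the model's cumulative distribution function; the p-value here uses the asymptotic Kolmogorov series, which is an approximation for large samples. The function names, the sample values, and the threshold of 0.05 are assumptions for illustration; the document only speaks of a preset threshold.

```python
import math
from statistics import NormalDist
from typing import Callable, Sequence


def ks_statistic(xs: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Largest distance between the empirical CDF of xs and the model CDF."""
    ordered = sorted(xs)
    m = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        fx = cdf(x)
        # The empirical CDF jumps from (i - 1)/m to i/m at x.
        d = max(d, i / m - fx, fx - (i - 1) / m)
    return d


def ks_pvalue_asymptotic(d: float, m: int, terms: int = 100) -> float:
    """Asymptotic Kolmogorov p-value (approximation, intended for large m)."""
    lam = math.sqrt(m) * d
    if lam < 0.05:
        return 1.0  # the series converges slowly here; the limit value is 1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def passes_ks(xs: Sequence[float], cdf: Callable[[float], float],
              threshold: float = 0.05) -> bool:
    """Keep the model if its test value is >= the preset threshold."""
    return ks_pvalue_asymptotic(ks_statistic(xs, cdf), len(xs)) >= threshold


# Made-up values for illustration only (not data from the document).
sample = [8.1, 9.4, 7.7, 10.2, 9.0, 11.3, 8.6]
keep_model = passes_ks(sample, NormalDist(mu=9.2, sigma=1.2).cdf)
```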
When the chi-square test is adopted, the cumulative probability value within each subinterval of the frequency histogram is calculated for each probability distribution model using the formula

$$P_i=\int_{m_i}^{m_{i+1}} f(x;\theta)\,dx=F(m_{i+1};\theta)-F(m_i;\theta)$$

The chi-square test is then used to perform a goodness-of-fit test between the cumulative probability value within each subinterval of the frequency histogram and the frequency of that subinterval, and the probability distribution models whose goodness-of-fit test value is greater than or equal to a preset threshold are screened out. Here $m_i$ and $m_{i+1}$ respectively represent the start value and the end value of the i-th interval, $P_i$ represents the cumulative probability value of the interval $[m_i, m_{i+1}]$, f represents the probability density function, F represents the cumulative probability distribution function, θ represents the parameters of the probability distribution model, $f(x;\theta)$ represents the probability density function with parameters θ, and $F(m_i;\theta)$ and $F(m_{i+1};\theta)$ respectively represent the cumulative probability values of the probability distribution model at $m_i$ and $m_{i+1}$;
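A minimal sketch of the chi-square screening is given below, using only the standard library. The probability of each subinterval is taken as the difference of the model's cumulative distribution function at the subinterval's end points. The degrees of freedom (number of subintervals minus 1 minus the number of fitted parameters) and the function names are assumptions; the document does not state them. The sketch also assumes that every subinterval has a positive expected count.

```python
import math
from statistics import NormalDist
from typing import Callable, List, Sequence, Tuple


def interval_probabilities(edges: Sequence[float],
                           cdf: Callable[[float], float]) -> List[float]:
    """P_i = F(m_{i+1}) - F(m_i) for consecutive histogram edges."""
    return [cdf(b) - cdf(a) for a, b in zip(edges[:-1], edges[1:])]


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be at least 1")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form start values, then the recurrence
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += math.exp(-y) * y ** a / math.gamma(a + 1.0)
        a += 1.0
    return min(1.0, q)


def chi_square_gof(freqs: Sequence[float], edges: Sequence[float],
                   cdf: Callable[[float], float], m: int,
                   n_params: int) -> Tuple[float, float]:
    """Return (statistic, p_value) comparing observed and expected counts."""
    probs = interval_probabilities(edges, cdf)
    stat = sum((m * f - m * p) ** 2 / (m * p) for f, p in zip(freqs, probs))
    df = len(freqs) - 1 - n_params  # assumed convention, see lead-in
    return stat, chi2_sf(stat, df)


# Illustrative call: frequencies that match a standard normal model exactly.
model_cdf = NormalDist().cdf
demo_edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo_freqs = interval_probabilities(demo_edges, model_cdf)
statistic, p_value = chi_square_gof(demo_freqs, demo_edges, model_cdf,
                                    m=100, n_params=2)
```

A model would then be kept when `p_value` is greater than or equal to the preset threshold.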
Step 6, calculating the statistical analysis result of each screened probability distribution model; the statistical analysis result includes expectation, variance, skewness, kurtosis, median and standard deviation.
The formulas for expectation, variance, skewness, kurtosis and standard deviation depend on the particular probability distribution model. The median is the value of the independent variable at which the cumulative distribution curve equals 50%, and can be calculated with the percent point function (the inverse of the cumulative distribution function) of the probability distribution model. The cumulative distribution function completely describes the probability distribution of a real random variable X and is the integral of the probability density function.
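As an illustration of model-specific statistics, the sketch below evaluates the closed-form results for a right-skewed Gumbel distribution and obtains the median from the percent point function. It assumes that the pair (8.14, 1.96) reported for the right-skewed Gumbel function in the wheat straw example is a location and a scale parameter, and it reports excess kurtosis; both are assumptions. The function names are hypothetical.

```python
import math
from typing import Dict

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant
APERY = 1.2020569031595942         # zeta(3)


def gumbel_r_cdf(x: float, loc: float, scale: float) -> float:
    """Cumulative distribution function of the right-skewed Gumbel model."""
    return math.exp(-math.exp(-(x - loc) / scale))


def gumbel_r_ppf(q: float, loc: float, scale: float) -> float:
    """Percent point function: the x at which the CDF equals q."""
    return loc - scale * math.log(-math.log(q))


def gumbel_r_summary(loc: float, scale: float) -> Dict[str, float]:
    """Closed-form statistics of the right-skewed Gumbel distribution."""
    variance = (math.pi * scale) ** 2 / 6.0
    return {
        "expectation": loc + EULER_GAMMA * scale,
        "variance": variance,
        "standard deviation": math.sqrt(variance),
        "skewness": 12.0 * math.sqrt(6.0) * APERY / math.pi ** 3,
        "kurtosis": 12.0 / 5.0,  # excess kurtosis (assumed convention)
        "median": gumbel_r_ppf(0.5, loc, scale),
    }


summary = gumbel_r_summary(loc=8.14, scale=1.96)
```

Skewness and kurtosis of this model do not depend on the fitted parameters, whereas expectation, variance and median do.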
Step 7, standardizing each screened probability distribution model to obtain a probability distribution model curve superposed on the frequency distribution histogram.
Because the curve of a screened probability distribution model and the frequency distribution histogram do not coincide when drawn in the same coordinate system, the screened probability distribution model needs to be scaled, i.e., standardized. Each screened probability distribution model is standardized using the formula

$$f_G = f' \times \frac{EV - SV}{k}$$

to obtain a probability distribution model curve that coincides with the frequency distribution histogram; where $f_G$ is the standardized probability distribution model curve, $f'$ is the screened probability distribution model curve, SV and EV are respectively the initial value and the final value of the frequency distribution histogram, and k is the number of divisions of the frequency distribution histogram.
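The scaling can be sketched as follows, assuming that standardization multiplies the density by the subinterval width (EV − SV)/k, which is the factor that makes a density comparable with per-subinterval frequencies. The function name is hypothetical; the histogram values are those of Table 1 and the Normal parameters are those reported for Fig. 3.

```python
from statistics import NormalDist
from typing import Callable, List, Sequence


def standardize_curve(pdf: Callable[[float], float], sv: float, ev: float,
                      k: int, xs: Sequence[float]) -> List[float]:
    """Scale density values by the bin width so they match bar heights."""
    bin_width = (ev - sv) / k
    return [pdf(x) * bin_width for x in xs]


sv, ev, k = 3, 17, 17                      # initial value, end value, divisions
model = NormalDist(mu=9.17, sigma=2.10)    # Normal parameters from the example
width = (ev - sv) / k
midpoints = [sv + (i + 0.5) * width for i in range(k)]
scaled = standardize_curve(model.pdf, sv, ev, k, midpoints)
```

Evaluated at the subinterval midpoints, the scaled values add up to approximately the probability mass of the model between SV and EV, just as the histogram frequencies add up to 1.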
According to the invention, firstly, characteristic information of biomass materials, such as physical indexes, chemical components or contents, and biological indexes, is obtained by standard methods from a large and representative set of samples. Secondly, null values and abnormal values contained in the data are removed. Thirdly, the data are arranged in ascending order, characteristic parameters are calculated from the data, the data are divided into several groups according to these parameters, and the frequency of each group is obtained to produce a frequency histogram. Fourthly, classical probability density functions are fitted with reference to the frequency information of each group to obtain probability distribution models of the material characteristic. Fifthly, methods such as the Kolmogorov-Smirnov test (K-S test) and the chi-square test are used to test the probability distribution model results against the frequency histogram. Finally, model information such as the statistical results and visualization of the probability distribution model is selected and obtained.
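The second step above (removing null values and abnormal values) can be sketched with the range [x50% − 3·IQR, x50% + 3·IQR] and the position rule c = (n − 1) × p + 1 stated in claim 2. Linear interpolation between the neighbouring values when c is not an integer is an assumption consistent with the fractional part {c} mentioned there. The function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def quantile(sorted_xs: Sequence[float], p: float) -> float:
    """Quantile at 1-based position c = (n - 1) * p + 1 of a sorted sequence."""
    n = len(sorted_xs)
    c = (n - 1) * p + 1
    lower = math.floor(c)
    frac = c - lower  # fractional part {c}
    if frac == 0:
        return sorted_xs[lower - 1]
    prev_value, next_value = sorted_xs[lower - 1], sorted_xs[lower]
    return prev_value + frac * (next_value - prev_value)


def remove_null_and_outliers(values: Sequence[Optional[float]]) -> List[float]:
    """Drop empty values, then values outside median +/- 3 * IQR."""
    xs = sorted(v for v in values if v is not None and not math.isnan(v))
    if not xs:
        return []
    median = quantile(xs, 0.50)
    iqr = quantile(xs, 0.75) - quantile(xs, 0.25)
    low, high = median - 3 * iqr, median + 3 * iqr
    return [x for x in xs if low <= x <= high]


# Made-up values for illustration only (not data from the document).
cleaned = remove_null_and_outliers([1, 2, 3, 4, 5, 100, None])
```

The function returns the remaining values in ascending order, which is the order required by the subsequent histogram step.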
Next, wheat straw is selected as the crop biomass to be studied, with its ash content as the biomass material characteristic, and model information such as the statistical results and visualization of the wheat straw ash content is analyzed.
Taking the ash content of wheat straw as an example, 778 wheat straw samples of different varieties and growth periods were obtained from different regions, and the ash content of the wheat straw was measured by the ASTM E1755-01(2007) method.
After data preprocessing, 758 wheat straw ash content values remain.
The characteristic parameters of the wheat straw ash data are shown in Table 1, and the frequency histogram results are shown in Fig. 3.
TABLE 1 Wheat straw ash data characteristic parameters
Minimum value | Maximum value | Number of samples | Interquartile range | Initial value | End value | Number of divisions |
---|---|---|---|---|---|---|
3.32 | 16.30 | 758 | 2.92 | 3 | 17 | 17 |
The wheat straw ash content data are used as the input values and the probability distribution models are solved; the resulting wheat straw ash probability distribution information is shown in Fig. 3. Three probability density functions are shown in Fig. 3: the Logistic function (9.10, 1.21), the Normal function (9.17, 2.10), and the right-skewed Gumbel function (8.14, 1.96). In each pair, the value before the comma is the location parameter and the value after the comma is the scale parameter; p denotes the goodness-of-fit test value, and Ash denotes the ash content.
The invention combines the frequency histogram with classical probability density function fitting to obtain qualitative and quantitative information on the probability distribution model of material characteristic data, and implements the method as a program. This simplifies the calculation of probability distribution models and the related data management for large sample sizes. The program contains more than 100 probability distribution models for fitting and selection, and can efficiently, quickly and accurately obtain and visually display probability distribution model information for biomass material characteristics.
The invention also provides a system for determining the probability distribution information of biomass material characteristics, which comprises a data processing module and a visualization module.
The data processing module adopts the method for determining the biomass material characteristic probability distribution information of any one of claims 1 to 6. The data processing module is connected with the visualization module; the visualization module is used for importing the material characteristic value sequence of the biomass and displaying a statistical analysis result obtained by the data processing module according to the material characteristic value sequence of the imported biomass and a probability distribution model curve superposed with the frequency distribution histogram.
Referring to Fig. 4, the visualization module includes a data import unit, a data display unit, a data selection unit and a model result output unit.
The data import unit is connected with the data display unit; the data import unit is used for importing material characteristic data of the biomass and displaying the material characteristic data on the data display unit. The data selection unit is connected with the data display unit; the data selection unit is used for receiving a data selection condition; the data display unit is used for displaying the material characteristic data of the biomass meeting the data selection condition. The data display unit and the model result output unit are connected with the data processing module; the data processing module is used for obtaining a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram according to the material characteristic data of the biomass meeting the data selection condition, and outputting the statistical analysis result and the probability distribution model curve superposed with the frequency distribution histogram to the model result output unit for visualization.
The data presentation unit can display the imported data and the fitting result data. The data selection unit may select a table, a column in a table, and a start row and an end row.
Illustratively, the visualization module further comprises a function setting module. The function setting module is connected with the data processing module; the function setting module is used for receiving the selection of a probability density function and transmitting the selected probability density function to the data processing module. The data processing module is used for solving the parameters of the selected probability density function by maximum likelihood estimation according to the material characteristic data of the biomass meeting the data selection condition, fitting to obtain the selected probability distribution model, and obtaining a goodness-of-fit test result for the selected probability distribution model; when the goodness-of-fit test result meets the goodness-of-fit test condition, outputting a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram; and when the goodness-of-fit test result does not meet the goodness-of-fit test condition, outputting a fitting error prompt.
The function setting module lists all probability density functions for selection by the user.
Illustratively, the visualization module further comprises a parameter setting unit. The initial value, the final value and the number of divisions of the frequency distribution histogram, as well as the goodness-of-fit test value p obtained during operation of the data processing module, can be displayed on the parameter setting unit.
Illustratively, the data processing module includes a data acquisition unit, a frequency distribution histogram acquisition unit, a probability density function selection unit, a model parameter fitting unit, a fitting result inspection unit, a statistical result calculation unit and a curve standardization unit.
The data acquisition unit acquires a material characteristic value sequence of the biomass. The frequency distribution histogram acquisition unit arranges the material characteristic values in the material characteristic value sequence from small to large, and determines the frequency distribution histogram of the ordered material characteristic value sequence. The probability density function selecting unit selects a plurality of probability density functions which accord with the frequency distribution shown by the frequency distribution histogram. And the model parameter fitting unit adopts maximum likelihood estimation to solve the parameters of each probability density function according to the material characteristic value sequence, and a plurality of probability distribution models are obtained through fitting. And the fitting result inspection unit performs goodness-of-fit inspection on each probability distribution model and screens out the probability distribution models meeting goodness-of-fit inspection conditions. The statistical result calculating unit calculates the statistical analysis result of each screened probability distribution model; statistical analysis results include expectations, variance, skewness, kurtosis, median, and standard deviation. And the curve standardization unit standardizes each screened probability distribution model to obtain a probability distribution model curve superposed with the frequency distribution histogram.
The program of the invention is written in Python 3.8, and its dependencies include numpy (>= 1.19.5), scipy (>= 1.5.4), wxPython (>= 4.1.1), xlwings (>= 0.25.2) and the like. The interface design comprises five parts: menu bar, data display, parameter selection and operation, model result output, and status bar. The main interface is shown in Fig. 4.
Compared with existing calculation programs or software containing probability density models, such as the mathematical analysis software Origin, the invention has the following advantages:
1. The invention covers 100 probability density function models, a wide range, whereas other software currently offers only about 10 common probability density function models; model information can therefore be obtained more quickly, accurately and comprehensively.
2. The invention provides various goodness-of-fit inspection methods, which can ensure the authenticity and accuracy of probability distribution model information from multiple angles.
3. The method can directly obtain the statistical results (such as expectation, variance, skewness, kurtosis, median, standard deviation and the like), model parameters, probability distribution model curves (such as probability density function curves and cumulative probability distribution curves) and other information of the material characteristic probability distribution model.
4. The invention develops a program for acquiring probability distribution model information that has a good user interface, refines the model analysis process, provides flexible parameter and model selection, includes a mechanism for preventing misoperation, and improves the efficiency of model acquisition.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A method for determining probability distribution information of characteristics of biomass materials is characterized by comprising the following steps:
acquiring a material characteristic value sequence of biomass;
arranging the material characteristic values in the material characteristic value sequence from small to large, and determining a frequency distribution histogram of the ordered material characteristic value sequence;
selecting a plurality of probability density functions that conform to the frequency distribution shown by the frequency distribution histogram;
solving parameters of each probability density function by adopting maximum likelihood estimation according to the material characteristic value sequence, and fitting to obtain a plurality of probability distribution models;
performing goodness-of-fit inspection on each probability distribution model, and screening out probability distribution models meeting goodness-of-fit inspection conditions;
calculating the statistical analysis result of each screened probability distribution model; the statistical analysis result comprises expectation, variance, skewness, kurtosis, median and standard deviation;
and standardizing each screened probability distribution model to obtain a probability distribution model curve superposed with the frequency distribution histogram.
2. The method for determining the probability distribution information of the biomass material characteristics according to claim 1, wherein the step of obtaining the material characteristic value sequence of the biomass further comprises the following steps:
calculating the median and the interquartile range of all material characteristic values in the material characteristic value sequence;
eliminating material characteristic values in the material characteristic value sequence that fall outside the range $[x_{50\%}-3\,\mathrm{IQR},\ x_{50\%}+3\,\mathrm{IQR}]$, and eliminating material characteristic values whose numerical value is empty;
wherein $x_{50\%}$ represents the median, IQR represents the interquartile range, $\mathrm{IQR}=x_{75\%}-x_{25\%}$, and the quantiles are determined by

$$c=(n-1)\times p+1,\qquad x_p=\begin{cases}x_c, & c\in N\\ x_{\lfloor c\rfloor}+\{c\}\left(x_{\lceil c\rceil}-x_{\lfloor c\rfloor}\right), & c\notin N\end{cases}$$

where $x_p$ denotes the value of the quantile p, p being 75%, 50% or 25%, $x_c$ represents the material characteristic value at position c, c represents the position, n represents the number of material characteristic values in the material characteristic value sequence, N represents the positive integers, $\{c\}$ represents the fractional part of c, and $x_{\lfloor c\rfloor}$ and $x_{\lceil c\rceil}$ respectively represent the previous material characteristic value and the subsequent material characteristic value at position c.
3. The method for determining the probability distribution information of the biomass material characteristics according to claim 1, wherein the determining the frequency distribution histogram of the sorted material characteristic value sequence specifically includes:
according to the sorted material characteristic value sequence, determining the initial value, the final value and the number of divisions of the frequency distribution histogram by using the following formulas:
wherein SV represents the start value, EV represents the end value, k represents the number of divisions, $x_{min}$ and $x_{max}$ respectively represent the minimum and maximum material characteristic values in the sorted material characteristic value sequence, IQR represents the interquartile range, m represents the number of material characteristic values in the sorted material characteristic value sequence, $\lfloor\cdot\rfloor$ represents rounding down, and $\lceil\cdot\rceil$ represents rounding up;
dividing the interval from the starting value to the ending value into k sub-intervals according to the number k of the divisions;
counting the number of the material characteristic values falling into each subinterval in the sorted material characteristic value sequence, and determining the proportion of the number of the material characteristic values falling into each subinterval to the number m of the material characteristic values in the sorted material characteristic value sequence as the frequency of each subinterval;
and drawing a frequency distribution histogram by taking the material characteristic value as an abscissa and the frequency as an ordinate.
4. The method for determining the biomass material characteristic probability distribution information according to claim 1, wherein the goodness-of-fit test is performed on each probability distribution model, and the probability distribution model satisfying the goodness-of-fit test condition is screened out, specifically comprising:
and performing goodness-of-fit inspection on each probability distribution model by adopting K-S inspection, and screening out the probability distribution models with goodness-of-fit inspection values larger than or equal to a preset threshold value.
5. The method for determining the biomass material characteristic probability distribution information according to claim 1, wherein the goodness-of-fit test is performed on each probability distribution model, and the probability distribution model satisfying the goodness-of-fit test condition is screened out, specifically comprising:
calculating, according to each probability distribution model, the cumulative probability value within each subinterval of the frequency histogram using the formula

$$P_i=\int_{m_i}^{m_{i+1}} f(x;\theta)\,dx=F(m_{i+1};\theta)-F(m_i;\theta)$$

wherein $m_i$ and $m_{i+1}$ respectively represent the start value and the end value of the i-th interval, $P_i$ represents the cumulative probability value of the interval $[m_i, m_{i+1}]$, f represents the probability density function, F represents the cumulative probability distribution function, θ represents the parameters of the probability distribution model, $f(x;\theta)$ represents the probability density function with parameters θ, and $F(m_i;\theta)$ and $F(m_{i+1};\theta)$ respectively represent the cumulative probability values of the probability distribution model at $m_i$ and $m_{i+1}$;
and performing goodness-of-fit test on the cumulative probability value in each subinterval in the frequency histogram and the frequency in each subinterval in the frequency histogram by adopting a chi-square test, and screening out a probability distribution model with the goodness-of-fit test value being greater than or equal to a preset threshold value.
6. The method for determining the probability distribution information of the characteristics of the biomass material according to claim 1, wherein the step of standardizing each screened probability distribution model to obtain a probability distribution model curve coinciding with a frequency distribution histogram specifically comprises the steps of:
standardizing each screened probability distribution model using the formula

$$f_G = f' \times \frac{EV - SV}{k}$$

to obtain a probability distribution model curve superposed with the frequency distribution histogram;
wherein $f_G$ is the standardized probability distribution model curve, $f'$ is the screened probability distribution model curve, SV and EV are respectively the initial value and the final value of the frequency distribution histogram, and k is the number of divisions of the frequency distribution histogram.
7. A system for determining probability distribution information of characteristics of a biomass material, the system comprising: a data processing module and a visualization module;
the data processing module adopts a determination method of biomass material characteristic probability distribution information as set forth in any one of claims 1-6;
the data processing module is connected with the visualization module; the visualization module is used for importing the material characteristic value sequence of the biomass and displaying a statistical analysis result obtained by the data processing module according to the material characteristic value sequence of the imported biomass and a probability distribution model curve superposed with the frequency distribution histogram.
8. The system for determining probability distribution information of biomass material characteristics according to claim 7, wherein the visualization module comprises: a data import unit, a data display unit, a data selection unit and a model result output unit;
the data import unit is connected with the data display unit; the data import unit is used for importing material characteristic data of biomass and displaying the material characteristic data on the data display unit;
the data selection unit is connected with the data display unit; the data selection unit is used for receiving a data selection condition; the data display unit is used for displaying material characteristic data of the biomass meeting the data selection condition;
the data display unit and the model result output unit are connected with the data processing module; the data processing module is used for obtaining a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram according to the material characteristic data of the biomass meeting the data selection condition, and outputting the statistical analysis result and the probability distribution model curve superposed with the frequency distribution histogram to the model result output unit for visualization.
9. The system for determining probability distribution information of biomass material characteristics according to claim 8, wherein the visualization module further comprises: a function setting module;
the function setting module is connected with the data processing module; the function setting module is used for receiving the selection of the probability density function and transmitting the selected probability density function to the data processing module;
the data processing module is used for solving parameters of the selected probability density function by adopting maximum likelihood estimation according to material characteristic data of the biomass meeting the data selection condition, fitting to obtain a selected probability distribution model and obtaining a fitting goodness test result of the selected probability distribution model; when the goodness-of-fit test result meets the goodness-of-fit test condition, outputting a statistical analysis result and a probability distribution model curve superposed with the frequency distribution histogram; and when the goodness-of-fit test result does not meet the goodness-of-fit test condition, outputting a fitting error prompt.
10. The system for determining probability distribution information of biomass material characteristics according to claim 7, wherein the data processing module comprises:
the data acquisition unit is used for acquiring a material characteristic value sequence of the biomass;
a frequency distribution histogram obtaining unit, configured to arrange the material characteristic values in the material characteristic value sequence from small to large, and determine a frequency distribution histogram of the ordered material characteristic value sequence;
a probability density function selecting unit for selecting a plurality of probability density functions according with the frequency distribution shown by the frequency distribution histogram;
the model parameter fitting unit is used for solving the parameters of each probability density function by adopting maximum likelihood estimation according to the material characteristic value sequence, and fitting to obtain a plurality of probability distribution models;
the fitting result checking unit is used for carrying out goodness-of-fit checking on each probability distribution model and screening out the probability distribution models meeting goodness-of-fit checking conditions;
the statistical result calculating unit is used for calculating the statistical analysis result of each screened probability distribution model; the statistical analysis result comprises expectation, variance, skewness, kurtosis, median and standard deviation;
and the curve standardization unit is used for standardizing each screened probability distribution model to obtain a probability distribution model curve superposed with the frequency distribution histogram.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210305030.8A CN114627979B (en) | 2022-03-25 | 2022-03-25 | Method and system for determining probability distribution information of biomass material characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114627979A true CN114627979A (en) | 2022-06-14 |
CN114627979B CN114627979B (en) | 2024-07-16 |
Family
ID=81903580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210305030.8A Active CN114627979B (en) | 2022-03-25 | 2022-03-25 | Method and system for determining probability distribution information of biomass material characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114627979B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |