Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present application is to provide a method and a system for automatically identifying a pollution source analysis result, which are used for solving the problems of complex analysis process and low accuracy of analysis result in the prior art.
An embodiment of the application provides a method for automatically identifying a pollution source analysis result, comprising the following steps: obtaining the analysis results of various pollutant sources from the literature and forming a sample set from them, wherein each analysis result includes the category to which the pollutant belongs; dividing the sample set, and generating a test data set and a training data set according to the division result; processing the training data set by means of a hyperparameter test, and obtaining an optimal parameter k value according to the processing result; acquiring an instance data set whose category is to be determined; determining the distance between the instance data set and each sample in the sample set; obtaining, according to the optimal parameter k value, the samples in the training data set whose distances rank in the first k; and determining the category of the instance data set from the categories of those first k samples.
An embodiment of the application also provides a system for automatically identifying a pollution source analysis result, comprising: an acquisition module for obtaining the analysis results of various pollutant sources from the literature and forming a sample set from them, wherein each analysis result includes the category to which the pollutant belongs, and for acquiring an instance data set whose category is to be determined; a processing module for dividing the sample set and generating a test data set and a training data set according to the division result, and for processing the training data set by means of a hyperparameter test and obtaining an optimal parameter k value according to the processing result; and a determining module for determining the distance between the instance data set and each sample in the sample set, obtaining, according to the optimal parameter k value, the samples in the training data set whose distances rank in the first k, and determining the category of the instance data set from the categories of those first k samples.
An embodiment of the application also provides a server, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for automatically identifying a pollution source analysis result as described above.
An embodiment of the application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for automatically identifying a pollution source analysis result as described above.
Compared with the prior art, the main differences and effects of the embodiments of the application are as follows: the training data set within a sample set formed from the analysis results of various pollutant sources in the literature is processed by means of a hyperparameter test, and an optimal parameter k value is obtained according to the processing result; the distance between the instance data set and each sample in the sample set is determined; the samples in the training data set whose distances rank in the first k are obtained according to the optimal parameter k value; and finally the category of the instance data set is obtained from the categories of those first k samples. The analysis result of the pollutant sources is thereby obtained efficiently and accurately, the technical barrier to source analysis is lowered, and the problems of the prior art, namely its complexity and its dependence on hardware resource allocation, are solved.
As a further improvement, after processing the training data set by means of the hyperparameter test and obtaining the optimal parameter k value according to the processing result, and before acquiring the instance data set whose category is to be determined, the method comprises: testing the optimal parameter k value with the test data set, and determining from the test result whether the selected k value is indeed optimal.
As a further improvement, determining the distance between the instance data set and each sample in the sample set comprises: determining the distance between the instance data set and each sample in the sample set according to a metric distance mode.
As a further improvement, the metric distance mode includes a Euclidean distance mode, a Minkowski distance mode, and a Manhattan distance mode.
As a further improvement, determining the distance between the instance data set and each sample in the sample set according to the metric distance mode comprises: determining the distance according to the formula corresponding to the Euclidean distance mode:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
wherein x_i is the i-th characteristic factor in the instance data set whose category is to be determined, y_i is the i-th characteristic factor of a sample in the sample set, and m is the number of characteristic factors; or determining the distance according to the formula corresponding to the Minkowski distance mode:
$$d(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^{p} \right)^{1/p}$$
wherein x_i, y_i and m are as defined for the Euclidean distance mode and p is a variable, the formula becoming the formula corresponding to the Euclidean distance mode when p = 2; or determining the distance according to the formula corresponding to the Manhattan distance mode:
$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$
wherein x_i, y_i and m are as defined for the Euclidean distance mode, this formula being the formula corresponding to the Minkowski distance mode with p = 1.
According to this scheme, the distance between the instance data set and each sample in the sample set can be calculated by the Euclidean distance mode, the Minkowski distance mode or the Manhattan distance mode, so that the category of the instance data set can be conveniently obtained from the categories of the samples in the sample set that lie closest to it.
As a further improvement, after determining the distance between the instance data set and each sample in the sample set, and before obtaining, according to the optimal parameter k value, the samples in the training data set whose distances rank in the first k, the method comprises: arranging the distances between the samples in the sample set and the instance data set in ascending order.
As a further improvement, determining the category of the instance data set from the categories of the first k samples in the training data set comprises: determining, according to the majority voting principle, the category that occurs most often among the first k samples; and taking that most frequent category as the category of the instance data set.
According to this scheme, the category that occurs most often among the first k samples in the training data set is obtained by the majority voting principle, and the category of the instance data set is then obtained from it, which achieves the aim of determining the category of the instance data set.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application with reference to specific examples. The application may also be practiced or applied through other, different embodiments, and the details in this description may be modified or varied on the basis of different viewpoints and applications without departing from the spirit of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
It should be noted that the drawings provided with the following embodiments illustrate the basic concept of the present application only schematically. They show only the components related to the present application and are not drawn according to the number, shape and size of the components in an actual implementation; in an actual implementation the form, number and proportion of the components may be changed arbitrarily, and the layout of the components may be more complicated.
The first embodiment of the application relates to a method for automatically identifying a pollution source analysis result. The flow is shown in fig. 1, and is specifically as follows:
Step 101, obtaining analysis results of various pollutant sources from the literature, and forming a sample set from these analysis results.
Specifically, the analysis result includes the category of the pollutant; a part of the sample set is shown in fig. 5. In the present application the pollutants are mainly atmospheric pollutants.
Step 102, dividing the sample set, and generating a test data set and a training data set according to the division result.
Specifically, 20% of the sample set is used as a test data set, 80% of the sample set is used as a training data set, and in practical application, the sample set can be divided into test data sets and training data sets in other proportions according to specific conditions.
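The 80/20 division described in step 102 can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_sample_set`, the `(label, features)` tuple layout, the shuffling before the split, and the toy data are all hypothetical and are not taken from the application.

```python
import random

def split_sample_set(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the sample set and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical sample set: (category label, component contribution values).
sample_set = [("source_%d" % (i % 3), [float(i), float(i + 1)]) for i in range(10)]
training_set, test_set = split_sample_set(sample_set)
```

Other proportions, as the text allows, are obtained by passing a different `test_fraction`.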
Step 103, processing the training data set by means of the hyperparameter test, and obtaining the optimal parameter k value according to the processing result.
Specifically, k is a positive integer with 1 ≤ k ≤ 20. The hyperparameter test creates a list of candidate values for each parameter, randomly draws values from these lists, and processes the training data set with each randomly drawn combination of hyperparameters. The results of the different trials are then compared, and the parameter value with the highest accuracy is taken as the optimal parameter k value. The initial candidate range of k may be narrowed or widened according to experience.
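A minimal sketch of such a randomized search over k is given below. The text does not say how the accuracy of each trial is measured, so leave-one-out accuracy on the training set is used here purely as an assumption; the function names, the Euclidean metric, the tie rule and the toy data are likewise hypothetical.

```python
import math
import random
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, features, k):
    """train is a list of (label, features); majority label of the k nearest."""
    nearest = sorted(train, key=lambda s: euclidean(features, s[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]

def loo_accuracy(train, k):
    """Leave-one-out accuracy on the training set (an assumed scoring rule)."""
    hits = 0
    for i, (label, features) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(train)

def search_k(train, candidates=range(1, 21), n_trials=10, seed=0):
    """Randomly draw candidate k values and return (best_k, best_accuracy)."""
    # k cannot exceed the number of samples left after holding one out.
    usable = [k for k in candidates if k < len(train)]
    if not usable:
        raise ValueError("training set too small for the candidate k values")
    tried = random.Random(seed).sample(usable, min(n_trials, len(usable)))
    scores = {k: loo_accuracy(train, k) for k in tried}
    best_k = max(sorted(scores), key=scores.get)  # ties go to the smaller k
    return best_k, scores[best_k]

# Hypothetical training set with two well-separated categories.
train = [("A", [0.0, 0.1 * i]) for i in range(6)] + \
        [("B", [5.0, 0.1 * i]) for i in range(6)]
best_k, best_acc = search_k(train)
```

Narrowing or widening the candidate range, as the text suggests, corresponds to changing the `candidates` argument.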
Step 104, obtaining an instance data set of the category to be determined.
Specifically, the instance data of the category to be determined is acquired, and the instance data is formed into an instance data set.
Step 105, determining a distance between the instance dataset and each sample in the sample set.
Specifically, the distance between the instance data set and each sample in the sample set is determined according to a metric distance mode. The underlying principle is that if most of the k samples most similar to a given sample in the feature space (i.e., its k nearest neighbors) belong to a certain category, then that sample also belongs to that category. The nearest neighbors of a sample are found by means of the metric distance mode, which includes the Euclidean distance mode, the Minkowski distance mode, and the Manhattan distance mode.
Step 106, obtaining, according to the optimal parameter k value, the samples in the training data set whose distances rank in the first k.
Specifically, the distance herein refers to the distance between the example dataset and each sample in the sample set.
Step 107, determining the category of the instance data set from the categories of the first k samples in the training data set.
Specifically, the category that occurs most often among the first k samples in the training data set is determined according to the majority voting principle, and that category is then taken as the category of the instance data set.
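The majority voting of step 107 can be illustrated with a short sketch. The function name and the example labels are hypothetical, and the handling of ties (the label seen first wins) is an assumption, since the text does not address ties.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label occurring most often among the k nearest samples.

    Ties go to the label seen first, i.e. to the nearer neighbor when the
    labels are passed in ascending order of distance (an assumed tie rule).
    """
    if not neighbor_labels:
        raise ValueError("at least one neighbor label is required")
    return Counter(neighbor_labels).most_common(1)[0][0]

# Hypothetical labels of the k = 3 nearest samples, nearest first.
labels_of_k_nearest = ["paint solvents", "petrochemical", "paint solvents"]
predicted = majority_vote(labels_of_k_nearest)
```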
According to this embodiment, the training data set within the sample set formed from the analysis results of various pollutant sources in the literature is processed by means of the hyperparameter test, and the optimal parameter k value is obtained according to the processing result; the distance between the instance data set and each sample in the sample set is determined; the samples in the training data set whose distances rank in the first k are obtained according to the optimal parameter k value; and finally the category of the instance data set is obtained from the categories of those first k samples. The analysis result of the pollutant sources can thus be obtained efficiently and accurately, the technical barrier to source analysis is lowered, and the problems of the prior art, namely its complexity and its dependence on hardware resource allocation, are solved.
A second embodiment of the present application relates to a method for automatically identifying a pollution source analysis result. The second embodiment elaborates on the first embodiment as a whole; in particular, it specifies the concrete procedure for determining the category of the instance data set from the categories of the first k samples in the training data set.
Referring to fig. 2, the present embodiment includes the following steps, which are described as follows:
Steps 201 to 203 are similar to steps 101 to 103 in the first embodiment, and are not described here again.
Step 204, testing the optimal parameter k value with the test data set, and determining from the test result whether the selected k value is indeed optimal.
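One way the check in step 204 might look is sketched below. The text does not define the criterion for confirming k, so the rule used here (no other candidate k scores strictly higher on the test data set) is an assumption, as are the function names, the Euclidean metric via `math.dist` (Python 3.8+) and the toy data.

```python
import math
from collections import Counter

def knn_predict(train, features, k):
    """train is a list of (label, features); majority label of the k nearest."""
    nearest = sorted(train, key=lambda s: math.dist(features, s[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]

def holdout_accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for label, features in test)
    return hits / len(test)

def confirm_k(train, test, chosen_k, candidates):
    """True if no candidate k beats chosen_k on the test set (assumed criterion)."""
    chosen = holdout_accuracy(train, test, chosen_k)
    return all(holdout_accuracy(train, test, k) <= chosen for k in candidates)

# Hypothetical training and test data with two categories.
train = [("A", [0.0, 0.0]), ("A", [0.0, 1.0]), ("A", [1.0, 0.0]), ("A", [1.0, 1.0]),
         ("B", [5.0, 5.0]), ("B", [5.0, 6.0]), ("B", [6.0, 5.0]), ("B", [6.0, 6.0])]
test = [("A", [0.2, 0.2]), ("B", [5.1, 4.9])]
k_is_confirmed = confirm_k(train, test, chosen_k=3, candidates=[1, 3, 5, 7])
```

If the check fails, the search of step 203 would presumably be repeated with an adjusted candidate range.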
Step 205 is similar to step 104 of the first embodiment, and will not be described again.
Step 206, determining the distance between the example data set and each sample in the sample set according to the metric distance approach.
Specifically, the distance between a sample in the sample set and the instance data set is determined according to the formula corresponding to the Euclidean distance mode:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
wherein x_i is the i-th characteristic factor in the instance data set whose category is to be determined, y_i is the i-th characteristic factor of a sample in the sample set, and m is the number of characteristic factors; or according to the formula corresponding to the Minkowski distance mode:
$$d(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^{p} \right)^{1/p}$$
wherein x_i, y_i and m are as defined for the Euclidean distance mode and p is a variable, the formula becoming the formula corresponding to the Euclidean distance mode when p = 2; or according to the formula corresponding to the Manhattan distance mode:
$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$
wherein x_i, y_i and m are as defined for the Euclidean distance mode, this formula being the formula corresponding to the Minkowski distance mode with p = 1.
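The three metric distance modes, and the way the Minkowski form covers the other two, can be checked with a few lines of Python. The function names and the example vectors are illustrative only.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    """Square root of the summed squared differences (Minkowski with p = 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences (Minkowski with p = 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical characteristic factors of an instance (x) and of a sample (y).
x = [0.0, 3.0]
y = [4.0, 0.0]
d_euclidean = euclidean(x, y)   # 3-4-5 triangle
d_manhattan = manhattan(x, y)   # 4 + 3
```

With these vectors, `minkowski(x, y, 2)` agrees with `d_euclidean` and `minkowski(x, y, 1)` agrees with `d_manhattan`, up to floating-point rounding.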
Step 207, arranging the distances between the samples in the sample set and the instance data set in ascending order.
Specifically, the distance between the instance data set and each sample in the sample set is calculated according to the Euclidean distance mode, the Minkowski distance mode or the Manhattan distance mode; the distances are then arranged in order from small to large, and the samples in the training data set whose distances rank in the first k are obtained. The set formed by these k sample points is denoted n_k(b).
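The ranking step can be sketched as follows, using the Euclidean distance via `math.dist` (available from Python 3.8). The function name `k_nearest`, the `(label, features)` layout and the data are hypothetical.

```python
import math

def k_nearest(instance, samples, k):
    """Return the k (distance, label) pairs closest to `instance`, nearest first."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # Sorting the (distance, label) tuples puts the smallest distances first.
    ranked = sorted((math.dist(instance, features), label)
                    for label, features in samples)
    return ranked[:k]

# Hypothetical samples: (category label, characteristic factors).
samples = [("A", [0.0, 0.0]), ("A", [1.0, 0.0]), ("B", [5.0, 5.0]),
           ("B", [6.0, 5.0]), ("A", [0.0, 2.0])]
neighbors = k_nearest([0.0, 0.5], samples, k=3)
```

The list `neighbors` plays the role of the set of the first k sample points described above.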
Step 208 is similar to step 106 of the first embodiment and will not be described again.
Step 209, determining, according to the majority voting principle, the category that occurs most often among the first k samples in the training data set.
Step 210, determining the category of the instance data set from the category that occurs most often among the first k samples in the training data set.
Specifically, the category that occurs most often among the first k samples in the training data set is selected according to the majority voting principle; this category, denoted w, is taken as the category of the instance data set. The instance data set, labeled with its category w, is then added to the sample set, so that the simulation precision of the application can be continuously improved and optimized.
In practical application, a sample set is constructed first: the analysis results of various pollutant sources are collected and collated from the literature, wherein each analysis result includes the category of the pollutant, i.e., the label, and also the contribution characteristic values of the components of each pollutant source. The analysis results are then made into two initial sample sets (one with 10 major classes and one with 15 subdivided minor classes); in both sample sets the first column holds the category, numbered by label, and columns 2 to n hold the contribution characteristic values of the components.
An instance data set is then constructed: the VOCs component data of Nanjing city to be analyzed are input into PMF software, and the contribution characteristics of each component of the as yet unidentified categories are obtained by simulation. The two sample sets are each divided (20% as the test data set and 80% as the training data set), and the optimal parameter k value is selected by the hyperparameter test. For the major-class training set samples: metric distance = Euclidean distance, k = 3; for the subdivided minor-class training set samples: metric distance = Euclidean distance, k = 5. The instance data set is then simulated using the optimal parameter k value. The result of identification by manual experience is: paint solvents, natural sources, LPG emissions, gasoline vehicle emissions, petrochemical. The major classes are identified as: paint solvents, natural sources, LPG emissions, motor vehicle exhaust emissions (in the major classes, the motor vehicle emission characteristics include gasoline vehicle emissions), petrochemical. The subdivided minor classes are identified as: paint solvents, natural sources, LPG emissions, motor vehicle exhaust emissions (the subdivided minor classes separate motor vehicle exhaust emissions and gasoline vehicle emissions into two label characteristics), petrochemical. The accuracy of major-class identification is therefore 100%, and the accuracy of subdivided minor-class identification is 80%. Finally, the major-class identification result of the instance data set (label + contribution characteristic values of each component) is added to the major-class training set, and the subdivided minor-class identification result is likewise added to the subdivided minor-class training set after manual correction. The analysis result of the instance data set is shown in fig. 6.
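The two accuracy figures quoted above follow directly from comparing the identified labels with the manual ones, as the short calculation below shows. The label strings are taken from the text; the variable names and the `major_of` mapping, which expresses that the major class "motor vehicle exhaust emissions" includes gasoline vehicle emissions, are illustrative assumptions about how the comparison could be coded.

```python
# Labels assigned by manual experience, in the order given in the text.
manual = ["paint solvents", "natural sources", "LPG emissions",
          "gasoline vehicle emissions", "petrochemical"]

# Labels identified with the major-class and the minor-class sample sets.
major_identified = ["paint solvents", "natural sources", "LPG emissions",
                    "motor vehicle exhaust emissions", "petrochemical"]
minor_identified = list(major_identified)

# In the major classes, gasoline vehicle emissions fall under motor vehicle exhaust.
major_of = {"gasoline vehicle emissions": "motor vehicle exhaust emissions"}

def accuracy(identified, reference):
    """Fraction of positions where the identified label equals the reference."""
    return sum(i == r for i, r in zip(identified, reference)) / len(reference)

major_accuracy = accuracy(major_identified, [major_of.get(m, m) for m in manual])
minor_accuracy = accuracy(minor_identified, manual)  # the two vehicle labels differ
```

Four of the five minor-class labels match the manual result, which gives the stated 80%, while all five major-class labels match once the class hierarchy is taken into account.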
According to this embodiment, the distance between the instance data set and each sample in the sample set can be calculated by the Euclidean distance mode, the Minkowski distance mode or the Manhattan distance mode, so that the category of the instance data set can be conveniently obtained from the categories of the samples in the sample set that lie closest to it; the category that occurs most often among the first k samples in the training data set can be obtained according to the majority voting principle, and the category of the instance data set is then obtained from it, which achieves the aim of determining the category of the instance data set.
A third embodiment of the present application relates to a system for automatically identifying a pollution source analysis result, referring to fig. 3, including:
the acquisition module is used for acquiring analysis results of various pollutant sources in the literature and forming a sample set from the analysis results of the various pollutant sources in the literature, wherein the analysis results comprise the category of the pollutant, and acquiring an instance data set of the category to be determined;
the processing module is used for dividing the sample set and generating a test data set and a training data set according to the division result, and for processing the training data set by means of the hyperparameter test and obtaining the optimal parameter k value according to the processing result;
the determining module is used for determining the distance between the instance data set and each sample in the sample set; obtaining, according to the optimal parameter k value, the samples in the training data set whose distances rank in the first k; and determining the category of the instance data set from the categories of those first k samples.
It is to be noted that this embodiment is a system embodiment corresponding to the first embodiment, and can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not described again here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first embodiment.
It should be noted that each module in this embodiment is a logic module; in practical application, one logic unit may be one physical unit, a part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem addressed by the present application are not introduced in this embodiment, but this does not mean that no other units are present in this embodiment.
A fourth embodiment of the present application relates to a server, referring to fig. 4, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for automatically identifying a pollution source analysis result as described above.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.
A fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method embodiments.
That is, it will be understood by those skilled in the art that all or part of the steps of the methods in the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods in the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In summary, the application processes the training data set within the sample set formed from the analysis results of various pollutant sources in the literature by means of the hyperparameter test, obtains the optimal parameter k value according to the processing result, determines the distance between the instance data set and each sample in the sample set, obtains the samples in the training data set whose distances rank in the first k according to the optimal parameter k value, and finally obtains the category of the instance data set from the categories of those first k samples. The analysis result of the pollutant sources is thereby obtained efficiently and accurately, the technical barrier to source analysis is lowered, and the problems of the prior art, namely its complexity and its dependence on hardware resource allocation, are solved. The application therefore effectively overcomes various defects of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles of the present application and its effects, and are not intended to limit the application. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the application. Accordingly, all equivalent modifications and variations made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the application.