CN112348644A - Abnormal logistics order detection method by establishing monotonous positive correlation filter screen - Google Patents

Publication number
CN112348644A
Authority
CN
China
Prior art keywords
distribution
filter screen
fee
abnormal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011282131.5A
Other languages
Chinese (zh)
Other versions
CN112348644B (en)
Inventor
杨云丽
杨雪荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pinjian Intelligent Technology Co ltd
Original Assignee
Shanghai Pinjian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pinjian Intelligent Technology Co ltd
Priority to CN202011282131.5A
Publication of CN112348644A
Application granted
Publication of CN112348644B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 Shipping
    • G06Q10/0834 Choice of carriers
    • G06Q10/08345 Pricing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for detecting abnormal logistics orders by establishing a monotone positive correlation filter screen, which comprises the following steps: acquiring a data source, including basic data of the goods source and the delivery fee; performing necessary data cleaning, including missing value processing and necessary format conversion; calculating the correlation among the variables, analyzing the independent variables that are positively correlated with the delivery fee, and determining the important influencing variables; binning each independent variable and dividing it into interval levels; combining the interval levels of the independent variables, analyzing the distribution of delivery fees in each interval combination, and determining the upper and lower bound values of each combination, thereby preliminarily forming a filter screen; and correcting the filter screen by using the positive correlation, so that the lower bound of the delivery fee does not decrease as the interval level increases. The invention effectively improves the accuracy of abnormal logistics order detection while ensuring interpretability and flexibility.

Description

Abnormal logistics order detection method by establishing monotonous positive correlation filter screen
Technical Field
The invention relates to the technical field of abnormal value detection, in particular to an abnormal logistics order detection method by establishing a monotone positive correlation filter screen.
Background
Outlier detection, also known as anomaly detection, is one of the core problems of data mining. For a one-dimensional sequence of observations, outliers are usually relatively easy to identify using statistical distributions, box plots, and the like; for multidimensional data, however, a more complex model over the variables often has to be built to detect outliers.
In logistics transportation and delivery, whether the delivery fee of a delivery order is real and reasonable directly affects the downstream analysis and application of logistics transaction orders. Outlier detection methods are commonly used to detect abnormal delivery fees in delivery orders.
Commonly used outlier detection methods, such as the 3σ criterion (a probabilistic statistical method) and isolation forest (a machine learning method), detect outliers from the perspective of data distribution or data density. However, within a certain range, the delivery fee of goods generally increases with important influencing factors such as cargo weight and mileage. Commonly used outlier detection methods therefore cannot effectively detect abnormal logistics orders whose delivery fee decreases as these influencing factors increase.
Disclosure of Invention
The invention aims to provide an abnormal logistics order detection method by establishing a monotone positive correlation filter screen, so as to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
A method for detecting abnormal logistics orders by establishing a monotone positive correlation filter screen comprises the following steps:
Acquiring a data source comprising basic data of the goods source and the delivery fee.
Performing necessary data cleaning, including missing value processing and necessary format conversion.
Calculating the correlation among the variables, analyzing the independent variables that are positively correlated with the delivery fee, and determining the important influencing variables.
Binning each independent variable and dividing it into interval levels.
Combining the interval levels of the independent variables, analyzing the distribution of delivery fees in each interval combination, and determining the upper and lower bound values of each combination, thereby preliminarily forming a filter screen.
Correcting the filter screen by using the positive correlation, so that the lower bound of the delivery fee does not decrease as the interval level increases and the upper bound of the delivery fee does not increase as the interval level decreases, thereby forming a filter screen model.
Establishing a delivery fee prediction model and comparing model performance to demonstrate the necessity and effectiveness of removing abnormal samples with the filter screen model; detecting the logistics transaction order data includes adjusting the parameters using the transaction orders from which dirty data has been screened out, thereby optimizing the detection performance of the filter screen model.
1. Acquire a data source. The main information is the basic information of the goods source, such as the origin, destination, required vehicle length, cargo weight, cargo volume, cargo type, vehicle type (such as unrestricted, ordinary, flatbed, high-sided, van, and the like), delivery mileage, expressway use, charging mode (such as per trip, per ton, per cubic meter, per piece, and the like), loading mode, and the like. Auxiliary fields such as the delivery mileage (the mileage between the origin and the destination) and the goods category are derived from it. Each completed transaction is associated with its delivery fee to form a wide table.
2. Perform necessary data cleaning, for example: keep only samples in which the cargo weight, required vehicle length, and delivery mileage are not missing; remove transaction samples that occurred on special holidays; apply a one-to-one encoding to discrete fields; and perform the data format conversions needed for correlation analysis. At the same time, draw box plots, histograms, and bar charts to inspect the distribution of each independent variable and of the dependent variable. Fields that still contain missing values are processed as follows: A. independent variables with a missing proportion higher than 30% are removed; B. for independent variables with a missing proportion lower than 30%, continuous variables are filled with the median and discrete variables with the mode.
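The missing-value rules A and B above can be illustrated with a minimal sketch. The column names, the row layout (a list of dictionaries with None for a missing value), and the function name impute_columns are hypothetical. How a proportion of exactly 30% is treated is not stated in the text; here it is assumed to be kept and filled.

```python
from statistics import median, mode


def impute_columns(rows, continuous, discrete, max_missing=0.30):
    """Apply rules A and B to a list of row dicts (None marks a missing value).

    continuous / discrete are hypothetical column-name lists; the source
    does not define a schema.
    """
    n = len(rows)
    fills = {}
    for col in list(continuous) + list(discrete):
        present = [r[col] for r in rows if r.get(col) is not None]
        missing_ratio = 1 - len(present) / n
        # Rule A: drop variables missing in more than 30% of the samples.
        # (Exactly 30% is kept here; that boundary is an assumption.)
        if missing_ratio > max_missing or not present:
            continue
        # Rule B: median for continuous variables, mode for discrete ones.
        fills[col] = median(present) if col in continuous else mode(present)
    return [
        {col: (r[col] if r.get(col) is not None else fill)
         for col, fill in fills.items()}
        for r in rows
    ]


rows = [
    {"weight": 1.0, "vehicle": "van", "volume": None},
    {"weight": None, "vehicle": "van", "volume": None},
    {"weight": 3.0, "vehicle": "flat", "volume": 2.0},
    {"weight": 5.0, "vehicle": None, "volume": None},
]
cleaned = impute_columns(rows, continuous=["weight", "volume"], discrete=["vehicle"])
```

In this toy input, "volume" is missing in 75% of the rows and is dropped, while the single gaps in "weight" and "vehicle" are filled with the median and the mode respectively.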
3. Calculate the correlations between variables:
[Equation (image BDA0002779696580000021): correlation coefficient between X and Y]
where X represents any independent variable and Y represents the delivery fee (freight). The three independent variables that have the highest positive correlation with the delivery fee, with a correlation of not less than 0.8, are selected. If many independent variables are strongly positively correlated with the delivery fee, principal component analysis may be applied to the independent variables for dimensionality reduction; if no independent variable meets the condition, sub-sample sets are screened and analyzed in the same way until the important influencing variables can be determined. In this embodiment, the correlation calculation identifies cargo weight, required vehicle length, and delivery mileage as the independent variables highly correlated with the dependent variable, the delivery fee.
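As an illustration of this selection step, the sketch below assumes that the correlation in question is the Pearson correlation coefficient; the text itself only speaks of "correlation", so this choice is an assumption. The function names and the toy columns are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_drivers(columns, freight, top_k=3, min_corr=0.8):
    """Return up to top_k (name, r) pairs with r >= min_corr, strongest first."""
    scored = [(name, pearson(values, freight)) for name, values in columns.items()]
    strong = [(name, r) for name, r in scored if r >= min_corr]
    strong.sort(key=lambda pair: pair[1], reverse=True)
    return strong[:top_k]


freight = [10, 20, 30, 40]
columns = {
    "mileage": [1, 2, 3, 4],   # rises with the fee
    "noise": [2, 1, 2, 1],     # weakly related
    "inverse": [4, 3, 2, 1],   # falls as the fee rises
}
selected = select_drivers(columns, freight)
```

Only positively correlated variables pass the 0.8 threshold, so a strongly negative correlation such as "inverse" is excluded, matching the requirement of positive correlation with the delivery fee.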
4. Bin each independent variable and divide it into interval levels. The delivery mileage is divided into 7 interval levels, the cargo weight into 7 interval levels, and the required vehicle length into 6 interval levels; the cut-off values and the number of levels are determined according to the business and the distribution of the goods sources.
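The binning step can be sketched as a lookup of each value in a sorted list of cut-off values. Only the level counts (7, 7, 6) come from the text; the cut-off values below are hypothetical placeholders, and treating each interval as closed on the left is also an assumption.

```python
from bisect import bisect_right

# Hypothetical cut-off values: n cut-offs give n + 1 interval levels.
MILEAGE_EDGES = [100, 300, 500, 800, 1200, 2000]  # km, 7 levels
WEIGHT_EDGES = [1, 3, 5, 10, 20, 30]              # tons, 7 levels
LENGTH_EDGES = [4.2, 6.8, 9.6, 13, 17.5]          # meters, 6 levels


def level(value, edges):
    """Return the 0-based interval level of value.

    A value equal to a cut-off falls into the higher level, i.e. intervals
    are [0, 100), [100, 300), ... (an assumption, not stated in the source).
    """
    return bisect_right(edges, value)


def cell_of(mileage_km, weight_t, length_m):
    """Map one order to its (i, j, k) grid cell."""
    return (
        level(mileage_km, MILEAGE_EDGES),
        level(weight_t, WEIGHT_EDGES),
        level(length_m, LENGTH_EDGES),
    )


example_cell = cell_of(mileage_km=250, weight_t=0.5, length_m=4.0)
```

Each order is thereby assigned to one of 7 x 7 x 6 grid cells, which is the unit on which the delivery fee bounds are later computed.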
5. Analyze the distribution of delivery fees in each interval combination and determine the upper and lower bounds of each combination.
Suppose that there are N completed logistics transaction samples within delivery mileage level i, cargo weight level j, and required vehicle length level k, and that the mean and the standard deviation of their delivery fees (freight) are recorded as $\mu_{ijk}$ and $\sigma_{ijk}$ respectively. The delivery fee in this combination is then taken to lie within the interval
$$[\mu_{ijk} - 3\sigma_{ijk},\ \mu_{ijk} + 3\sigma_{ijk}]$$
that is, the lower bound of the delivery fee is $\mu_{ijk} - 3\sigma_{ijk}$ and the upper bound is $\mu_{ijk} + 3\sigma_{ijk}$.
Samples whose delivery fee deviates from the mean by more than three standard deviations are all regarded as abnormal samples.
On this basis, the upper and lower bounds of the delivery fee under each combination of levels are determined, forming a preliminary filter screen.
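A minimal sketch of how the preliminary filter screen could be computed is given below. It assumes that each sample has already been mapped to a grid cell (i, j, k), and it uses the population standard deviation; whether the original uses the population or the sample form is not recoverable from the text, so that choice is an assumption, as are the function and variable names.

```python
from collections import defaultdict
from statistics import mean, pstdev


def three_sigma_bounds(samples):
    """Compute (lower, upper) delivery fee bounds per grid cell.

    samples: iterable of ((i, j, k), freight) pairs.
    """
    by_cell = defaultdict(list)
    for cell, freight in samples:
        by_cell[cell].append(freight)

    bounds = {}
    for cell, fees in by_cell.items():
        mu = mean(fees)
        sigma = pstdev(fees)  # population form (assumption)
        bounds[cell] = (mu - 3 * sigma, mu + 3 * sigma)
    return bounds


samples = [
    ((0, 0, 0), 100), ((0, 0, 0), 200), ((0, 0, 0), 300),
    ((1, 0, 0), 500),
]
prelim = three_sigma_bounds(samples)
```

Cells with very few samples produce narrow or degenerate intervals (a single sample gives lower = upper), which is one reason the bounds are subsequently corrected across neighboring cells.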
6. Correct the filter screen by using the monotone positive correlation.
The basic idea of the monotone positive correlation filter screen is as follows. For logistics orders with the same delivery mileage level and the same cargo weight level, the lower bound of the delivery fee does not decrease as the required vehicle length level increases. Similarly, for logistics orders with the same delivery mileage level and the same required vehicle length level, the lower bound of the delivery fee does not decrease as the cargo weight level increases; and for logistics orders with the same cargo weight level and the same required vehicle length level, the lower bound of the delivery fee does not decrease as the delivery mileage increases. Conversely, the upper bound of the delivery fee does not increase as any of these levels decreases, and under any combination of levels the upper bound of the delivery fee is not less than the lower bound.
The filter screen preliminarily formed in the previous step is corrected based on the same training set data. For example, with the cargo weight within 0-1 ton and the required vehicle length within 0-4.2 m, neither the minimum nor the maximum delivery fee for a delivery mileage of 0-100 km may be higher than that for a delivery mileage of 100-300 km. If the minimum delivery fee for 0-100 km is higher than the minimum delivery fee for 100-300 km, the minimum delivery fee for 100-300 km is changed to that for 0-100 km; if the maximum delivery fee for 0-100 km is higher than the maximum delivery fee for 100-300 km, the maximum delivery fee for 0-100 km is changed to that for 100-300 km.
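The correction rule in this example can be sketched as two sweeps per variable: lower bounds are carried forward from low to high levels as a running maximum, and upper bounds are carried backward from high to low levels as a running minimum. The sketch assumes that every grid cell already has bounds; how cells without samples are handled is not specified in the text. It also does not resolve cells where the corrected lower bound ends up above the upper bound. The function name and the dictionary layout are hypothetical.

```python
from itertools import product


def enforce_monotone(lower, upper, shape):
    """Return corrected copies of lower/upper bound dicts keyed by (i, j, k).

    shape gives the number of levels per variable, e.g. (7, 7, 6).
    """
    lo, hi = dict(lower), dict(upper)
    for axis in range(len(shape)):
        # Lower bounds: a higher level may not have a smaller lower bound
        # than the level just below it.
        for cell in product(*(range(n) for n in shape)):
            if cell[axis] > 0:
                prev = cell[:axis] + (cell[axis] - 1,) + cell[axis + 1:]
                lo[cell] = max(lo[cell], lo[prev])
        # Upper bounds: a lower level may not have a larger upper bound
        # than the level just above it.
        for cell in product(*(range(n - 1, -1, -1) for n in shape)):
            if cell[axis] < shape[axis] - 1:
                nxt = cell[:axis] + (cell[axis] + 1,) + cell[axis + 1:]
                hi[cell] = min(hi[cell], hi[nxt])
    return lo, hi


# Two mileage levels (0-100 km, 100-300 km), one weight and one length level.
lower = {(0, 0, 0): 100, (1, 0, 0): 80}
upper = {(0, 0, 0): 500, (1, 0, 0): 400}
fixed_lower, fixed_upper = enforce_monotone(lower, upper, shape=(2, 1, 1))
```

In the toy input, the 100-300 km lower bound is raised from 80 to 100 and the 0-100 km upper bound is lowered from 500 to 400, mirroring the two replacement rules stated in the example above.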
The levels of the three independent variables are traversed in turn and the upper and lower bounds of the delivery fee are corrected step by step, thereby determining the lower and upper bounds of the delivery fee in each grid cell. Based on the learned monotone positive correlation filter screen model, whether a logistics order is abnormal is judged by whether its delivery fee falls within the model's lower-to-upper bound interval. The parameters of the filter screen are then corrected again using the data set from which the abnormal samples have been removed, forming the final optimized filter screen model.
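The judgment step can be sketched as a simple interval check per grid cell. Treating the interval as closed (a fee exactly on a bound counts as normal) is an assumption, and the function names are hypothetical.

```python
def is_abnormal(freight, cell, bounds):
    """True if the delivery fee lies outside the cell's [lower, upper] interval."""
    lower, upper = bounds[cell]
    return not (lower <= freight <= upper)


def split_orders(orders, bounds):
    """Separate (cell, freight) pairs into normal and abnormal lists."""
    normal, abnormal = [], []
    for cell, freight in orders:
        target = abnormal if is_abnormal(freight, cell, bounds) else normal
        target.append((cell, freight))
    return normal, abnormal


bounds = {(0, 0, 0): (100.0, 400.0)}
orders = [((0, 0, 0), 50), ((0, 0, 0), 100), ((0, 0, 0), 450)]
normal, abnormal = split_orders(orders, bounds)
```

The normal list is what would be fed back into the bound estimation and correction steps when the filter screen parameters are corrected again.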
7. Establish a delivery fee prediction model and compare model performance to demonstrate the necessity and effectiveness of removing abnormal samples with the filter screen model.
The abnormal order samples detected by the monotone positive correlation filter screen model are checked manually, and the cause and plausibility of each anomaly are investigated and analyzed. The effect of the filter screen detection method is preliminarily assessed by comparing the coverage of the two, where the coverage rate is defined as follows:
[Equation (image BDA0002779696580000041): definition of the coverage rate]
where # denotes count.
By the control variable method, the same data is used in two ways: 1) a regression model for predicting the delivery fee is established directly; 2) a filter screen model is first established on the data, abnormal transaction samples are screened out, and a delivery fee prediction model is then established. The Accuracy of the predicted delivery fee under the two regression models is compared to judge the necessity and effectiveness of the filter screen model.
Using an XGBoost model as the regressor of the delivery fee prediction model, the evaluation index MAPE of the model prediction effect is defined as:
[Equation (image BDA0002779696580000042): definition of MAPE with parameter α]
When α ∈ [0, 0.5), the penalty for samples with larger predicted values is greater than the penalty for samples with smaller predicted values;
when α = 0.5, the penalty for samples with smaller predicted values is the same as that for samples with larger predicted values;
when α ∈ (0.5, 1), the penalty for samples with smaller predicted values is greater than the penalty for samples with larger predicted values.
For the i-th sample, the MAPE value is denoted $\mathrm{MAPE}_i$. Accuracy is defined as the proportion of samples whose $\mathrm{MAPE}_i$ is less than a certain threshold, i.e.
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I(\mathrm{MAPE}_i < \beta)$$
where I(·) is the indicator function, N is the sample size, and β is the maximum tolerable relative error, which serves as the threshold for evaluating accuracy. A smaller β means a stricter evaluation of the model prediction effect; β may typically be 5%, 10%, 20%, 30%, or the like. Experiments show that a higher Accuracy is obtained when the delivery fee prediction model is established on data from which abnormal transactions have been removed, which further demonstrates the necessity and effectiveness of removing abnormal samples with the filter screen model.
The upper and lower bounds of the delivery fee for a given combination of interval levels in the filter screen model for detecting abnormal transaction data are determined from the distribution of transaction samples, and are consistent with the business fact that the farther the delivery mileage, the heavier the goods, and the greater the required vehicle length, the higher the delivery fee. Meanwhile, delivery fee prediction models built with and without screening out abnormal samples serve as a controlled comparison experiment, providing an evaluation method that demonstrates the effectiveness of the monotone positive correlation filter screen detection method.
Compared with the prior art, the beneficial effects of the invention are as follows: based on logistics order business data and combined with business experience, the method bins several important factors that influence the delivery fee, establishes a monotone positive correlation filter screen on this basis, optimizes it iteratively, and then screens out abnormal logistics orders according to the learned filter screen.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 shows basic data and fields according to an embodiment of the present invention.
FIG. 3 shows the independent variable binning rule provided by an embodiment of the present invention.
FIG. 4 is a schematic diagram of the method for determining the upper and lower bounds of the delivery fee for a given interval combination according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a monotone positive correlation filter screen model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Referring to FIG. 1, in an embodiment of the present invention, a method for detecting abnormal logistics orders by establishing a monotone positive correlation filter screen includes the following steps:
Acquiring a data source comprising basic data of the goods source and the delivery fee.
Performing necessary data cleaning, including missing value processing and necessary format conversion.
Calculating the correlation among the variables, analyzing the independent variables that are positively correlated with the delivery fee, and determining the important influencing variables.
Binning each independent variable and dividing it into interval levels.
Combining the interval levels of the independent variables, analyzing the distribution of delivery fees in each interval combination, and determining the upper and lower bound values of each combination, thereby preliminarily forming a filter screen.
Correcting the filter screen by using the positive correlation, so that the lower bound of the delivery fee does not decrease as the interval level increases and the upper bound of the delivery fee does not increase as the interval level decreases, thereby forming a filter screen model.
Establishing a delivery fee prediction model and comparing model performance to demonstrate the necessity and effectiveness of removing abnormal samples with the filter screen model; detecting the logistics transaction order data includes adjusting the parameters using the transaction orders from which dirty data has been screened out, thereby optimizing the detection performance of the filter screen model.
The anomaly detection results of the filter screen model are compared with the results of manual anomaly detection. Meanwhile, a delivery fee prediction model is established, and a control variable experiment on whether the filter screen model is used for abnormal data detection shows the effect on model performance, further demonstrating the effectiveness and necessity of the improved outlier detection method of the embodiment of the invention.
The abnormal logistics order detection method is designed based on the observation that the goods-source variables (cargo weight, required vehicle length, delivery mileage, and the like) are positively correlated with the delivery fee (freight). The three independent variables are each binned into several intervals, the upper and lower bounds of the delivery fee in each specific interval combination are analyzed, and the bounds in each combination are corrected using the monotone positive correlation to form a set of filter screens. Meanwhile, under a strategy of continued screening and iteration, the accuracy of abnormal sample detection is improved while interpretability and flexibility are effectively ensured.
In the embodiment of the invention, the detection of the abnormal logistics order needs to meet the following constraint conditions:
1. Within a certain range of cargo weight, required vehicle length, and delivery mileage, the delivery fee is relatively stable and approximately follows a normal distribution.
2. When any two of the three variables (e.g., cargo weight and required vehicle length) are fixed, the delivery fee increases correspondingly as the third variable (e.g., delivery mileage) increases.
3. Among completed logistics orders, some delivery fees do not match the basic information of the goods source because of the collection process or human factors; that is, abnormal data exists.
4. In the logistics order sample data used to obtain the filter screen model, no factor other than cargo weight, required vehicle length, and delivery mileage is strongly correlated with the delivery fee; otherwise, the samples are divided based on such factors and a data set on which the filter screen model can be trained is extracted.
In addition, the independent variable segmentation scheme in the filter screen model is designed according to the logistics delivery business so as to meet practical application requirements.
In the embodiment of the invention, historically collected transaction order information is used to analyze the variables positively correlated with the delivery fee, and the filter screen model is determined based on the control variable method, the 3σ principle, the monotone positive correlation, and the like. Abnormal transaction data is thereby detected and screened out, so that the data is more authentic and more reliable for downstream applications.
The following describes a model in an embodiment of the present invention.
1. Acquire a data source. The main information is the basic information of the goods source, such as the origin, destination, required vehicle length, cargo weight, cargo volume, cargo type, vehicle type (such as unrestricted, ordinary, flatbed, high-sided, van, and the like), delivery mileage, expressway use, charging mode (such as per trip, per ton, per cubic meter, per piece, and the like), loading mode, and the like. Auxiliary fields such as the delivery mileage (the mileage between the origin and the destination) and the goods category are derived from it. Each completed transaction is associated with its delivery fee to form a wide table.
2. Perform necessary data cleaning, for example: keep only samples in which the cargo weight, required vehicle length, and delivery mileage are not missing; remove transaction samples that occurred on special holidays; apply a one-to-one encoding to discrete fields; and perform the data format conversions needed for correlation analysis. At the same time, draw box plots, histograms, and bar charts to inspect the distribution of each independent variable and of the dependent variable. Fields that still contain missing values are processed as follows: A. independent variables with a missing proportion higher than 30% are removed; B. for independent variables with a missing proportion lower than 30%, continuous variables are filled with the median and discrete variables with the mode.
3. Calculate the correlations between variables:
[Equation (image BDA0002779696580000081): correlation coefficient between X and Y]
where X represents any independent variable and Y represents the delivery fee (freight). The three independent variables that have the highest positive correlation with the delivery fee, with a correlation of not less than 0.8, are selected. If many independent variables are strongly positively correlated with the delivery fee, principal component analysis may be applied to the independent variables for dimensionality reduction; if no independent variable meets the condition, sub-sample sets are screened and analyzed in the same way until the important influencing variables can be determined. In this embodiment, the correlation calculation identifies cargo weight, required vehicle length, and delivery mileage as the independent variables highly correlated with the dependent variable, the delivery fee.
4. Bin each independent variable and divide it into interval levels. As shown in FIG. 3, the binning rule for the monotone positively correlated independent variables divides the delivery mileage into 7 interval levels, the cargo weight into 7 interval levels, and the required vehicle length into 6 interval levels; the cut-off values and the number of levels are determined according to the business and the distribution of the goods sources.
5. Analyze the distribution of delivery fees in each interval combination and determine the upper and lower bounds of each combination.
In anomaly monitoring, the overall level of an indicator can be measured by its mean, and the variance gives the normal fluctuation range allowed for its values. The probability that a data point deviates from the overall level by more than three standard deviations is very small, and such an occurrence is treated as a "small probability event". FIG. 4 shows the scheme for determining the upper and lower bounds of the delivery fee. Suppose that there are N completed logistics transaction samples within delivery mileage level i, cargo weight level j, and required vehicle length level k, and that the mean and the standard deviation of their delivery fees (freight) are recorded as $\mu_{ijk}$ and $\sigma_{ijk}$ respectively. The delivery fee in this combination is then taken to lie within the interval
$$[\mu_{ijk} - 3\sigma_{ijk},\ \mu_{ijk} + 3\sigma_{ijk}]$$
that is, the lower bound of the delivery fee is $\mu_{ijk} - 3\sigma_{ijk}$ and the upper bound is $\mu_{ijk} + 3\sigma_{ijk}$.
Samples whose delivery fee deviates from the mean by more than three standard deviations are all regarded as abnormal samples.
On this basis, the upper and lower bounds of the delivery fee under each combination of levels are determined, forming a preliminary filter screen.
6. Correct the filter screen using the monotonic positive correlation.
Fig. 5 is a schematic diagram of the monotone positive correlation filter screen model. The basic idea of the filter screen is as follows: for logistics orders with the same delivery mileage level and the same cargo weight level, the lower bound of the delivery fee should not decrease as the required vehicle length level increases; similarly, for orders with the same delivery mileage level and the same required vehicle length level, the lower bound should not decrease as the cargo weight level increases; and for orders with the same cargo weight level and the same required vehicle length level, the lower bound should not decrease as the delivery mileage level increases. Conversely, the upper bound of the delivery fee should not increase as any of these levels decreases, and at every combination of levels the upper bound must not be less than the lower bound.
The filter screen preliminarily formed in the previous step is corrected on the same training set. For example, with cargo weight within 0-1 ton and required vehicle length within 0-4.2 m, neither the lowest nor the highest delivery fee for a delivery mileage of 0-100 km may exceed the corresponding value for 100-300 km. If the lowest delivery fee for 0-100 km is higher than that for 100-300 km, the lowest delivery fee for 100-300 km is set to that of 0-100 km; if the highest delivery fee for 0-100 km is higher than that for 100-300 km, the highest delivery fee for 0-100 km is set to that of 100-300 km.
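The pairwise rule in this example extends naturally to a whole sequence of levels of one variable. The sketch below is one possible reading, with a hypothetical function name and made-up bounds: lower bounds are carried upward from low to high levels, and upper bounds are capped from high to low levels. The traversal direction for the upper bounds is an assumption chosen so that a single pass suffices.

```python
def correct_along_levels(lower, upper):
    """Enforce monotone bounds across increasing levels of one variable.

    lower[n] and upper[n] are the bounds at level n, with the other two
    variables held at fixed levels. New lists are returned.
    """
    lower, upper = list(lower), list(upper)
    # Lower bounds: walk upward, raising any bound below the previous level's.
    for n in range(1, len(lower)):
        if lower[n - 1] > lower[n]:
            lower[n] = lower[n - 1]
    # Upper bounds: walk downward, capping any bound above the next level's.
    for n in range(len(upper) - 2, -1, -1):
        if upper[n] > upper[n + 1]:
            upper[n] = upper[n + 1]
    return lower, upper

# Made-up bounds for mileage levels 0-100 km, 100-300 km and one further level.
new_lower, new_upper = correct_along_levels([120, 100, 150], [400, 350, 600])
```

In this toy case the 100-300 km lower bound is raised from 100 to 120, and the 0-100 km upper bound is lowered from 400 to 350, matching the two corrections described in the text.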
The levels of the three independent variables are traversed in turn and the upper and lower bounds of the delivery fee are corrected step by step, thereby determining the lower and upper bounds of the delivery fee in each grid cell. Whether a logistics order is abnormal is then judged by whether its delivery fee falls within the upper and lower bounds given by the learned monotone positive correlation filter screen model. The parameters of the filter screen are then corrected again using the data set from which the abnormal samples have been removed, forming the final optimized filter screen model.
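One way to traverse the three variables and then test an order is sketched below. The data layout (dictionaries keyed by level triples covering every cell of the grid), the function names, and the bound values are assumptions made for illustration; the text does not prescribe them, nor how a cell whose lower bound ends up above its upper bound should be resolved, so the sketch only reports such cells.

```python
from itertools import product

def correct_screen(lower, upper, shape):
    """Apply the monotone correction in place along each of the three axes in turn."""
    for axis in range(3):
        others = [range(shape[a]) for a in range(3) if a != axis]
        for fixed in product(*others):
            def cell(n):
                idx = list(fixed)
                idx.insert(axis, n)  # put the varying level back at its axis
                return tuple(idx)
            for n in range(1, shape[axis]):           # raise lower bounds upward
                lower[cell(n)] = max(lower[cell(n)], lower[cell(n - 1)])
            for n in range(shape[axis] - 2, -1, -1):  # cap upper bounds downward
                upper[cell(n)] = min(upper[cell(n)], upper[cell(n + 1)])
    return lower, upper

def is_abnormal(fee, cell, lower, upper):
    """An order is abnormal when its fee lies outside the cell's bounds."""
    return not (lower[cell] <= fee <= upper[cell])

# Made-up preliminary bounds on a 2 x 2 x 1 grid.
shape = (2, 2, 1)
lower = {(0, 0, 0): 100, (1, 0, 0): 80, (0, 1, 0): 90, (1, 1, 0): 200}
upper = {(0, 0, 0): 500, (1, 0, 0): 400, (0, 1, 0): 450, (1, 1, 0): 900}
correct_screen(lower, upper, shape)
inconsistent = [c for c in lower if lower[c] > upper[c]]
```

Taking a running maximum (or minimum) along one axis preserves monotonicity already established along the others, so one pass per axis is enough in this sketch. After correction, a fee of 95 in cell `(0, 1, 0)` is flagged because that cell's lower bound has been raised from 90 to 100.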
7. Establish a delivery fee prediction model and compare model performance to demonstrate the necessity and effectiveness of removing abnormal samples with the filter screen model.
The abnormal order samples detected by the monotone positive correlation filter screen model are checked manually, and the cause and reasonableness of each anomaly are investigated and analyzed. The effect of the filter screen detection method is preliminarily determined by comparing the two sets of results through the coverage rate, which is defined as follows:
Figure BDA0002779696580000101
where # denotes count.
Using the control variable method, the same data is used in two ways: 1) a regression model for predicting the delivery fee is built directly; 2) a filter screen model is first built from the data and the abnormal transaction samples are screened out, and then a delivery fee prediction model is built. The Accuracy of the predicted delivery fee under the two regression models is compared to judge the necessity and effectiveness of the filter screen model.
Using an XGBoost model as the regressor of the delivery fee prediction model, the evaluation index MAPE of the model's prediction performance is defined as:
Figure BDA0002779696580000102
when α ∈ [0, 0.5), samples with larger predicted values are penalized more heavily than samples with smaller predicted values;
when α = 0.5, samples with smaller and larger predicted values are penalized equally;
when α ∈ (0.5, 1), samples with smaller predicted values are penalized more heavily than samples with larger predicted values.
For the i-th sample, the MAPE value is denoted $MAPE_i$. Accuracy is defined as the proportion of samples whose $MAPE_i$ is less than a given threshold, i.e.
$$Accuracy = \frac{1}{N}\sum_{i=1}^{N} I\left(MAPE_i < \beta\right)$$
where I(·) is the indicator function, N is the sample size, and β is the maximum tolerable relative error, which serves as the threshold for evaluating accuracy. A smaller β means a stricter evaluation of the model's prediction performance; β is typically 5%, 10%, 20%, 30%, or the like. Experiments show that a delivery fee prediction model built on the data with abnormal transactions removed achieves a higher Accuracy, which demonstrates the necessity and effectiveness of removing abnormal samples with the filter screen model.
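The Accuracy measure can be illustrated with the short sketch below. Because the α-weighted MAPE formula is an image that is not reproduced in this text, the per-sample error here is the ordinary absolute relative error, used purely as a stand-in; the function names and numbers are hypothetical.

```python
def relative_errors(actual, predicted):
    """Stand-in per-sample error |y - y_hat| / |y| (not the alpha-weighted form)."""
    return [abs(y - p) / abs(y) for y, p in zip(actual, predicted)]

def accuracy(mape_values, beta):
    """Share of samples whose per-sample error is strictly below beta."""
    return sum(1 for m in mape_values if m < beta) / len(mape_values)

# Made-up actual and predicted delivery fees.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [104.0, 230.0, 290.0, 400.0]
errors = relative_errors(actual, predicted)
acc_10 = accuracy(errors, 0.10)  # tolerate up to 10% relative error
acc_03 = accuracy(errors, 0.03)  # stricter threshold
```

In a comparison experiment of the kind described, the same function would be applied to the predictions of the model trained on all data and of the model trained on screened data, at the same β.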
In the filter screen model for detecting abnormal transaction data, the upper and lower bounds of the delivery fee at each combined interval level are determined from the distribution of the transaction samples, and they conform to the business fact that the farther the delivery mileage, the heavier the cargo, and the longer the required vehicle, the higher the delivery fee. In addition, a controlled comparison experiment on the delivery fee prediction model, with and without screening of abnormal samples, provides an evaluation method for demonstrating the effectiveness of the monotone positive correlation filter screen detection method.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (8)

1. An abnormal logistics order detection method by establishing a monotone positive correlation filter screen is characterized by comprising the following steps:
acquiring a data source, including basic data of a goods source and a delivery fee;
performing necessary data cleaning, including missing value processing and necessary format conversion;
calculating the correlation among the variables, analyzing the independent variables with positive correlation with the distribution fee, and determining important influence variables;
binning each important independent variable and dividing it into interval levels;
analyzing the distribution of delivery fees over each combination of intervals of the respective variables and determining the upper and lower bound values of each group, thereby preliminarily forming a filter screen;
correcting the filter screen using the positive correlation, ensuring that the lower bound of the delivery fee does not decrease as the interval level increases and that the upper bound of the delivery fee does not increase as the interval level decreases, thereby forming a filter screen model;
and establishing a delivery fee prediction model and comparing model performance to demonstrate the necessity and effectiveness of removing abnormal samples with the filter screen model, wherein the step of detecting the logistics transaction order data comprises adjusting the parameters using the transaction orders from which dirty data has been screened out, thereby optimizing the detection effect of the filter screen model.
2. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 1, characterized in that the step of acquiring a data source takes the basic information of the goods source as the main body, including origin, destination, required vehicle length, cargo weight, cargo volume, cargo type, vehicle type, delivery mileage, whether the vehicle is high-speed, charging mode, loading mode and the like, derives auxiliary fields such as delivery mileage and cargo category, and associates the delivery fee of each completed transaction with each order to form a wide table.
3. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 2, characterized in that the step of performing necessary data cleaning comprises: retaining samples in which cargo weight, required vehicle length, and delivery mileage are not missing; removing transaction samples that occurred on special holidays; performing one-to-one code mapping on discrete fields and the data format conversions needed for correlation analysis and the like, while drawing box plots, histograms, and bar charts to inspect the distributions of the independent variables and the dependent variable; and processing fields that still have missing values, wherein A. independent variables with a missing proportion higher than 30% are removed; B. for independent variables with a missing proportion lower than 30%, continuous independent variables are filled with the median and discrete independent variables are filled with the mode.
4. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 3, characterized in that the step of calculating the correlation among the variables adopts the formula:
Figure FDA0002779696570000021
wherein X denotes any independent variable and Y denotes the delivery fee (freight); the three independent variables with the highest positive correlation with the delivery fee, each with a correlation of not less than 0.8, are selected; if many independent variables are strongly positively correlated with the delivery fee, principal component analysis of the independent variables may be considered for dimensionality reduction; if no independent variable meets the condition, sub-sample sets are screened and the same analysis is repeated until the important influencing variables can be determined, thereby obtaining cargo weight, required vehicle length, and delivery mileage as the independent variables highly correlated with the dependent variable, the delivery fee.
5. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 4, characterized in that in the step of binning each important independent variable, the delivery mileage is divided into 7 interval levels, the cargo weight into 7 interval levels, and the required vehicle length into 6 interval levels, and the cut points and the number of levels are determined according to the business and the distribution of cargo sources.
6. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 5, characterized in that the step of analyzing distribution of delivery fees on each interval combination and determining upper and lower boundary values of each group is carried out:
suppose that within delivery mileage level i, cargo weight level j, and required vehicle length level k there are N completed logistics transaction samples with delivery fees $\mathit{freight}_{ijk}^{(1)}, \dots, \mathit{freight}_{ijk}^{(N)}$, whose mean is denoted
$$\overline{\mathit{freight}}_{ijk} = \frac{1}{N}\sum_{n=1}^{N} \mathit{freight}_{ijk}^{(n)}$$
and whose standard deviation is denoted $\sigma_{ijk}$;
then the delivery fee in this combined interval lies within
$$\left[\overline{\mathit{freight}}_{ijk} - 3\sigma_{ijk},\ \overline{\mathit{freight}}_{ijk} + 3\sigma_{ijk}\right]$$
that is, the lower bound of the delivery fee is
$$\overline{\mathit{freight}}_{ijk} - 3\sigma_{ijk}$$
and the upper bound is
$$\overline{\mathit{freight}}_{ijk} + 3\sigma_{ijk}$$
Samples whose delivery fee deviates from the mean by more than three standard deviations are all regarded as abnormal samples; on this basis, the upper and lower bounds of the delivery fee at each combination of levels are determined, forming a preliminary filter screen.
7. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 6, characterized in that the step of correcting the filter screen using the positive correlation comprises: for logistics orders with the same delivery mileage level and the same required vehicle length level, the lower bound of the delivery fee should not decrease as the cargo weight level increases; for logistics orders with the same cargo weight level and the same required vehicle length level, the lower bound of the delivery fee should not decrease as the delivery mileage increases; conversely, the upper bound of the delivery fee should not increase as the level decreases, and at every combination of levels the upper bound of the delivery fee is not less than the lower bound;
correcting the filter screen preliminarily formed in the previous step on the same training set; traversing the levels of the three independent variables in turn and correcting the upper and lower bounds of the delivery fee step by step, thereby determining the lower and upper bounds of the delivery fee in each grid cell, so that whether a logistics order is abnormal is judged by whether its delivery fee falls within the upper and lower bounds given by the learned monotone positive correlation filter screen model; and correcting the parameters of the filter screen again using the data set from which the abnormal samples have been removed, forming the final optimized filter screen model.
8. The abnormal logistics order detection method through the establishment of the monotone positive correlation filter screen according to claim 7, characterized in that the step of establishing the delivery fee prediction model comprises: manually checking the abnormal order samples detected by the monotone positive correlation filter screen model, investigating and analyzing the cause and reasonableness of each anomaly, and preliminarily determining the effect of the filter screen detection method by comparing the two sets of results through the coverage rate, which is defined as follows:
Figure FDA0002779696570000031
wherein # represents count;
using the control variable method, the same data is used in two ways: 1) a regression model for predicting the delivery fee is built directly; 2) a filter screen model is first built from the data and the abnormal transaction samples are screened out, and then a delivery fee prediction model is built; the Accuracy of the predicted delivery fee under the two regression models is compared to judge the necessity and effectiveness of the filter screen model;
using an XGBoost model as the regressor of the delivery fee prediction model, the evaluation index MAPE of the model's prediction performance is defined as:
Figure FDA0002779696570000032
when alpha belongs to [0,0.5), the penalty for the sample with larger predicted value is larger than the penalty for the sample with smaller predicted value;
when alpha is 0.5, the penalty of the sample with smaller predicted value is the same as that of the sample with larger predicted value;
when α ∈ (0.5, 1), samples with smaller predicted values are penalized more heavily than samples with larger predicted values;
for the i-th sample, the MAPE value is denoted $MAPE_i$, and Accuracy is defined as the proportion of samples whose $MAPE_i$ is less than a given threshold, i.e.
$$Accuracy = \frac{1}{N}\sum_{i=1}^{N} I\left(MAPE_i < \beta\right)$$
wherein I(·) is the indicator function, N is the sample size, and β is the maximum tolerable relative error, which serves as the threshold for evaluating accuracy; a smaller β means a stricter evaluation of the model's prediction performance, and β may be 5%, 10%, 20%, 30%, or the like; experiments show that a delivery fee prediction model built on the data with abnormal transactions removed achieves a higher Accuracy, which demonstrates the necessity and effectiveness of removing abnormal samples with the filter screen model.
CN202011282131.5A 2020-11-16 2020-11-16 Abnormal logistics order detection method by establishing monotonic positive correlation filter screen Active CN112348644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282131.5A CN112348644B (en) 2020-11-16 2020-11-16 Abnormal logistics order detection method by establishing monotonic positive correlation filter screen

Publications (2)

Publication Number Publication Date
CN112348644A true CN112348644A (en) 2021-02-09
CN112348644B CN112348644B (en) 2024-04-02

Family

ID=74362893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282131.5A Active CN112348644B (en) 2020-11-16 2020-11-16 Abnormal logistics order detection method by establishing monotonic positive correlation filter screen

Country Status (1)

Country Link
CN (1) CN112348644B (en)

Also Published As

Publication number Publication date
CN112348644B (en) 2024-04-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant