CN112215640A - Network retail platform shop sampling method based on statistical estimation - Google Patents

Publication number
CN112215640A
Authority
CN
China
Prior art keywords
data
sampling
platform
retail
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011071055.3A
Other languages
Chinese (zh)
Other versions
CN112215640B (en)
Inventor
李起昊
张强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority to CN202011071055.3A
Publication of CN112215640A
Application granted
Publication of CN112215640B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a network retail platform shop sampling method based on statistical estimation, belonging to the field of data processing and statistical estimation. The technical problem to be solved is how to sample shops on network retail platforms so that the development of the national e-commerce industry can be understood comprehensively, accurately and in a timely manner, allowing the online retail amount of each region and each category to be estimated and, from these, the online retail amount of all retail platforms nationwide. The technical scheme is as follows: data are collected for all shops that have a shop URL on the e-commerce platforms; sample data are extracted by a multi-stage sampling method that combines a comprehensive survey of key platforms with two-step multi-level sampling; the sample data are then used to estimate aggregate macro-level data while data quality is controlled. The method comprises the following stages: a data acquisition stage; a data processing stage; a sample extraction stage, in which shop samples are extracted by the sampling method combining the comprehensive survey of key platforms with two-step multi-level sampling; a sample confirmation stage; and a data estimation stage.

Description

Network retail platform shop sampling method based on statistical estimation
Technical Field
The invention relates to the field of data processing and statistical estimation, and in particular to a network retail platform shop sampling method based on statistical estimation.
Background
Statistical estimation, also called statistical inference, is a method of inferring the characteristics of a population from the results of a sample survey. Statistical inference is an important component of statistical analysis and a very important method for social researchers. Because the objects of social research are usually large populations, some of them effectively unlimited in size, investigators generally do not survey the entire social phenomenon under study. Instead, they select a representative part of the population, study it, and use statistical estimation to infer or judge the state of the population from the analysis of the sample.
In statistical inference, the problem under study has a well-defined population whose distribution is unknown or partially unknown, and conclusions about the unknown distribution are drawn from samples (observations) taken from that population. Individuals are part of the population, so local characteristics can reflect global features; however, because the population is heterogeneous and the sample is random, a sample cannot reflect the population exactly. Conclusions about the whole drawn from the sampled individuals therefore carry error and uncertainty. In theory, there are two ways to eliminate or reduce this error: 1) make the population as homogeneous as possible; 2) ensure that the sample is representative. Using an appropriate sampling method to ensure representativeness effectively controls and improves the reliability and correctness of statistical inference.
At present, shop samples for network retail platforms are mostly selected by simple random sampling or stratified sampling. Simple random sampling can ignore the particular characteristics of the sampled units, so the data may be insufficiently representative. Stratified sampling makes up for this defect, but the correctness of the analysis result depends heavily on how reasonable the stratification is, so there is still considerable room for improvement. The technical problem to be solved is therefore how to sample shops on network retail platforms so that the development of the national e-commerce industry can be understood comprehensively, accurately and in a timely manner, and so that the online retail amount of each region and each category, and in turn of all retail platforms nationwide, can be estimated.
Disclosure of Invention
The invention provides a network retail platform shop sampling method based on statistical estimation. It aims to solve the problem of how to sample shops on network retail platforms so that the development of the national e-commerce industry can be understood comprehensively, accurately and in a timely manner, the online retail amount of each region and each category can be estimated, and the online retail amount of all retail platforms nationwide can then be estimated.
The technical task of the invention is achieved as follows. In the network retail platform shop sampling method based on statistical estimation, data are collected for all shops that have a shop URL on the e-commerce platforms, sample data are extracted by a multi-stage sampling method combining a comprehensive survey of key platforms with two-step multi-level sampling, and the sample data are used to estimate aggregate macro-level data while data quality is controlled. The specific steps are as follows:
a data acquisition stage: shop information is collected on each e-commerce platform, and the sampling target is selected according to a preset confidence threshold;
a data processing stage: the list of each sampling stratum and the basic information of the sampling units are used as a unified sampling frame, and abnormal data are completed, removed or corrected according to the sampling frame information by machine learning or linear interpolation;
a sample extraction stage: shop samples are extracted by the sampling method combining the comprehensive survey of key platforms with two-step multi-level sampling;
a sample confirmation stage: a preset proportion of the samples drawn in the sample extraction stage is selected, and the shops' locations and industry information are checked to determine the reliability of the shop information;
a data estimation stage: aggregate macro-level data are estimated from the sample data.
Preferably, when the sampling target is selected, the sampling error of shop sales for each region and each category is preset within 0-5%, preferably 3%, and the sampling error of the aggregated nationwide shop sales is controlled within 0-5%.
Preferably, in the data processing stage, the list of each sampling stratum includes the region (city), the main business category, the sales interval, and the sample size required for each cross-classification of these;
the basic information of the sampling units includes, for every shop to be sampled, its name, serial number, region (city), main business category, and annual sales interval.
Preferably, the comprehensive survey of key platforms lays the foundation for obtaining the following year's sampling frame. The specific steps are as follows:
the e-commerce platforms are surveyed comprehensively, and the region information, main business category information and annual sales information of all shops on each e-commerce platform are collected for the most recent year;
a latest-region label, a latest-main-business-category label and a recent-annual-sales-interval label are added to each shop.
Preferably, the two-step multi-level sampling is a combined sampling method that takes both the 80/20 principle and the principle of representativeness into account. The specific steps are as follows:
the sampling frame is divided into two parts: shops are sorted by their sales in the previous year, and the top 10% are all collected. Because the distribution of shop sales is left-leaning with a heavy tail (most shops have low sales and a few account for most of the total), collecting this part captures the bulk of shop sales; inference from this part alone, however, would lack regional and industry representativeness;
the remaining shops are sampled by stratified sampling, and the total sample size on each e-commerce platform is determined from the preset error and confidence level (only the minimum sample size is specified; a further 1% of the sample size may be added in each stratum as spare samples);
the number of sample shops for each region and category is determined from the region and category proportions found in the comprehensive survey, and within each finest stratum shops are sorted by sales and drawn by systematic (equal-interval) sampling to obtain the sample.
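The allocation of sample shops to region and category strata can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes proportional allocation with largest-remainder rounding, and the function name `allocate_samples`, its parameters, and the stratum keys are hypothetical. The 1% spare rate follows the text above.

```python
import math

def allocate_samples(stratum_sizes, total_sample, spare_rate=0.01):
    """Split total_sample across strata in proportion to stratum size.

    stratum_sizes: dict mapping (region, category) -> number of shops found
                   in the comprehensive survey (assumed input format).
    Returns a dict mapping stratum -> (main_sample, spare_sample).
    """
    population = sum(stratum_sizes.values())
    if population == 0:
        return {}
    # Ideal (fractional) share of the sample for each stratum.
    quotas = {k: total_sample * n / population for k, n in stratum_sizes.items()}
    counts = {k: math.floor(q) for k, q in quotas.items()}
    # Hand out the units lost to rounding, largest remainder first.
    leftover = total_sample - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    result = {}
    for k, n in stratum_sizes.items():
        main = min(counts[k], n)  # cannot sample more shops than exist
        spare = min(math.ceil(main * spare_rate), n - main)  # spare (backup) shops
        result[k] = (main, spare)
    return result
```

Largest-remainder rounding is used here only so that the stratum counts add up to the requested total; any other rounding rule with that property would serve equally well.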
Preferably, data quality is controlled as follows:
in the data acquisition stage, the collected shop information is compared with the shop category and region information from the comprehensive survey to check whether each shop still meets the sampling conditions, and shops that no longer meet them are replaced by spare samples; in addition, if the number of products collected for a shop fluctuates too much, indicating missed products, the shop's products are re-collected promptly;
in the data processing stage, if data anomalies or missing products are found to be systematic, the data can be completed by machine learning or linear interpolation;
during the comprehensive survey of key platforms in the sample extraction stage, the completeness of the collected shops is verified in several ways, including using third-party data to check whether shops are missing and checking whether the published total sales are consistent with the self-collected data; if sales are too low because shops were missed, the missing shops can be added;
in the sample confirmation stage, a preset proportion of the sample is selected, the shops' locations and industry information are checked, and a statistical survey or telephone follow-up is carried out to determine whether the information published on the web page matches the actual situation, for example whether the latest region information shown online is true and whether the online sales for the last month are accurate;
in the data estimation stage, shop categories and region names must be standardized across platforms. The shop categories of different platforms cannot simply be mapped one to one: large categories must be split, and categories are unified step by step from small to large. In addition, every step of the estimation must have a sound scientific basis.
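The linear-interpolation option mentioned for the data processing stage can be illustrated with a short sketch. It assumes a shop's sales are held as an evenly spaced series with `None` marking missing periods; the function name `fill_linear` is hypothetical, and the patent does not specify how gaps at the ends of a series are handled, so they are left unfilled here.

```python
def fill_linear(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Leading and trailing gaps are left as None (no extrapolation).
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment between two neighbouring known observations.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled
```

For example, a monthly series with two missing months between known values of 100 and 130 would be filled with 110 and 120.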
Preferably, the data estimation stage is as follows:
determining the basis for estimation: according to the law of large numbers, when the sample is large enough, the distribution of statistics obtained from the sample approaches that of the population; the year-on-year growth rates and shares calculated from the large sample for each major commodity category, region and province therefore approximate those of the online retail population as a whole;
preparing the estimation sample data: detailed data are the basis of the estimation;
estimating the national online retail amount;
estimating the online retail amount for each subdivision dimension.
More preferably, the basis for estimation is determined as follows:
acquiring large sample data that reflect the overall trend and structure of the online retail industry;
estimating the national online retail amount for the current period, using the online retail amount published by the statistics bureau as the benchmark;
estimating the current-period online retail amount for each commodity category, each transaction type, each region and each province;
the estimation sample data are prepared as follows:
screening out, on each platform, the shops that can be compared across the same period of consecutive years, so that the data are comparable;
removing shops with abnormal pull rates within each commodity category, so as to avoid interference from outliers and obtain comparable large sample data that can be aggregated and analyzed.
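The two preparation steps above might look like the sketch below. The patent does not define how an abnormal rate is detected, so this example makes two labeled assumptions: the year-on-year growth rate of each shop stands in for the rate being screened, and a shop is treated as abnormal when its growth lies more than `k` median absolute deviations from the median. The function name `comparable_sample` and its parameters are hypothetical; in practice it would be applied separately within each commodity category.

```python
from statistics import median

def comparable_sample(current, previous, k=5.0):
    """Keep shops observed in both periods and drop growth-rate outliers.

    current, previous: dict shop_id -> sales for this period and for the
                       same period of the previous year (assumed format).
    Returns dict shop_id -> (previous_sales, current_sales).
    """
    # Comparable shops: positive sales recorded in both periods.
    shops = [s for s in current if previous.get(s, 0) > 0 and current[s] > 0]
    growth = {s: current[s] / previous[s] - 1.0 for s in shops}
    if not growth:
        return {}
    med = median(growth.values())
    mad = median(abs(g - med) for g in growth.values())
    # Assumed outlier rule: keep shops within k MADs of the median growth.
    keep = [s for s in shops if mad == 0 or abs(growth[s] - med) <= k * mad]
    return {s: (previous[s], current[s]) for s in keep}
```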
Preferably, the national online retail amount is estimated as follows:
estimating the online retail amount of physical goods: using the large sample data of the online shopping platforms with virtual goods removed, the current-period year-on-year growth rate of the online retail amount of physical goods is calculated, and the current-period online retail amount of physical goods is estimated using the historical online retail amount of physical goods published by the statistics bureau as the benchmark;
estimating the online retail amount of non-physical goods: using the large sample data of the life-service platforms, the current-period year-on-year growth rate of the online retail amount of non-physical goods is calculated, and the current-period online retail amount of non-physical goods is estimated using the historical online retail amount of non-physical goods published by the statistics bureau as the benchmark;
estimating the national online retail amount: the national online retail amount is the sum of the online retail amounts of physical and non-physical goods.
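The benchmark-and-growth estimation above reduces to simple arithmetic, sketched here under the assumption that the sample's year-on-year growth rate is applied directly to the published figure for the same period of the previous year. The function names and argument names are hypothetical, and the figures in any call are illustrative only.

```python
def estimate_from_benchmark(sample_current, sample_previous, published_previous):
    """Scale a published historical amount by the sample's year-on-year growth."""
    growth = sample_current / sample_previous - 1.0  # year-on-year growth rate
    return published_previous * (1.0 + growth)

def estimate_national(physical, non_physical):
    """Sum the physical-goods and non-physical-goods estimates.

    Each argument is a tuple:
    (sample_current, sample_previous, published_previous).
    """
    physical_amount = estimate_from_benchmark(*physical)
    non_physical_amount = estimate_from_benchmark(*non_physical)
    return physical_amount, non_physical_amount, physical_amount + non_physical_amount
```

The physical-goods sample is assumed to have had virtual goods removed beforehand, as the text requires.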
Preferably, the online retail amount for each subdivision dimension is estimated as follows:
calculating, from the large sample data, the share of each commodity category in the online retail amount of physical goods for each period;
estimating the online retail amount of each commodity category for the current and historical periods by combining these shares with the estimated current-period online retail amount of physical goods and the historical online retail amount of physical goods published by the statistics bureau;
calculating the year-on-year growth rate of each commodity category.
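A minimal sketch of the category breakdown follows. It assumes the sample shares are applied unchanged to the physical-goods totals; the function name `category_breakdown` and the dictionary inputs are hypothetical.

```python
def category_breakdown(sample_current, sample_previous, total_current, total_previous):
    """Estimate per-category amounts and year-on-year growth from sample shares.

    sample_current, sample_previous: dict category -> sample sales.
    total_current: estimated current-period physical-goods online retail amount.
    total_previous: published historical physical-goods online retail amount.
    Returns dict category -> (amount_current, amount_previous, growth or None).
    """
    sum_current = sum(sample_current.values())
    sum_previous = sum(sample_previous.values())
    result = {}
    for category, sales in sample_current.items():
        # Share of the category in the sample, for each period.
        share_current = sales / sum_current
        share_previous = sample_previous.get(category, 0.0) / sum_previous
        amount_current = share_current * total_current
        amount_previous = share_previous * total_previous
        growth = amount_current / amount_previous - 1.0 if amount_previous > 0 else None
        result[category] = (amount_current, amount_previous, growth)
    return result
```

The same function could be reused for transaction types, regions or provinces by changing what the dictionary keys represent.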
The network retail platform shop sampling method based on statistical estimation has the following advantages: by sampling all shops that have a shop URL on the e-commerce platforms (14 e-commerce platforms including Tmall, JD.com, Suning, Taobao, Yihaodian and Gome), the invention makes it possible to understand the development of the national e-commerce industry comprehensively, accurately and in a timely manner, and to monitor objectively the development and trends of each region and each main business category of e-commerce, which better serves the research and formulation of e-commerce industry policy. Data are collected from the e-commerce platforms according to a statistical sampling method, the online retail amount of each region and each category is estimated, and the online retail amount of all e-commerce platforms is then estimated.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block diagram of the network retail platform shop sampling method based on statistical estimation;
FIG. 2 is a block diagram of the data estimation stage.
Detailed Description
The network retail platform shop sampling method based on statistical estimation is described in detail with reference to the drawings and specific embodiments.
Example 1:
As shown in FIG. 1, in the network retail platform shop sampling method based on statistical estimation, data are collected for all shops that have a shop URL on the e-commerce platforms, sample data are extracted by a multi-stage sampling method combining a comprehensive survey of key platforms with two-step multi-level sampling, and the sample data are then used to estimate aggregate macro-level data while data quality is controlled. The specific steps are as follows:
S1, data acquisition stage: shop information is collected on each e-commerce platform, and the sampling target is selected according to a preset confidence threshold of 95%;
S2, data processing stage: the list of each sampling stratum and the basic information of the sampling units are used as a unified sampling frame, and abnormal data are completed, removed or corrected according to the sampling frame information by machine learning or linear interpolation;
S3, sample extraction stage: shop samples are extracted by the sampling method combining the comprehensive survey of key platforms with two-step multi-level sampling;
S4, sample confirmation stage: a preset proportion of the samples drawn in the sample extraction stage is selected, and the shops' locations and industry information are checked to determine the reliability of the shop information;
S5, data estimation stage: aggregate macro-level data are estimated from the sample data.
In this embodiment, data quality is controlled as follows:
(1) in the data acquisition stage, the collected shop information is compared with the shop category and region information from the comprehensive survey to check whether each shop still meets the sampling conditions, and shops that no longer meet them are replaced by spare samples; in addition, if the number of products collected for a shop fluctuates too much, indicating missed products, the shop's products are re-collected promptly;
(2) in the data processing stage, if data anomalies or missing products are found to be systematic, the data can be completed by machine learning or linear interpolation;
(3) during the comprehensive survey of key platforms in the sample extraction stage, the completeness of the collected shops is verified in several ways, including using third-party data to check whether shops are missing and checking whether the published total sales are consistent with the self-collected data; if sales are too low because shops were missed, the missing shops can be added;
(4) in the sample confirmation stage, a preset proportion of the sample is selected, the shops' locations and industry information are checked, and a statistical survey or telephone follow-up is carried out to determine whether the information published on the web page matches the actual situation, for example whether the latest region information shown online is true and whether the online sales for the last month are accurate;
(5) in the data estimation stage, shop categories and region names must be standardized across platforms. The shop categories of different platforms cannot simply be mapped one to one: large categories must be split, and categories are unified step by step from small to large. In addition, every step of the estimation must have a sound scientific basis.
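The small-to-large category unification described for the data estimation stage can be sketched with two lookup tables: one from each platform's finest category to a unified small category, and one rolling small categories up into major ones. Every platform name and category name below is hypothetical and only illustrates the shape of the mapping.

```python
# Hypothetical mapping tables; real platform category names will differ.
FINE_TO_UNIFIED = {
    ("platform_a", "dresses"): "women's clothing",
    ("platform_a", "men's shirts"): "men's clothing",
    ("platform_b", "ladies' wear"): "women's clothing",
    ("platform_b", "snacks"): "packaged food",
}
UNIFIED_TO_MAJOR = {
    "women's clothing": "apparel",
    "men's clothing": "apparel",
    "packaged food": "food",
}

def standardize(platform, fine_category):
    """Return (unified_small, major), or None if the category is unmapped."""
    key = (platform, fine_category.strip().lower())
    small = FINE_TO_UNIFIED.get(key)
    if small is None:
        return None  # flag for manual review rather than guessing
    return small, UNIFIED_TO_MAJOR[small]

def roll_up(records):
    """Sum sales by major category.

    records: iterable of (platform, fine_category, sales).
    Returns (totals by major category, list of unmapped (platform, category)).
    """
    totals, unmapped = {}, []
    for platform, fine, sales in records:
        std = standardize(platform, fine)
        if std is None:
            unmapped.append((platform, fine))
            continue
        totals[std[1]] = totals.get(std[1], 0) + sales
    return totals, unmapped
```

Mapping at the finest level first means a broad platform category never has to be assigned wholesale to a single unified category, which is the situation the text warns against.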
In this embodiment, when the sampling target is selected in step S1, the sampling error of shop sales for each region and each category is preset within 3% (within 5% for the few regions that have very few shops), and the sampling error of the aggregated nationwide shop sales is controlled within 3%.
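The patent states the error and confidence targets but not the sample-size formula. As an illustration only, the sketch below uses the standard formula for estimating a mean or total to a given relative error, n0 = (z * CV / e)^2, with a finite population correction. The coefficient of variation `cv` of shop sales within the stratum is an assumed input, and the function name `min_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def min_sample_size(population, cv, rel_error=0.03, confidence=0.95):
    """Minimum sample size for a target relative error at a confidence level.

    population: number of shops in the stratum or platform.
    cv: coefficient of variation of shop sales (assumed known or estimated
        from the comprehensive survey).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n0 = (z * cv / rel_error) ** 2          # infinite-population sample size
    n = n0 / (1.0 + n0 / population)        # finite population correction
    return min(population, math.ceil(n))
```

Because shop sales are heavy-tailed, the coefficient of variation of the full frame can be large; removing the top 10% of shops for complete collection first, as the method does, lowers it for the remaining strata.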
In this embodiment, in the data processing stage of step S2, the list of each sampling stratum includes the region (city), the main business category, the sales interval, and the sample size required for each cross-classification of these;
the basic information of the sampling units includes, for every shop to be sampled, its name, serial number, region (city), main business category, and annual sales interval.
In this embodiment, in the sample extraction stage of step S3, the comprehensive survey of key platforms lays the foundation for obtaining the following year's sampling frame. The specific steps are as follows:
S301-1, the e-commerce platforms are surveyed comprehensively, and the region information, main business category information and annual sales information of all shops on each e-commerce platform are collected for the most recent year;
S301-2, a latest-region label, a latest-main-business-category label and a recent-annual-sales-interval label are added to each shop.
In the sample extraction stage of step S3, the two-step multi-level sampling is a combined sampling method that takes both the 80/20 principle and the principle of representativeness into account. The specific steps are as follows:
S302-1, the sampling frame is divided into two parts: shops are sorted by their sales in the previous year, and the top 10% are all collected. Because the distribution of shop sales is left-leaning with a heavy tail (most shops have low sales and a few account for most of the total), collecting this part captures the bulk of shop sales; inference from this part alone, however, would lack regional and industry representativeness;
S302-2, the remaining shops are sampled by stratified sampling, and the total sample size on each e-commerce platform is determined from the preset error and confidence level (only the minimum sample size is specified; a further 1% of the sample size may be added in each stratum as spare samples);
S302-3, the number of sample shops for each region and category is determined from the region and category proportions found in the comprehensive survey, and within each finest stratum shops are sorted by sales and drawn by systematic (equal-interval) sampling to obtain the sample.
Example: taking the monthly estimation of national e-commerce totals in 2017 as an example, in 2016 a total of 14 main e-commerce platforms (accounting for about 98% of the total e-commerce market share), together with each platform's share of the total, were identified by expert evaluation and an anonymous questionnaire survey of the e-commerce industry. (The sample approximates the population: estimating 100% from 98% introduces approximately zero error.)
Comprehensive survey of key platforms: the 14 e-commerce platforms are surveyed comprehensively; the region information, main business category information and annual sales information of all shops on each platform for the most recent year are collected; and a latest-region label, a latest-main-business-category label and a recent-annual-sales-interval label are added to each shop. This provides the sampling frame for the subsequent multi-level sampling.
Two-step multi-level sampling: the sampling frame is divided into two parts. Shops are sorted by their sales in the previous year and the top 10% are all collected; because the distribution of shop sales is left-leaning with a heavy tail, this part captures the bulk of shop sales, but inference from it alone would lack regional and industry representativeness. The remaining shops are therefore sampled by stratified sampling: the total sample size on a given platform is determined from the acceptable error and confidence level (only the minimum sample size is specified; a further 1% of the sample size may be added in each stratum as spare samples), the number of sample shops for each region and category is determined from the region and category proportions found in the comprehensive survey, and within each finest stratum shops are sorted by sales and drawn by systematic (equal-interval) sampling.
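The two-step procedure just described can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: shops are plain dictionaries with hypothetical keys `id`, `region`, `category` and `sales` (previous-year sales), strata are formed by region and category only, and a single sampling `fraction` stands in for the per-stratum counts that the method derives from the preset error and confidence level.

```python
import math
import random
from collections import defaultdict

def systematic_sample(items, n, rng):
    """Equal-interval sample of n items from an ordered list, random start."""
    if n <= 0 or not items:
        return []
    n = min(n, len(items))
    step = len(items) / n
    start = rng.random() * step
    last = len(items) - 1
    return [items[min(int(start + i * step), last)] for i in range(n)]

def two_step_sample(shops, fraction, top_share=0.10, seed=0):
    """Return (take_all, sampled): the fully collected top shops and the
    systematic sample drawn from the remaining shops, stratum by stratum."""
    rng = random.Random(seed)
    ranked = sorted(shops, key=lambda s: s["sales"], reverse=True)
    # Step 1: collect every shop in the top 10% by previous-year sales.
    n_top = math.ceil(round(len(ranked) * top_share, 9))
    take_all, rest = ranked[:n_top], ranked[n_top:]
    # Step 2: stratify the remainder; each stratum stays sorted by sales.
    strata = defaultdict(list)
    for shop in rest:
        strata[(shop["region"], shop["category"])].append(shop)
    sampled = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sampled.extend(systematic_sample(members, n, rng))
    return take_all, sampled
```

Sorting by sales before equal-interval selection spreads the sample across the whole sales range of each stratum, which is the usual reason for pairing an ordered frame with systematic sampling.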
In this embodiment, the data estimation stage in step S5 is specifically as follows:
s501, determining an estimation basis: according to the theorem of majorities, under the condition that the sample is large enough, the statistic distribution obtained according to the sample is asymptotically to the overall distribution; the large sample data is used for calculating the same ratio and the occupation ratio of each commodity large class, each region and province which are similar to the same ratio and the occupation ratio of each commodity large class, each region and province of the network retail population; the method comprises the following specific steps:
s50101, acquiring large sample data reflecting the overall trend and structure of the network retail industry;
s50102, calculating the national online retail amount in the current period by taking the online retail amount published by the statistical bureau as a reference;
s50103, calculating the current commodity categories, the current transaction types, the current district and the current province online retail amount;
s502, preparing estimation sample data: detail data is the basis for the calculations; the method comprises the following specific steps:
s50201, screening out the shops which can be compared with each other at the same period on each platform, and enabling data to be comparable;
s50202, removing shops with abnormal pulling rates in the commodity categories to obtain comparable large sample data capable of being analyzed in a summary mode, and avoiding interference of abnormal values.
S503, calculating the national network retail amount; the method comprises the following specific steps:
s50301, calculating the actual commodity online retail amount: eliminating large sample data of virtual commodities by using an online shopping platform, calculating the current-period comparison acceleration of the online retail amount of the physical commodities, and calculating the online retail amount of the current-period physical commodities by taking the online retail amount of historical physical commodities published by a statistical bureau as a reference;
s50302, calculating the online retail amount of non-physical commodities: calculating the current on-line retail amount of non-physical commodities by using the large sample data of the life service platform and calculating the current on-line retail amount of the non-physical commodities on the basis of the historical on-line retail amount of the non-physical commodities published by the statistical bureau;
s50303, national online retail amount calculation: the sum of the online retail amount of the physical commodities and the online retail amount of the non-physical commodities is the national online retail amount;
S504, calculating the online retail amount of the subdivided dimensions; the specific steps are as follows:
S50401, using the large sample data to calculate, for each period, the share of each commodity category in the online retail amount of physical commodities;
S50402, calculating the online retail amount of each commodity category in the current period and in the historical period, from the current-period online retail amount of physical commodities calculated above and the historical-period online retail amount of physical commodities published by the statistics bureau;
S50403, calculating the year-on-year growth rate of each commodity category; the online retail amount and year-on-year growth rate of each transaction type, region, and province are calculated in the same way.
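Steps S50401 to S50403 amount to allocating a known total by sample shares and comparing two periods. The following minimal sketch assumes the shares are simple ratios of sample sales; the function names, category labels, and figures are hypothetical.

```python
def allocate_by_share(sample_by_category, total):
    """S50401/S50402: split a total amount by each category's sample share."""
    sample_total = sum(sample_by_category.values())
    return {cat: total * sales / sample_total
            for cat, sales in sample_by_category.items()}


def category_growth(sample_cur, sample_prev, total_cur, total_prev):
    """S50403: amount and year-on-year growth rate for each category."""
    cur = allocate_by_share(sample_cur, total_cur)
    prev = allocate_by_share(sample_prev, total_prev)
    return {cat: (cur[cat], cur[cat] / prev[cat] - 1.0)
            for cat in cur if cat in prev and prev[cat] > 0}


# Hypothetical sample sales per category and physical-commodity totals.
result = category_growth(
    sample_cur={"apparel": 30.0, "food": 70.0},
    sample_prev={"apparel": 40.0, "food": 60.0},
    total_cur=1200.0,    # current-period amount calculated in S503
    total_prev=1000.0,   # historical amount published by the statistics bureau
)
```

The same two functions apply unchanged to transaction types, regions, or provinces by changing the dictionary keys.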
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A network retail platform store sampling method based on statistical estimation, characterized in that the method collects data on all stores from the store addresses (URLs) on the e-commerce platforms, extracts sample data using a multi-stage sampling method that combines a comprehensive survey of key platforms with two-step multi-level sampling, and uses the sample data to estimate the overall macro data and to control data quality; the specific steps are as follows:
a data acquisition stage: collecting store information on each e-commerce platform, and selecting the sampling target according to a preset confidence threshold;
a data processing stage: using the list of each sampling layer and the basic information of the sampling units as a unified sampling frame, and completing, removing, or correcting abnormal data according to the sampling frame information by machine learning or linear interpolation;
a sample extraction stage: extracting store samples using a sampling method that combines a comprehensive survey of key platforms with two-step multi-level sampling;
a sample determination stage: drawing a preset proportion of the samples extracted in the sample extraction stage, and checking the location and industry information of those stores to determine the reliability of the store information;
a data calculation stage: calculating the overall macro data from the sample data.
2. The statistical estimation-based network retail platform store sampling method as claimed in claim 1, wherein, when selecting the sampling target, the sampling error of store sales by region and by category is preset to 0-5%, and the sampling error of the nationwide store sales obtained by aggregation is controlled within 0-5%.
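The claims state the error bound and a confidence threshold but give no formula for turning them into a sample size. As a hedged illustration only, the sketch below uses the textbook sample-size formula for estimating a mean within a relative error, with a finite population correction. The coefficient of variation `cv`, the function name, and the example values are assumptions that do not come from the source.

```python
import math
from statistics import NormalDist


def required_sample_size(population, rel_error=0.05, confidence=0.95, cv=1.0):
    """Stores needed so the estimated mean sales fall within rel_error.

    cv is the assumed coefficient of variation of store sales
    (standard deviation divided by mean); it is a hypothetical input.
    """
    # Two-sided normal quantile for the chosen confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n0 = (z * cv / rel_error) ** 2            # size for an infinite population
    n = n0 / (1.0 + n0 / population)          # finite population correction
    return min(population, math.ceil(n))


n_5pct = required_sample_size(10_000, rel_error=0.05)
n_2pct = required_sample_size(10_000, rel_error=0.02)
```

A tighter error bound or a more dispersed sales distribution raises the required sample size, which is one reason the highest-selling stores are usually surveyed in full rather than sampled.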
3. The statistical estimation-based network retail platform store sampling method as claimed in claim 1, wherein, in the data processing stage, the list of each sampling layer comprises the region, the main business type, the sales interval, and the sample size required for each cross-classification of these;
the basic information of the sampling units comprises the name, serial number, region, type, and annual sales interval of every store to be sampled.
4. The statistical estimation-based network retail platform store sampling method as claimed in claim 1, wherein the comprehensive survey of key platforms lays the foundation for obtaining the next year's sampling frame, and comprises the following specific steps:
comprehensively surveying the e-commerce platforms, and acquiring the region information, main business type information, and annual sales information of all stores on each e-commerce platform for the most recent period;
adding to each store its latest region label, latest main business type label, and recent sales interval label.
5. The statistical estimation-based network retail platform store sampling method as claimed in claim 1, wherein the two-step multi-level sampling is a combined sampling method that takes into account both the "80/20 principle" and the "principle of representativeness", and comprises the following specific steps:
dividing the sampling frame into two parts: sorting the stores by the previous year's sales and taking all of the top 10% of stores into the sample, the distribution of store sales being left-skewed with a heavy tail;
performing stratified sampling on the remaining stores, the total sample size for the e-commerce platform being determined from the preset error and confidence level;
and determining the number of sample stores for each region and type according to the regions and types established in the comprehensive survey, and, within each finest stratum, ordering the stores by sales and drawing the sample by equal-interval (systematic) sampling.
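The two-step procedure of claim 5 can be sketched as follows. Several details are assumptions made for illustration: the remaining sample is allocated to strata in proportion to stratum size, each stratum receives at least one store, and the systematic draw starts at the midpoint of the first interval instead of at a random position. The function name and the tuple layout are hypothetical.

```python
import math


def two_step_sample(shops, n_rest, top_frac=0.10):
    """shops: list of (shop_id, stratum, last_year_sales) tuples.

    Returns (census, sample): the top stores taken in full, and the
    stores drawn from the remaining strata by systematic sampling.
    """
    # Step 1: rank by last year's sales and take the top 10% in full.
    ranked = sorted(shops, key=lambda s: s[2], reverse=True)
    n_top = math.ceil(len(ranked) * top_frac)
    census, rest = ranked[:n_top], ranked[n_top:]

    # Step 2: group the remaining stores by their finest stratum.
    strata = {}
    for shop in rest:                 # order by sales is preserved
        strata.setdefault(shop[1], []).append(shop)

    sample = []
    for members in strata.values():
        # Proportional allocation (an assumption), at least one per stratum.
        n_h = max(1, round(n_rest * len(members) / len(rest)))
        n_h = min(n_h, len(members))
        step = len(members) / n_h     # sampling interval, always >= 1
        # Equal-interval draw over the sales-ordered list.
        sample.extend(members[int(i * step + step / 2)] for i in range(n_h))
    return census, sample


# Hypothetical frame: 20 stores alternating between two strata.
frame = [(i, "A" if i % 2 == 0 else "B", 1000 - 10 * i) for i in range(20)]
census, sample = two_step_sample(frame, n_rest=6)
```

Because the draw is taken over the sales-ordered list, each stratum contributes stores from the high, middle, and low parts of its sales range.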
6. The statistical estimation-based network retail platform store sampling method according to any one of claims 1-5, wherein controlling data quality is specifically as follows:
in the data acquisition stage, the collected store information is compared with the store classification and region information from the comprehensive survey to check whether each store still meets the sampling conditions, and stores that no longer meet them are replaced with reserve samples; if the number of commodities fluctuates excessively during collection, indicating missed commodities, the commodities of that store are re-collected promptly;
in the data processing stage, if data anomalies or missing commodities are found to be systematic, they may be completed by machine learning or linear interpolation;
during the comprehensive survey of key platforms in the sample extraction stage, the completeness of the collected stores is verified in several ways, including using third-party data to check whether stores are missing and checking whether the published total sales are consistent with the self-collected data; if the sales are too low because stores were missed, the missing stores are added;
in the sample determination stage, a preset proportion of the sample is drawn, the location and industry information of those stores is checked, and a statistical survey or follow-up telephone call is carried out to determine whether the information published on the web page is consistent with the actual information, for example whether the latest region information online is genuine and whether the online sales for the most recent month are accurate;
in the data calculation stage, store categories and region names need to be standardized across platforms; because the store types of different platforms cannot simply be mapped one to one, the large categories are split and the categories are unified step by step from small to large; in addition, every step of the calculation needs a scientific basis.
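The linear-interpolation option named for the data processing stage can be illustrated with a short sketch. It assumes a store's sales are held as an evenly spaced series in which missing periods are marked `None`; the function name and the choice to leave leading and trailing gaps unfilled are assumptions.

```python
def fill_linear(series):
    """Fill None gaps that lie between two known values on a straight line.

    Gaps at the start or end of the series have no neighbour on one side
    and are left as None rather than extrapolated.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        span = series[right] - series[left]
        for i in range(left + 1, right):
            filled[i] = series[left] + span * (i - left) / (right - left)
    return filled


# Hypothetical monthly sales with two missing months in the middle.
monthly = [None, 10.0, None, None, 16.0, None]
repaired = fill_linear(monthly)
```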
7. The statistical estimation-based network retail platform store sampling method according to claim 1, wherein the data calculation stage is specifically as follows:
determining the basis for calculation: according to the law of large numbers, when the sample is large enough, the distribution of the statistics obtained from the sample approaches the population distribution; the year-on-year changes and shares of each major commodity category, region, and province calculated from the large sample data approximate those of the online retail population;
preparing the calculation sample data: detailed data is the basis of the calculation;
calculating the national online retail amount;
and calculating the online retail amount of the subdivided dimensions.
8. The statistical estimation-based network retail platform store sampling method according to claim 7, wherein determining the basis for calculation is specifically as follows:
acquiring large sample data reflecting the overall trend and structure of the online retail industry;
calculating the national online retail amount for the current period, taking the online retail amount published by the statistics bureau as the baseline;
calculating the current-period online retail amount for each commodity category, transaction type, region, and province;
preparing the calculation sample data is specifically as follows:
screening out the shops on each platform whose data can be compared across the same periods, so that the data is comparable;
and removing shops with abnormal pull rates within each commodity category, so as to obtain comparable large sample data that can be aggregated and analyzed without interference from outliers.
9. The statistical estimation-based network retail platform store sampling method according to claim 7, wherein calculating the national online retail amount is specifically as follows:
calculating the online retail amount of physical commodities: using the large sample data of the online shopping platforms with virtual commodities removed, calculating the current-period year-on-year growth rate of the online retail amount of physical commodities, and then calculating the current-period online retail amount of physical commodities, taking the historical online retail amount of physical commodities published by the statistics bureau as the baseline;
calculating the online retail amount of non-physical commodities: using the large sample data of the life service platforms, calculating the current-period year-on-year growth rate of the online retail amount of non-physical commodities, and then calculating the current-period online retail amount of non-physical commodities, taking the historical online retail amount of non-physical commodities published by the statistics bureau as the baseline;
calculating the national online retail amount: the sum of the online retail amount of physical commodities and the online retail amount of non-physical commodities is the national online retail amount.
10. The statistical estimation-based network retail platform store sampling method according to any one of claims 7 to 9, wherein calculating the online retail amount of the subdivided dimensions is specifically as follows:
using the large sample data to calculate, for each period, the share of each commodity category in the online retail amount of physical commodities;
calculating the online retail amount of each commodity category in the current period and in the historical period, by combining the current-period online retail amount of physical commodities calculated above with the historical-period online retail amount of physical commodities published by the statistics bureau;
and calculating the year-on-year growth rate of each commodity category.
CN202011071055.3A 2020-10-09 2020-10-09 Network retail platform shop sampling method based on statistical estimation Active CN112215640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011071055.3A CN112215640B (en) 2020-10-09 2020-10-09 Network retail platform shop sampling method based on statistical estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011071055.3A CN112215640B (en) 2020-10-09 2020-10-09 Network retail platform shop sampling method based on statistical estimation

Publications (2)

Publication Number Publication Date
CN112215640A true CN112215640A (en) 2021-01-12
CN112215640B CN112215640B (en) 2022-07-26

Family

ID=74052925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011071055.3A Active CN112215640B (en) 2020-10-09 2020-10-09 Network retail platform shop sampling method based on statistical estimation

Country Status (1)

Country Link
CN (1) CN112215640B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960927A (en) * 2018-07-12 2018-12-07 山东汇贸电子口岸有限公司 A kind of e-tailing development index system based on web crawlers and economic statistics
CN109961324A (en) * 2019-03-19 2019-07-02 山东浪潮云信息技术有限公司 A kind of electric business enterprise stamps the standardization processing method and system of region label
CN110458199A (en) * 2019-07-16 2019-11-15 中国传媒大学 Based on the kohonen neural network clustering methods of sampling

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919865A (en) * 2021-09-26 2022-01-11 浪潮卓数大数据产业发展有限公司 Network retail amount statistical method
CN113919865B (en) * 2021-09-26 2023-07-07 浪潮卓数大数据产业发展有限公司 Network retail sales statistics method
CN114282951A (en) * 2021-12-28 2022-04-05 浪潮卓数大数据产业发展有限公司 Network retail prediction method, equipment and medium
CN114282951B (en) * 2021-12-28 2024-05-14 浪潮卓数大数据产业发展有限公司 Network retail prediction method, device and medium
CN114387017A (en) * 2022-01-04 2022-04-22 河南省烟草公司开封市公司 Retail customer demand information sample selection method and device based on customer domain
CN116701914A (en) * 2023-06-21 2023-09-05 广东星云开物科技股份有限公司 Hardware equipment abnormal use identification method, device, storage device and system

Also Published As

Publication number Publication date
CN112215640B (en) 2022-07-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant