CN116610662B - Filling method, filling device, computer equipment and medium for missing classification data - Google Patents

Publication number
CN116610662B
Authority
CN
China
Prior art keywords
classification
identification
data
filling
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310869517.3A
Other languages
Chinese (zh)
Other versions
CN116610662A (en
Inventor
董方
金宏伟
闫锋
常星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinrui Tongchuang Beijing Technology Co ltd
Original Assignee
Jinrui Tongchuang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinrui Tongchuang Beijing Technology Co ltd filed Critical Jinrui Tongchuang Beijing Technology Co ltd
Priority to CN202310869517.3A priority Critical patent/CN116610662B/en
Publication of CN116610662A publication Critical patent/CN116610662A/en
Application granted granted Critical
Publication of CN116610662B publication Critical patent/CN116610662B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application provide a filling method, a filling device, computer equipment, and a medium for missing classification data, relating to the technical field of data processing. The method comprises the following steps: acquiring classification initial data; expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays, the arrays forming a first classification set, and de-duplicating the arrays of the first classification set to generate a second classification set; forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, the triplets forming a third classification set; determining the maximum classification level number corresponding to each identification name in the third classification set, and taking the array included in the triplet having the maximum classification level number as a reference classification; and filling the identification classifications missing from the classification initial data by using the reference classification. With this scheme, missing classifications are filled in from the initial data itself, which improves the accuracy of the classification data.

Description

Filling method, filling device, computer equipment and medium for missing classification data
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computer device, and a medium for filling missing classification data.
Background
When commodity data is collected from an external platform, the external platform may hide the classification of some commodities. In particular, when the classification data consists of several mutually related levels, a missing intermediate classification can throw the data of the whole commodity into confusion and greatly reduce the accuracy of the data.
Disclosure of Invention
In view of this, the embodiment of the application provides a filling method of missing classification data, so as to solve the technical problem of inaccurate commodity data caused by data classification loss in the prior art. The method comprises the following steps:
acquiring classification initial data and deleting abnormal data in the classification initial data, wherein the first column of the classification initial data is an identification name, the second column to the last column are the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of objects of the same kind, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to combinations of different numbers of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, forming a first classification set from the arrays corresponding to all the identification names, and de-duplicating the arrays of the first classification set to generate a second classification set;
forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, wherein the triplets form a third classification set, and the third classification set is used for counting the classification level corresponding to each identification classification;
judging the maximum classification level number corresponding to each identification name in the third classification set, and taking an array included in the triplet where the maximum classification level number is located as a reference classification;
and filling the identification classification missing in the classification initial data by using the reference classification, and generating filled data.
The embodiment of the application also provides a filling device for missing classified data, so as to solve the technical problem of inaccurate commodity data caused by the classified loss of the data in the prior art. The device comprises:
a classification initial data acquisition module, configured to acquire classification initial data and delete abnormal data in the classification initial data, wherein the first column of the classification initial data is an identification name, the second column to the last column are the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of objects of the same kind, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
a classification data expansion module, configured to expand the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to combinations of different numbers of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, to form a first classification set from the arrays corresponding to all the identification names, and to de-duplicate the arrays of the first classification set to generate a second classification set;
a classification layer number statistics module, configured to form a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, wherein the triplets form a third classification set, and the third classification set is used for counting the classification level corresponding to each identification classification;
the reference classification generation module is used for judging the maximum classification level number corresponding to each identification name in the third classification set, and taking an array included in the triplet where the maximum classification level number is located as a reference classification;
and the data filling module is used for filling the identification classification missing in the classification initial data by utilizing the reference classification and generating filled data.
The embodiment of the application also provides computer equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the method for filling any missing classified data is realized when the processor executes the computer program, so that the technical problem of inaccurate commodity data caused by the classified loss of data in the prior art is solved.
The embodiment of the application also provides a computer readable storage medium which stores a computer program for executing the method for filling any missing classified data, so as to solve the technical problem of inaccurate commodity data caused by the data classification loss in the prior art.
Compared with the prior art, the beneficial effects achievable by at least one of the technical solutions adopted in the embodiments of this specification include at least the following:
starting from the classification initial data, the multi-level identification classifications corresponding to the identification name of each row are expanded into a plurality of arrays according to combinations of different numbers of the identification classifications, and a second classification set is generated after de-duplication, so that all possible classification hierarchy relations are obtained; a triplet is formed from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, so as to count the classification level corresponding to each identification classification; the deepest level of each identification classification is found by determining the maximum classification level number, and the array at that deepest level is taken as the reference classification; and the missing classification data is filled by using the reference classification. The method thus fills missing classification data from the classification initial data itself, solves the problem of inaccurate commodity data caused by the loss of part of a commodity's classification data, and keeps the commodity data accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a missing categorical data filling method according to an embodiment of the present application;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present application;
FIG. 3 is a block diagram of a missing classification data filling device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application with reference to specific examples. The described embodiments are only some, not all, of the embodiments of the application. The application may also be practiced or applied through other, different embodiments, and the details in this description may be modified or varied from different viewpoints and for different applications without departing from the spirit and scope of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
In an embodiment of the present application, a method for filling missing classification data is provided, as shown in fig. 1, where the method includes:
step S101: acquiring classification initial data and deleting abnormal data in the classification initial data, wherein the first column of the classification initial data is an identification name, the second column to the last column are the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of objects of the same kind, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
step S102: expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to combinations of different numbers of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, forming a first classification set from the arrays corresponding to all the identification names, and de-duplicating the arrays of the first classification set to generate a second classification set;
step S103: forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, wherein the triplets form a third classification set, and the third classification set is used for counting the classification level corresponding to each identification classification;
step S104: judging the maximum classification level number corresponding to each identification name in the third classification set, and taking an array included in the triplet where the maximum classification level number is located as a reference classification;
step S105: and filling the identification classification missing in the classification initial data by using the reference classification, and generating filled data.
Specifically, abnormal data in the classification initial data is deleted as follows. For example, if identification classifications are defined as numeric, any identification classification in the initial data that is not numeric is cleared; if they are defined as consisting entirely of letters, any identification classification that contains non-letter characters is cleared; and if they are defined as having a fixed number of digits, any identification classification that does not have that number of digits is cleared.
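The clearing rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names (`is_numeric`, `is_all_letters`, `has_fixed_length`, `clear_abnormal`) are hypothetical, a missing classification is represented as `None`, and "cleared" is read as "set to `None`" rather than dropping the row.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]

# Hypothetical validity rules, one per definition mentioned in the text.
def is_numeric(value: str) -> bool:
    return value.isdigit()

def is_all_letters(value: str) -> bool:
    return value.isalpha()

def has_fixed_length(length: int) -> Callable[[str], bool]:
    return lambda value: len(value) == length

def clear_abnormal(rows: List[Row], is_valid: Callable[[str], bool]) -> List[Row]:
    """Keep the identification name (column 0); set invalid classifications to None."""
    cleaned = []
    for name, *levels in rows:
        cleaned.append(
            [name] + [v if v is not None and is_valid(v) else None for v in levels]
        )
    return cleaned

rows = [["A", "101", "20x", "303"], ["B", "101", None, "303"]]
cleaned = clear_abnormal(rows, is_numeric)
# "20x" is not numeric, so it is cleared; already-missing values stay missing.
```

Passing the rule as a function keeps the clearing loop independent of which definition (numeric, all letters, fixed length) applies to a given data source.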
Specifically, suppose the classification initial data has four classification levels: [A, ID1, ID2, ID3, ID4; B, ID1, ID3, ID4; C, ID1, ID2, ID4; D, ID1, ID3], where ID1 to ID4 are the first-level to fourth-level identification classifications and A to D are the identification names (i.e., commodity names). In specific implementation, the identification name may be the name of any data object; for example, it may be the names of different commodities, or the names of different products or devices. An identification classification is the classification identifier corresponding to a given classification level, and it may be expressed as an ID or in the form of text, numbers, and the like.
In specific implementation, expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to combinations of different numbers of the identification classifications, obtaining a plurality of arrays corresponding to each identification name, forming a first classification set from the arrays corresponding to all the identification names, and de-duplicating the arrays of the first classification set to generate a second classification set is implemented by the following steps:
taking each row of the classification initial data as a unit, expanding the multi-level identification classifications corresponding to the identification name of the row into a plurality of arrays with an increasing number of columns according to the classification level, the arrays corresponding to the identification name forming a sub-classification set, wherein the number of arrays in the sub-classification set equals the number of levels of the multi-level identification classifications; and forming the first classification set from the sub-classification sets corresponding to the identification names of all the rows in the classification initial data.
Specifically, the multi-level identification classifications corresponding to each identification name are expanded, level by level, into a plurality of arrays with an increasing number of columns. For example, take classification initial data in which commodities A, B, C, and D each carry up to four levels of identification classification:
[A,ID1,ID2,ID3,ID4;B,ID1,ID3,ID4;C,ID1,ID2,ID4;D,ID1,ID3],
the data of the first classification set after expansion is as follows:
(ID1);(ID1,ID2);(ID1,ID2,ID3);(ID1,ID2,ID3,ID4);(ID1);(ID1,ID3);(ID1,ID3,ID4);(ID1);(ID1,ID2);(ID1,ID2,ID4);(ID1);(ID1,ID3).
Specifically, the data of the second classification set generated after the arrays of the first classification set are de-duplicated is as follows:
(ID1);(ID1,ID2);(ID1,ID2,ID3);(ID1,ID2,ID3,ID4);(ID1,ID3);(ID1,ID3,ID4);(ID1,ID2,ID4).
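The expansion and de-duplication just described can be illustrated with a short Python sketch. The helper names (`expand_row`, `first_and_second_sets`) are hypothetical, and the sketch assumes that each array is a growing prefix of the non-missing classifications in a row, which is the reading that reproduces the listing above.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]
Array = Tuple[str, ...]

def expand_row(row: Row) -> List[Array]:
    """Return the growing prefixes of the non-missing classifications in one row."""
    present = [v for v in row[1:] if v is not None]
    return [tuple(present[:n]) for n in range(1, len(present) + 1)]

def first_and_second_sets(rows: List[Row]) -> Tuple[List[Array], List[Array]]:
    # First classification set: all arrays from all rows, duplicates included.
    first = [array for row in rows for array in expand_row(row)]
    # Second classification set: de-duplicated, keeping first-seen order.
    second = list(dict.fromkeys(first))
    return first, second

rows = [
    ["A", "ID1", "ID2", "ID3", "ID4"],
    ["B", "ID1", "ID3", "ID4"],
    ["C", "ID1", "ID2", "ID4"],
    ["D", "ID1", "ID3"],
]
first_set, second_set = first_and_second_sets(rows)
```

Tuples are used for the arrays so that they are hashable, which lets `dict.fromkeys` remove duplicates while preserving order.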
In specific implementation, forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set is implemented by the following steps:
the following steps are executed in a loop for each array in the second classification set, and the loop ends when all arrays in the second classification set have been processed: reading the last element of the current array, and determining whether the last element is a valid identification classification; if so, taking the valid identification classification as the target classification; determining the column position of the target classification in the current array, and taking that column number as the classification level of the target classification; and forming a triplet with the target classification as the first element, the classification level of the target classification as the second element, and the current array as the third element.
Specifically, after the above classification initial data is processed, the triplets are as follows:
(ID1,1,Bean(ID1));(ID2,2,Bean(ID1,ID2));(ID3,3,Bean(ID1,ID2,ID3));(ID4,4,Bean(ID1,ID2,ID3,ID4));(ID3,2,Bean(ID1,ID3));(ID4,3,Bean(ID1,ID3,ID4));(ID4,3,Bean(ID1,ID2,ID4)).
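As a hedged illustration of the triplet construction, the following Python sketch builds the third classification set from the second classification set of the example. The name `build_triplets` is hypothetical, plain tuples stand in for the `Bean(...)` objects, and "valid" is assumed to mean "neither `None` nor an empty string".

```python
from typing import List, Optional, Tuple

Array = Tuple[Optional[str], ...]
Triplet = Tuple[str, int, Array]

def build_triplets(second_set: List[Array]) -> List[Triplet]:
    triplets: List[Triplet] = []
    for array in second_set:
        last = array[-1] if array else None
        if not last:
            # Assumption: a missing or empty last element is not a valid classification.
            continue
        # The target classification is the last element, so its column position
        # in the array equals the array length.
        level = len(array)
        triplets.append((last, level, array))
    return triplets

second_set = [
    ("ID1",),
    ("ID1", "ID2"),
    ("ID1", "ID2", "ID3"),
    ("ID1", "ID2", "ID3", "ID4"),
    ("ID1", "ID3"),
    ("ID1", "ID3", "ID4"),
    ("ID1", "ID2", "ID4"),
]
third_set = build_triplets(second_set)
```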
In specific implementation, determining the maximum classification level number corresponding to each identification name in the third classification set, and taking the array included in the triplet having the maximum classification level number as a reference classification, is implemented by the following steps:
dividing the triplets in the third classification set that share the same first element into one group, to obtain a plurality of groups; in each group, retaining the triplet whose second element (i.e., the classification level of the target classification) is largest and deleting the other triplets; merging the triplets remaining in all the groups to generate a fourth classification set; and taking the array included in each triplet of the fourth classification set as a reference classification.
Specifically, after only the triplet with the largest second element (i.e., the classification level of the target classification) is retained in each group, the data of the fourth classification set formed by the remaining triplets is as follows:
(ID1,1,Bean(ID1));(ID2,2,Bean(ID1,ID2));(ID3,3,Bean(ID1,ID2,ID3));(ID4,4,Bean(ID1,ID2,ID3,ID4)).
Specifically, the arrays included in the triplets of the fourth classification set are used as the reference classification; that is, the data of the reference classification after conversion is as follows:
Bean(ID1);Bean(ID1,ID2);Bean(ID1,ID2,ID3);Bean(ID1,ID2,ID3,ID4).
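The grouping step can be sketched in Python as below. The function name `reference_classification` is hypothetical, and one detail is an assumption not stated in the text: when two triplets in a group share the same maximum level, the sketch keeps the one seen first.

```python
from typing import Dict, List, Tuple

Array = Tuple[str, ...]
Triplet = Tuple[str, int, Array]

def reference_classification(
    third_set: List[Triplet],
) -> Tuple[List[Triplet], Dict[str, Array]]:
    # Group by the first element and keep the triplet with the largest level.
    best: Dict[str, Triplet] = {}
    for target, level, array in third_set:
        if target not in best or level > best[target][1]:
            best[target] = (target, level, array)
    fourth_set = list(best.values())
    # Reference classification: target classification -> deepest array.
    reference = {target: array for target, _, array in fourth_set}
    return fourth_set, reference

third_set = [
    ("ID1", 1, ("ID1",)),
    ("ID2", 2, ("ID1", "ID2")),
    ("ID3", 3, ("ID1", "ID2", "ID3")),
    ("ID4", 4, ("ID1", "ID2", "ID3", "ID4")),
    ("ID3", 2, ("ID1", "ID3")),
    ("ID4", 3, ("ID1", "ID3", "ID4")),
    ("ID4", 3, ("ID1", "ID2", "ID4")),
]
fourth_set, reference = reference_classification(third_set)
```

Keying the reference classification by its target classification makes the later lookup by a row's last valid classification a single dictionary access.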
In specific implementation, filling the identification classifications missing from the classification initial data by using the reference classification, and generating the filled data, is implemented by the following steps:
the following steps are executed in a loop for each row in the classification initial data, and the loop ends when all rows in the classification initial data have been processed: reading the last element of the current row, and determining whether the last element is a valid identification classification; if so, taking the valid identification classification as the non-missing classification; searching the reference classification for the array corresponding to the non-missing classification, and taking that array as the filling base classification; and replacing all the multi-level identification classifications in the current row with the filling base classification.
Specifically, after the identification classifications of each row in the classification initial data are filled based on the reference classification, the filled data is as follows:
[A,ID1,ID2,ID3,ID4;B,ID1,ID2,ID3,ID4;C,ID1,ID2,ID3,ID4;D,ID1,ID2,ID3,ID4].
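A minimal Python sketch of the filling loop follows. It is an illustration under stated assumptions, not the patented implementation: `fill_rows` is a hypothetical name, the reference classification is assumed to be a mapping from a classification to its deepest array, and "last element" is read as the last non-missing classification in the row. Under this literal reading a row is filled only down to the level of its last valid classification, so a row ending at ID3 (row D) comes out as ID1, ID2, ID3 in the sketch, whereas the listing above shows four levels for D.

```python
from typing import Dict, List, Optional, Tuple

Row = List[Optional[str]]

def fill_rows(rows: List[Row], reference: Dict[str, Tuple[str, ...]]) -> List[Row]:
    filled: List[Row] = []
    for row in rows:
        name, levels = row[0], row[1:]
        valid = [v for v in levels if v]
        last = valid[-1] if valid else None
        if last in reference:
            # Replace every classification in the row with the reference array.
            filled.append([name, *reference[last]])
        else:
            # No usable anchor: leave the row unchanged (assumption).
            filled.append(list(row))
    return filled

reference = {
    "ID1": ("ID1",),
    "ID2": ("ID1", "ID2"),
    "ID3": ("ID1", "ID2", "ID3"),
    "ID4": ("ID1", "ID2", "ID3", "ID4"),
}
rows = [
    ["A", "ID1", "ID2", "ID3", "ID4"],
    ["B", "ID1", "ID3", "ID4"],
    ["C", "ID1", "ID2", "ID4"],
    ["D", "ID1", "ID3"],
]
filled = fill_rows(rows, reference)
```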
Specifically, after the classification initial data is filled, the filled data is saved into the corresponding table of a data warehouse by using a structured data processing tool. Such a tool processes the data quickly, and once the filled classification data is stored in the corresponding table of the data warehouse, it can be read at any time and used to supplement subsequent commodity data.
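The text does not name the structured data processing tool or the warehouse. Purely as a stand-in, the sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name `category_filled`, its column names, and the sample rows are all hypothetical.

```python
import sqlite3
from typing import List, Optional

def pad(row: List[Optional[str]], width: int = 5) -> List[Optional[str]]:
    """Pad a row with None so that every row has the same number of columns."""
    return row + [None] * (width - len(row))

# Hypothetical filled rows: identification name followed by up to four levels.
filled = [
    ["A", "ID1", "ID2", "ID3", "ID4"],
    ["B", "ID1", "ID2", "ID3", "ID4"],
    ["D", "ID1", "ID2", "ID3"],
]

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
conn.execute(
    "CREATE TABLE category_filled ("
    "name TEXT PRIMARY KEY, level1 TEXT, level2 TEXT, level3 TEXT, level4 TEXT)"
)
conn.executemany(
    "INSERT INTO category_filled VALUES (?, ?, ?, ?, ?)",
    [pad(row) for row in filled],
)
conn.commit()

# Later, the classification of a commodity can be read back by name.
lookup = conn.execute(
    "SELECT level1, level2, level3, level4 FROM category_filled WHERE name = ?",
    ("B",),
).fetchone()
```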
In this embodiment, a computer device is provided, as shown in fig. 2, including a memory 201, a processor 202, and a computer program stored on the memory and capable of running on the processor, where the processor implements any of the above-described filling methods of missing classification data when executing the computer program.
In particular, the computer device may be a computer terminal, a server or similar computing means.
In the present embodiment, there is provided a computer-readable storage medium storing a computer program for executing the filling method of any of the above-described missing classification data.
In particular, computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable storage media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Based on the same inventive concept, the embodiment of the application also provides a filling device for missing classification data, as described in the following embodiment. The principle of solving the problem of the filling device of the missing classified data is similar to that of the filling method of the missing classified data, so that the implementation of the filling device of the missing classified data can be referred to the implementation of the filling method of the missing classified data, and the repeated parts are not repeated. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a filling device for missing classification data according to an embodiment of the present application. As shown in fig. 3, the device includes a classification initial data acquisition module 301, a classification data expansion module 302, a classification layer number statistics module 303, a reference classification generation module 304, and a data filling module 305. The structure is described below.
The classification initial data acquisition module 301 is configured to acquire classification initial data and delete abnormal data in the classification initial data, where the first column of the classification initial data is an identification name, the second column to the last column are the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of objects of the same kind, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
the classification data expansion module 302 is configured to expand the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to combinations of different numbers of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, to form a first classification set from the arrays corresponding to all the identification names, and to de-duplicate the arrays of the first classification set to generate a second classification set;
the classification layer number statistics module 303 is configured to form a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, where the triplets form a third classification set, and the third classification set is used to count the classification level corresponding to each identification classification;
the reference classification generation module 304 is configured to determine the maximum classification level number corresponding to each identification name in the third classification set, and to take the array included in the triplet having the maximum classification level number as a reference classification;
the data filling module 305 is configured to fill the identifier classification missing in the classification initial data by using the reference classification, and generate filled data.
In one embodiment, the classification data expansion module comprises:
a classification initial data expansion unit, configured to take each row of the classification initial data as a unit and expand the multi-level identification classifications corresponding to the identification name of the row into a plurality of arrays with an increasing number of columns according to the classification level, the arrays corresponding to each identification name forming a sub-classification set, wherein the number of arrays in the sub-classification set equals the number of levels of the multi-level identification classifications;
and the first classification set generating unit is used for forming a first classification set from the sub classification sets corresponding to the identification names of all the rows in the classification initial data.
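The expansion and de-duplication described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each row is a sequence whose first item is the identification name and whose remaining items are the classifications in level order, and that the expanded arrays hold classifications only. The names `expand_row` and `build_class_sets` and the sample rows are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[str]        # (identification name, level-1 class, level-2 class, ...)
Path = Tuple[str, ...]     # one expanded array of classifications


def expand_row(row: Row) -> List[Path]:
    """Expand one row into arrays with an increasing number of columns."""
    classes = tuple(row[1:])  # column 1 is the identification name, so skip it
    # One array per level: the first k classifications, for k = 1..number of levels.
    return [classes[:k] for k in range(1, len(classes) + 1)]


def build_class_sets(rows: Sequence[Row]) -> Tuple[List[Path], List[Path]]:
    """Return (first classification set, de-duplicated second classification set)."""
    first_set = [path for row in rows for path in expand_row(row)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    second_set = list(dict.fromkeys(first_set))
    return first_set, second_set


# Hypothetical sample data: the second row lacks its top-level classification.
rows = [
    ("apple A", "Food", "Fruit", "Apple"),
    ("apple B", "Fruit", "Apple"),
    ("pear C", "Food", "Fruit", "Pear"),
]
first_set, second_set = build_class_sets(rows)
```

With this sample, the sub-classification set of each row has as many arrays as the row has levels, and the arrays shared by "apple A" and "pear C" appear only once in the second set.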
In one embodiment, the classification layer number statistics module includes:
a second classification set array loop unit, configured to execute the following steps in a loop for each array in the second classification set, and end the loop when all arrays in the second classification set have been processed;
a valid identification classification judging unit, configured to read the last element in the current array and judge whether the last element is a valid identification classification;
a target classification determining unit, configured to, if so, take the valid identification classification as a target classification;
a classification level calculation unit, configured to determine the column number of the target classification in the current array and take the column number as the classification level of the target classification;
and a triplet generating unit, configured to form a triplet with the target classification as the first element, the classification level of the target classification as the second element, and the current array as the third element.
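A minimal sketch of the triplet construction performed by these units is given below. The validity rule (`is_valid`, a non-blank string) is an assumption, since the text does not define what makes an identification classification valid; `build_triplets` and the sample arrays are likewise hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Path = Tuple[Optional[str], ...]
Triplet = Tuple[str, int, Path]  # (target classification, classification level, array)


def is_valid(value: Optional[str]) -> bool:
    """Assumed validity rule: a non-empty, non-blank string."""
    return isinstance(value, str) and value.strip() != ""


def build_triplets(second_set: Sequence[Path]) -> List[Triplet]:
    """Form one triplet per array whose last element is a valid classification."""
    triplets: List[Triplet] = []
    for path in second_set:
        if not path or not is_valid(path[-1]):
            continue  # arrays ending in a missing value yield no triplet
        target = path[-1]
        level = len(path)  # 1-based column of the last element in the array
        triplets.append((target, level, path))
    return triplets


# Hypothetical second classification set; the last array ends in a missing value.
second_set = [
    ("Food",),
    ("Food", "Fruit"),
    ("Fruit", "Apple"),
    ("Fruit", "Apple", None),
]
third_set = build_triplets(second_set)
```

Because the target classification is always the last element of its array, its column number equals the array length, which is why `len(path)` serves as the classification level here.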
In one embodiment, the reference class generation module includes:
a triplet grouping unit, configured to divide the triplets having the same first element in the third classification set into one group, to obtain a plurality of groups;
a maximum triplet determining unit, configured to retain, in each group, the triplet with the largest second element and delete the other triplets;
a fourth classification set generating unit, configured to merge the triplets remaining in all the groups and generate a fourth classification set;
and a reference classification generating unit, configured to take the array included in each triplet in the fourth classification set as a reference classification.
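The grouping and maximum selection can be illustrated with the sketch below. It assumes triplets of the form (target classification, level, array). The text does not say how ties on the level are resolved, so keeping the first triplet seen is an assumption; `build_reference` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]
Triplet = Tuple[str, int, Path]  # (target classification, classification level, array)


def build_reference(third_set: Sequence[Triplet]) -> List[Path]:
    """Keep, per target classification, the array with the largest level."""
    # Group triplets that share the same first element.
    groups: Dict[str, List[Triplet]] = defaultdict(list)
    for triplet in third_set:
        groups[triplet[0]].append(triplet)
    # In each group keep only the triplet with the largest second element.
    # max() returns the first maximal item, so ties keep the first triplet seen.
    fourth_set = [max(group, key=lambda t: t[1]) for group in groups.values()]
    # The array inside each remaining triplet is a reference classification.
    return [path for _, _, path in fourth_set]


# Hypothetical third classification set.
third_set = [
    ("Fruit", 1, ("Fruit",)),
    ("Fruit", 2, ("Food", "Fruit")),
    ("Apple", 2, ("Fruit", "Apple")),
    ("Apple", 3, ("Food", "Fruit", "Apple")),
    ("Food", 1, ("Food",)),
]
reference = build_reference(third_set)
```

In this sample, "Apple" appears at level 2 and at level 3, so the deeper array ("Food", "Fruit", "Apple") is the one retained as its reference classification.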
In one embodiment, the data filling module includes:
a classification initial data loop unit, configured to execute the following steps in a loop for each row in the classification initial data, and end the loop when all rows in the classification initial data have been processed;
a valid identification classification judging unit, configured to read the last element in the current row and judge whether the last element is a valid identification classification;
a non-missing classification determining unit, configured to, if so, take the valid identification classification as a non-missing classification;
a filling base classification searching unit, configured to search the reference classification for the corresponding array by means of the non-missing classification, as a filling base classification;
and a data filling unit, configured to replace all the multi-level identification classifications in the current row with the filling base classification.
In one embodiment, the filling base classification searching unit is configured to execute the following steps in a loop for each array in the reference classification, and end the loop when all arrays in the reference classification have been processed: reading the element in the last column of the current array, and judging whether the element is consistent with the non-missing classification; if so, taking the current array as the filling base classification.
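The row-by-row filling and the lookup by last column can be sketched as follows. Two points are assumptions, because the text only specifies the "if so" branch: rows whose last element is not valid, or for which no reference array matches, are left unchanged. The names `is_valid`, `find_fill_base` and `fill_rows` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Path = Tuple[str, ...]
Row = Sequence[Optional[str]]  # (identification name, classifications...)


def is_valid(value: Optional[str]) -> bool:
    """Assumed validity rule: a non-empty, non-blank string."""
    return isinstance(value, str) and value.strip() != ""


def find_fill_base(reference: Sequence[Path], non_missing: str) -> Optional[Path]:
    """Return the reference array whose last column equals the non-missing class."""
    for path in reference:
        if path[-1] == non_missing:
            return path
    return None


def fill_rows(rows: Sequence[Row], reference: Sequence[Path]) -> List[tuple]:
    """Replace each row's classifications with the matching reference array."""
    filled = []
    for row in rows:
        last = row[-1]
        base = None
        if len(row) > 1 and is_valid(last):
            base = find_fill_base(reference, last)
        # Assumption: rows without a usable match are kept as they are.
        filled.append((row[0], *base) if base is not None else tuple(row))
    return filled


# Hypothetical reference classification and rows.
reference = [("Food", "Fruit"), ("Food", "Fruit", "Apple")]
rows = [
    ("apple B", "Fruit", "Apple"),
    ("mystery", "Toys"),
    ("apple A", "Food", "Fruit", "Apple"),
]
filled = fill_rows(rows, reference)
```

Here "apple B" gains the missing top level "Food", while "mystery" stays untouched because no reference array ends in "Toys".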
In one embodiment, the filling device for missing classification data further includes: a data warehouse storage module, configured to store the filled data into a corresponding table of a data warehouse by using a structured data processing tool after the classification initial data has been filled.
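The storage step might look like the sketch below. The text names neither the data warehouse nor the structured data processing tool, so Python's built-in `sqlite3` module with an in-memory database is used purely as a stand-in. The table name, the column names and the function `save_filled` are hypothetical, and short rows are padded with NULL so that every row fits a fixed number of level columns.

```python
import sqlite3
from typing import Optional, Sequence, Tuple


def save_filled(
    conn: sqlite3.Connection,
    table: str,
    filled: Sequence[Tuple[Optional[str], ...]],
    levels: int,
) -> None:
    """Write filled rows into a table with one name column and `levels` class columns."""
    cols = ["name"] + [f"level_{i}" for i in range(1, levels + 1)]
    # The table name is interpolated directly, so it must come from trusted code.
    col_defs = ", ".join(f"{c} TEXT" for c in cols)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
    placeholders = ", ".join("?" for _ in cols)
    # Pad rows that have fewer classifications than the table has level columns.
    padded = [tuple(row) + (None,) * (len(cols) - len(row)) for row in filled]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", padded)
    conn.commit()


# Stand-in for the data warehouse: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
filled = [("apple B", "Food", "Fruit", "Apple"), ("mystery", "Toys")]
save_filled(conn, "commodity_class", filled, levels=3)
stored = conn.execute("SELECT * FROM commodity_class ORDER BY name").fetchall()
```

A production warehouse would normally be written through its own loader or SQL engine; the point of the sketch is only that the filled rows end up in a table with the same column layout as the classification initial data.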
The embodiment of the application realizes the following technical effects:
The multi-level identification classifications corresponding to the identification name of each row in the classification initial data are expanded into a plurality of arrays according to different number combinations of the identification classifications, and a second classification set is generated after de-duplication, so that all possible classification hierarchy relations are obtained. A triplet is formed from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, so as to count the classification level corresponding to each identification classification. The deepest level of each identification classification is found by determining the maximum classification level, and the array at that deepest level is taken as the reference classification. The missing classification data is then filled using the reference classification. In this way, missing classification data is filled on the basis of the classification initial data itself, which solves the problem of inaccurate commodity data caused by missing commodity classifications and keeps the commodity data accurate.
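As a worked illustration of this effect, the compact sketch below runs the whole flow on a small hypothetical commodity table. It rests on several assumptions: rows are tuples of an identification name followed by classifications, a valid classification is a non-blank string, ties on the level keep the first array seen, and rows without a match are left unchanged. The function name `fill_missing` is hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Path = Tuple[Optional[str], ...]


def valid(value: Optional[str]) -> bool:
    """Assumed validity rule: a non-empty, non-blank string."""
    return isinstance(value, str) and value.strip() != ""


def fill_missing(rows: Sequence[Sequence[Optional[str]]]) -> List[tuple]:
    # Expand every row into arrays of increasing length, then de-duplicate.
    second_set = list(dict.fromkeys(
        tuple(row[1:k + 1]) for row in rows for k in range(1, len(row))
    ))
    # Per classification, keep the deepest array that ends with it.
    reference: Dict[str, Path] = {}
    for path in second_set:
        last = path[-1]
        if valid(last) and len(path) > len(reference.get(last, ())):
            reference[last] = path
    # Replace each row's classifications with the deepest matching array.
    filled = []
    for row in rows:
        usable = len(row) > 1 and valid(row[-1])
        base = reference.get(row[-1]) if usable else None
        filled.append((row[0], *base) if base else tuple(row))
    return filled


rows = [
    ("apple A", "Food", "Fruit", "Apple"),
    ("apple B", "Fruit", "Apple"),          # top level missing
    ("pear C", "Pear"),                     # two upper levels missing
    ("pear D", "Food", "Fruit", "Pear"),
]
result = fill_missing(rows)
```

In this sample, "apple B" and "pear C" both receive the full path that was learned from the complete rows "apple A" and "pear D".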
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the application described above may be implemented on a general-purpose computing device, and may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that shown or described herein. They may also be fabricated separately into individual integrated circuit modules, or a plurality of the modules or steps may be fabricated into a single integrated circuit module. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. A method for filling missing classification data, comprising:
acquiring classification initial data and deleting abnormal data from the classification initial data, wherein the first column of the classification initial data is an identification name, the second column to the last column of the classification initial data are respectively the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of similar objects, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to different number combinations of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, forming a first classification set from the arrays corresponding to all the identification names, and generating a second classification set after de-duplicating the arrays of the first classification set;
forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, wherein all the triplets form a third classification set, and the third classification set is used for counting the classification level corresponding to each identification classification;
determining the maximum classification level corresponding to each identification name in the third classification set, and taking the array included in the triplet with the maximum classification level as a reference classification;
filling the identification classification missing in the classification initial data by using the reference classification to generate filled data;
wherein expanding the multi-level identification classifications corresponding to the identification name of each row in the classification initial data into a plurality of arrays according to different number combinations of the identification classifications, to obtain a plurality of arrays corresponding to each identification name, and forming a first classification set from the arrays corresponding to all the identification names, comprises:
taking each row of the classification initial data as a unit, expanding the multi-level identification classifications corresponding to the identification name of each row into a plurality of arrays with an increasing number of columns according to the classification level, wherein the arrays corresponding to each identification name form a sub-classification set, and the number of arrays in the sub-classification set is the same as the number of levels of the multi-level identification classification;
the sub-classification sets corresponding to the identification names of all rows in the classification initial data are formed into a first classification set;
wherein filling the identification classifications missing from the classification initial data by using the reference classification to generate filled data comprises: executing the following steps in a loop for each row in the classification initial data, and ending the loop when all rows in the classification initial data have been processed: reading the last element in the current row, and judging whether the last element is a valid identification classification; if so, taking the valid identification classification as a non-missing classification; searching the reference classification for the corresponding array by means of the non-missing classification, as a filling base classification; and replacing all the multi-level identification classifications in the current row with the filling base classification.
2. The filling method of missing classification data according to claim 1, wherein forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set comprises:
the following steps are circularly executed for each array in the second classification set until the array in the second classification set is processed, and the circulation is ended:
reading the last element in the current array, and judging whether the last element is a valid identification classification;
if yes, taking the effective identification classification as a target classification;
judging the column number of the target classification in the current array, and taking the column number as the classification level of the target classification;
and taking the target classification as a first element, taking the classification level of the target classification as a second element, and taking the current array as a third element to form a triplet.
3. The filling method of missing classification data according to claim 1, wherein searching the reference classification for the corresponding array by means of the non-missing classification, as a filling base classification, comprises:
executing the following steps in a loop for each array in the reference classification, and ending the loop when all arrays in the reference classification have been processed:
reading the element in the last column of the current array, and judging whether the element is consistent with the non-missing classification;
if so, taking the current array as the filling base classification.
4. The filling method of missing classification data according to claim 2, wherein determining the maximum classification level corresponding to each identification name in the third classification set, and taking the array included in the triplet with the maximum classification level as a reference classification, comprises:
dividing the triplets having the same first element in the third classification set into one group, to obtain a plurality of groups;
in each group, retaining the triplet with the largest second element and deleting the other triplets;
merging the triplets remaining in all the groups and generating a fourth classification set;
and taking the array included in each triplet in the fourth classification set as a reference classification.
5. The filling method of missing classification data according to any one of claims 1 to 4, further comprising:
after the classification initial data is filled, the filled data is saved into a corresponding table of a data warehouse by using a structured data processing tool.
6. A filling device for missing classification data, comprising:
a classification initial data acquisition module, configured to acquire classification initial data and delete abnormal data from the classification initial data, wherein the first column of the classification initial data is an identification name, the second column to the last column of the classification initial data are respectively the multi-level identification classifications corresponding to the identification name, the classification levels corresponding to the second column to the last column increase in sequence, the identification names in the classification initial data are different names of similar objects, and each row of the classification initial data contains one identification name and the multi-level identification classifications corresponding to that identification name;
the classification data unfolding module is used for unfolding the multi-stage identification classifications corresponding to the identification names of each row in the classification initial data into a plurality of arrays according to different number combinations of the identification classifications to obtain a plurality of arrays corresponding to each identification name, forming a first classification set by the plurality of arrays corresponding to all the identification names, and generating a second classification set after de-duplicating the arrays of the first classification set;
the classification layer number statistics module is used for forming a triplet from the identification name, the classification level corresponding to the identification classification in the second classification set, and each array in the second classification set, wherein all the triplets form a third classification set, and the third classification set is used for counting the classification level corresponding to each identification classification;
the reference classification generation module is used for judging the maximum classification level number corresponding to each identification name in the third classification set, and taking an array included in the triplet where the maximum classification level number is located as a reference classification;
the data filling module is used for filling the identification classification missing in the classification initial data by utilizing the reference classification to generate filled data;
the classified data expansion module comprises:
a classification initial data expanding unit, configured to take each row of the classification initial data as a unit and expand the multi-level identification classifications corresponding to the identification name of each row into a plurality of arrays with an increasing number of columns according to the classification level, wherein the arrays corresponding to each identification name form a sub-classification set, and the number of arrays in the sub-classification set is the same as the number of levels of the multi-level identification classification;
a first classification set generating unit, configured to form a first classification set from the sub-classification sets corresponding to the identification names of all the rows in the classification initial data;
the data filling module comprises:
a classification initial data loop unit, configured to execute the following steps in a loop for each row in the classification initial data, and end the loop when all rows in the classification initial data have been processed;
a valid identification classification judging unit, configured to read the last element in the current row and judge whether the last element is a valid identification classification;
a non-missing classification determining unit, configured to, if so, take the valid identification classification as a non-missing classification;
a filling base classification searching unit, configured to search the reference classification for the corresponding array by means of the non-missing classification, as a filling base classification;
and a data filling unit, configured to replace all the multi-level identification classifications in the current row with the filling base classification.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the filling method of missing classification data according to any of claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that performs the filling method of missing classification data according to any one of claims 1 to 5.
CN202310869517.3A 2023-07-17 2023-07-17 Filling method, filling device, computer equipment and medium for missing classification data Active CN116610662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310869517.3A CN116610662B (en) 2023-07-17 2023-07-17 Filling method, filling device, computer equipment and medium for missing classification data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310869517.3A CN116610662B (en) 2023-07-17 2023-07-17 Filling method, filling device, computer equipment and medium for missing classification data

Publications (2)

Publication Number Publication Date
CN116610662A CN116610662A (en) 2023-08-18
CN116610662B true CN116610662B (en) 2023-10-03

Family

ID=87680368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310869517.3A Active CN116610662B (en) 2023-07-17 2023-07-17 Filling method, filling device, computer equipment and medium for missing classification data

Country Status (1)

Country Link
CN (1) CN116610662B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335200A (en) * 2018-05-03 2018-07-27 湖南大学 A kind of credit rating method that feature based is chosen
CN110659268A (en) * 2019-08-15 2020-01-07 中国平安财产保险股份有限公司 Data filling method and device based on clustering algorithm and computer equipment
CN110766030A (en) * 2018-07-25 2020-02-07 北京国双科技有限公司 Method and device for determining missing value processing mode
CN111414353A (en) * 2020-02-29 2020-07-14 平安科技(深圳)有限公司 Intelligent missing data filling method and device and computer readable storage medium
CN113159154A (en) * 2021-04-12 2021-07-23 浙江工业大学 Time series characteristic reconstruction and dynamic identification method for crop classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070214133A1 (en) * 2004-06-23 2007-09-13 Edo Liberty Methods for filtering data and filling in missing data using nonlinear inference

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
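APT-KNN: an efficient missing-value filling algorithm for classification problems; 徐宇明; 陈诚; 熊; 朱扬勇; Computer Applications and Software (04); full text *
A missing-data filling algorithm based on EM and Bayesian networks; 李宏; 阿玛尼; 李平; 吴敏; Computer Engineering and Applications (05); full text *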

Also Published As

Publication number Publication date
CN116610662A (en) 2023-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant