CN110750685A - Data warehousing method and device - Google Patents

Info

Publication number
CN110750685A
Authority
CN
China
Prior art keywords
data
warehousing
network data
sample
generate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910936699.5A
Other languages
Chinese (zh)
Other versions
CN110750685B (en)
Inventor
王静云
郭景赞
田元兵
乔自知
郭省力
李德屹
李京辉
金雨超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN201910936699.5A priority Critical patent/CN110750685B/en
Publication of CN110750685A publication Critical patent/CN110750685A/en
Application granted granted Critical
Publication of CN110750685B publication Critical patent/CN110750685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data warehousing method and device, and relates to the field of computers. By adopting a multi-dimensional hierarchical clustering algorithm, network data can be warehoused automatically and the operation and maintenance cost is reduced. The method comprises the following steps: acquiring a network data sample of a user terminal, and extracting the column of data with the fewest categories from the network data sample as a feature sample; performing similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classifying the data in the feature sample according to the similarity matrix to generate new data groups; and inputting the new data groups into the corresponding positions of a database table according to the preset sequence of a hierarchical clustering tree. The embodiment of the invention is applied to a network system.

Description

Data warehousing method and device
Technical Field
The embodiment of the invention relates to the field of computers, in particular to a data storage method and device.
Background
With the development of 5G, the massive volume and type complexity of network data create an urgent need for an intelligent data analysis and processing system. For the raw data reported by each vendor, processing and warehousing the parsed data consumes a large amount of labor and time in order to adapt to different types of network data that change in real time. Current data warehousing operations require manual configuration of the database table structure (the table name, and the fields and field types of the table); however, changes to the table name, table fields, field types, field order, and the like can cause warehousing to fail. Because network data types are diverse and must be handled in real time, existing network data warehousing methods spend a large amount of labor and time constructing the database table structure, have low automation capability, and cannot meet the requirements of intelligent operation and maintenance.
Disclosure of Invention
The embodiment of the invention provides a data storage method and device, which can realize automatic storage of network data by adopting a multi-dimensional hierarchical clustering algorithm and reduce operation and maintenance cost.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, a data warehousing method is provided, the method comprising: acquiring a network data sample of a user terminal, and extracting the column of data with the fewest categories from the network data sample as a feature sample; performing similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classifying the data in the feature sample according to the similarity matrix to generate new data groups; and inputting the new data groups into the corresponding positions of a database table according to the preset sequence of a hierarchical clustering tree.
First, a network data sample of a user terminal is acquired, and the column of data with the fewest categories is extracted from the network data sample as a feature sample, so that the network data is more standardized when it is subsequently stored in the database. Then, similarity calculation is performed on the feature sample to generate a similarity matrix of the feature sample, and the data in the feature sample is classified according to the similarity matrix to generate new data groups, which improves the automation of network data classification. Finally, the new data groups are input into the corresponding positions of the database table according to the preset sequence of the hierarchical clustering tree. This addresses the prior-art problem that constructing the database table structure requires a large amount of labor and time, and reduces the operation and maintenance cost of network data warehousing.
In a second aspect, a data warehousing apparatus is provided, comprising: an acquisition unit, configured to acquire a network data sample of a user terminal; and a processing unit, configured to extract the column of data with the fewest categories from the network data sample acquired by the acquisition unit as a feature sample. The processing unit is further configured to perform similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and to classify the data in the feature sample according to the similarity matrix to generate new data groups. The processing unit is further configured to input the new data groups into the corresponding positions of a database table according to the preset sequence of a hierarchical clustering tree.
It can be understood that, the data warehousing device provided above is configured to execute the method corresponding to the first aspect provided above, and therefore, the beneficial effects that can be achieved by the data warehousing device may refer to the beneficial effects of the method corresponding to the first aspect above and the beneficial effects of the corresponding schemes in the following detailed description, which are not described herein again.
In a third aspect, a data warehousing device is provided, where the data warehousing device structurally includes a processor and a memory, the memory is coupled to the processor and is used to store necessary program instructions and data of the data warehousing device, and the processor is used to execute the program instructions stored in the memory, so that the data warehousing device executes the method of the first aspect.
In a fourth aspect, a computer storage medium is provided, in which computer program code is stored, which, when run on a data-warehousing apparatus, causes the data-warehousing apparatus to perform the method of the first aspect.
A fifth aspect provides a computer program product having stored thereon the above computer software instructions, which, when run on a data-warehousing device, cause the data-warehousing device to execute a program as described above for the method of the first aspect.
Drawings
Fig. 1 is a schematic flow chart of a data warehousing method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a data warehousing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of another data warehousing device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another data warehousing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "e.g.," an embodiment of the present invention is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
It should be noted that, in the embodiments of the present invention, "of", "relevant", and "corresponding" may sometimes be used interchangeably; when the difference is not emphasized, their intended meanings are consistent.
For the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, the words "first", "second", and the like are used for distinguishing the same items or similar items with basically the same functions and actions, and those skilled in the art can understand that the words "first", "second", and the like are not limited in number or execution order.
According to years of research by internet data centers, the global volume of network data doubles approximately every 2 years, the volume of network data generated each year grows exponentially, the growth rate of network data follows Moore's law, and the total global volume of network data is estimated to reach 35 ZB in 2020. How to effectively acquire, load, analyze, and process this massive network data has become an important link in, and the foundation of, big data applications. Existing network data warehousing methods spend a large amount of labor and time constructing the database table structure, so their automation capability is low and they cannot meet the requirements of intelligent operation and maintenance.
To solve the problems in the prior art, referring to Fig. 1, an embodiment of the present invention provides a data warehousing method, including:
101. Acquiring a network data sample of the user terminal, and extracting the column of data with the fewest categories from the network data sample as a feature sample.
Acquiring the network data sample of the user terminal specifically includes: collecting network data of the user terminal, and parsing the network data into a list format of field key key_m and feature value value_mn to generate the network data sample, where m represents a column and n represents a row.
It should be noted that the network data of the user terminal includes one or more of the following: user terminal network configuration data, access network data, core network data, and the like. A key or value may be of a numeric type or of various string types (e.g., Chinese or English).
Illustratively, after the data acquisition module parses and processes a network data file of the user terminal, it generates a unified list format whose data type is key-value pairs, and this key-value list format is input as the data features of the data cleaning module, forming the network data sample [key_m-value_mn], where m represents a column and n represents a row (n can also represent a category). The types of network data file processed include xml, csv, txt, and the like. The key-value pair list [keys-values] is a list composed of a plurality of key-value pairs; for example, the sample in the first row is expressed mathematically as [key_1-value_11, key_2-value_21, key_3-value_31, …, key_m-value_m1]. Each key (field key) represents a data feature of one dimension, and value is the value corresponding to that field key. The network data samples are shown in formula (1-1), where each row represents an input sample and each column represents a one-dimensional feature sample.
[Formula (1-1): matrix of network data samples, one row per sample and one column per field key; image not reproduced]
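As an illustration of the key-value list format described above, the following minimal Python sketch parses csv text into samples made of (key, value) pairs. The function name parse_csv_to_key_value_samples and the two-column input are assumptions made for illustration only; the embodiment does not specify a parser, and xml or txt files would need their own handling.

```python
import csv
import io


def parse_csv_to_key_value_samples(text):
    """Parse csv text into samples; each sample is a list of (key, value) pairs.

    Hypothetical helper: the header row supplies the field keys (key_m),
    and every following row supplies the values (value_mn) of one sample.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [[(key, row[key]) for key in reader.fieldnames] for row in reader]


# Illustrative input using the two fields that appear in this description.
raw = (
    "beginTime,duration\n"
    "2019-06-18T03:00:00+08:00,PTS900\n"
    "2019-06-18T03:00:00+08:00,PTS3600\n"
)

samples = parse_csv_to_key_value_samples(raw)
for sample in samples:
    print(sample)
```

Each printed sample corresponds to one row of the sample matrix, and each position in the list corresponds to one field key.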
Optionally, considering that the collected network data may contain erroneous, invalid, or other such feature values, after the network data of the user terminal is collected the method further includes: cleaning repeated and invalid feature values in the network data by means of a machine learning algorithm.
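The embodiment does not name the machine learning algorithm used for cleaning. Purely to illustrate the effect of this step, the sketch below substitutes a simple rule-based filter that drops samples containing an invalid value and samples that exactly repeat an earlier one. The function clean_samples, the set of invalid markers, and the field names cellId and rsrp are all hypothetical.

```python
def clean_samples(samples, invalid_markers=("", "NULL", "N/A")):
    """Drop samples with invalid values and exact duplicate samples.

    Rule-based stand-in for the cleaning step; not a machine learning method.
    """
    seen = set()
    cleaned = []
    for sample in samples:
        # A sample is invalid if any value is missing or matches a marker.
        if any(
            value is None or str(value).strip() in invalid_markers
            for value in sample.values()
        ):
            continue
        # Use the sorted key-value pairs as a fingerprint to detect repeats.
        fingerprint = tuple(sorted(sample.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(sample)
    return cleaned


# Hypothetical raw samples: one repeat and one invalid value.
raw_samples = [
    {"cellId": "A1", "rsrp": "-95"},
    {"cellId": "A1", "rsrp": "-95"},
    {"cellId": "B2", "rsrp": "NULL"},
    {"cellId": "C3", "rsrp": "-101"},
]

cleaned_samples = clean_samples(raw_samples)
print(cleaned_samples)
```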
102. Performing similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and classifying the data in the feature sample according to the similarity matrix to generate new data groups.
Performing similarity calculation on the feature sample to generate the similarity matrix specifically includes:
performing similarity calculation on the feature sample according to the following formulas (1-2) and (1-3) to generate the similarity matrix of the feature sample:
[Formula (1-2): image not reproduced]
D(x, x) = D(y, y) = 0    (1-3);
where x and y represent different data in the feature sample, respectively.
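Formula (1-2) is not reproduced in this text, so the exact measure is unknown; only the property D(x, x) = 0 from formula (1-3) is given. The following sketch therefore assumes a simple 0/1 mismatch distance (0 when two values are equal, 1 otherwise) merely to show how a pairwise matrix over a feature sample can be built. The function names mismatch_distance and similarity_matrix are hypothetical.

```python
def mismatch_distance(x, y):
    """Assumed stand-in for formula (1-2): 0 if the values are equal, else 1."""
    return 0 if x == y else 1


def similarity_matrix(values, distance=mismatch_distance):
    """Build the pairwise matrix D, where D[i][j] = distance(values[i], values[j])."""
    return [[distance(x, y) for y in values] for x in values]


# Small illustrative feature sample with two distinct values.
feature_sample = ["a", "b", "a"]
matrix = similarity_matrix(feature_sample)
for row in matrix:
    print(row)
```

Under this assumption the diagonal is always 0, consistent with formula (1-3), and entries equal to 0 off the diagonal mark values that belong to the same class. Any other distance function with the same signature could be passed in through the distance parameter.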
103. Inputting the new data groups into the corresponding positions of the database table according to the preset sequence of the hierarchical clustering tree.
It should be noted that the preset sequence of the hierarchical clustering tree may be to input the network data in the hierarchical clustering tree into the database table according to the hierarchical structure of the tree (for example, level by level from top to bottom). Specifically, the network data is input into the database table whose table name, table fields, and field data types are constructed by the hierarchical clustering.
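A top-down traversal of this kind can be sketched with Python's standard sqlite3 module. The node layout (a dictionary with table, fields, rows, and children entries) and the function store_tree are assumptions made for illustration; the table names T03 and T03_PTS900 and the fields beginTime and duration are borrowed from the worked example in this description, and the embodiment does not prescribe a particular database.

```python
import sqlite3


def store_tree(conn, node):
    """Create the node's table, insert its rows, then visit its children (top-down)."""
    columns = ", ".join(
        f'"{name}" {sql_type}' for name, sql_type in node["fields"].items()
    )
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{node["table"]}" ({columns})')
    placeholders = ", ".join("?" for _ in node["fields"])
    conn.executemany(
        f'INSERT INTO "{node["table"]}" VALUES ({placeholders})',
        [tuple(row[name] for name in node["fields"]) for row in node["rows"]],
    )
    for child in node.get("children", []):
        store_tree(conn, child)


# Hypothetical hierarchical clustering tree with one child level.
t03_rows = [
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS3600"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
]
tree = {
    "table": "T03",
    "fields": {"beginTime": "TEXT"},
    "rows": t03_rows,
    "children": [
        {
            "table": "T03_PTS900",
            "fields": {"beginTime": "TEXT", "duration": "TEXT"},
            "rows": [row for row in t03_rows if row["duration"] == "PTS900"],
        }
    ],
}

conn = sqlite3.connect(":memory:")
store_tree(conn, tree)
conn.commit()
print(conn.execute('SELECT COUNT(*) FROM "T03"').fetchone()[0])
```

The table name, field names, and field types all come from the tree node, so no table structure has to be configured by hand before the insert.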
For better understanding, the above method is illustrated below by way of example.
The network data of the user terminal is collected as shown in Table 1.
TABLE 1
[Table 1: collected network data of the user terminal; image not reproduced]
The network data in Table 1 is parsed to generate the key_m-value_mn list format shown in Table 2.
TABLE 2
Sample number    key_1:value_1n    key_2:value_2n
1    beginTime:2019-06-18T03:00:00+08:00    duration:PTS900
2    beginTime:2019-06-18T03:00:00+08:00    duration:PTS3600
3    beginTime:2019-06-18T04:00:00+08:00    duration:PTS1200
4    beginTime:2019-06-18T03:00:00+08:00    duration:PTS900
5    beginTime:2019-06-18T04:00:00+08:00    duration:PTS900
6    beginTime:2019-06-18T04:00:00+08:00    duration:PTS3600
The values value_n corresponding to the key beginTime fall into two categories, whereas the values value_n corresponding to the key duration fall into three categories, so beginTime is the column with the fewest categories. Therefore, taking the key beginTime as an example, the similarity matrix between its corresponding values is computed; the values value_n are shown in Table 3:
TABLE 3
value_1    2019-06-18T03:00:00+08:00
value_2    2019-06-18T03:00:00+08:00
value_3    2019-06-18T04:00:00+08:00
value_4    2019-06-18T03:00:00+08:00
value_5    2019-06-18T04:00:00+08:00
value_6    2019-06-18T04:00:00+08:00
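The choice of beginTime as the feature sample can be checked with a short sketch that counts the distinct values in each column of Table 2 and picks the column with the fewest. The function names count_categories and pick_feature_key are hypothetical; the data is copied from Table 2.

```python
# The six samples of Table 2, one dictionary per sample.
samples = [
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS3600"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS1200"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS3600"},
]


def count_categories(samples):
    """Return the number of distinct values (categories) for every field key."""
    return {key: len({sample[key] for sample in samples}) for key in samples[0]}


def pick_feature_key(samples):
    """Return the field key whose column has the fewest categories."""
    counts = count_categories(samples)
    return min(counts, key=counts.get)


category_counts = count_categories(samples)
feature_key = pick_feature_key(samples)
feature_sample = [sample[feature_key] for sample in samples]

print(category_counts)  # {'beginTime': 2, 'duration': 3}
print(feature_key)      # beginTime
```

The resulting feature_sample list holds the six values listed in Table 3, in sample order.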
Similarity calculation is performed on the data in table 3 according to formulas (1-2) and (1-3), and a matrix shown as follows is generated:
[Similarity matrix of value_1 to value_6; image not reproduced]
Then, the new data groups generated according to the matrix comprise two groups in total: one group is value_1, value_2, and value_4; the other group is value_3, value_5, and value_6. Suppose the preset sequence of the hierarchical clustering tree is to input the network data into the database table from top to bottom. A database table named T03 is then established according to the table name, the table fields, and the data types of the fields, where T03 is the table name, beginTime is the table field, and string is the data type of the beginTime field. This is written as Tablename: T03; Fields: {beginTime: string}, where Tablename is the table name and Fields contains the fields of the table and the corresponding field type information. The network data corresponding to value_1, value_2, and value_4 is then input into the database table named T03, which contains a beginTime field whose data type is string; value_3, value_5, and value_6 are handled in the same way. It should be noted that the above analyzes only beginTime-value_n, so the database table contains only the network data corresponding to beginTime (e.g., table T03 contains only the network data corresponding to beginTime 2019-06-18T03:00:00+08:00).
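The grouping and the resulting table description can be sketched as follows. Because the similarity formula is not reproduced in this text, the sketch simply groups identical values, which yields the two groups stated above. The description does not say how the name T03 is derived; the helper table_name_from_begin_time assumes, purely for illustration, that it is the "T" plus the two-digit hour taken from the timestamp. All function names are hypothetical.

```python
from collections import defaultdict

# The beginTime values of Table 3, in sample order (value_1 to value_6).
values = [
    "2019-06-18T03:00:00+08:00",
    "2019-06-18T03:00:00+08:00",
    "2019-06-18T04:00:00+08:00",
    "2019-06-18T03:00:00+08:00",
    "2019-06-18T04:00:00+08:00",
    "2019-06-18T04:00:00+08:00",
]


def group_by_value(values):
    """Map each distinct value to the 1-based sample numbers that share it."""
    groups = defaultdict(list)
    for number, value in enumerate(values, start=1):
        groups[value].append(number)
    return dict(groups)


def table_name_from_begin_time(value):
    """Assumed naming rule: characters 10-12 of the timestamp, e.g. 'T03'."""
    return value[10:13]


groups = group_by_value(values)
schemas = [
    {
        "Tablename": table_name_from_begin_time(value),
        "Fields": {"beginTime": "string"},
        "samples": numbers,
    }
    for value, numbers in groups.items()
]

for schema in schemas:
    print(schema)
```

The first schema describes table T03 with samples 1, 2, and 4; the second describes a table for the 04:00 group with samples 3, 5, and 6, which under the assumed naming rule would be called T04.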
More specifically, take value_1, value_2, and value_4, the values of the same class obtained by analyzing the value_n corresponding to beginTime, as an example. Further analyzing them according to the above steps, in combination with the network data corresponding to duration in Table 1, divides value_1, value_2, and value_4 into two groups: one group is value_1 and value_4; the other group is value_2. A database table named T03_PTS900 is therefore constructed, containing the beginTime and duration fields, both of data type string. This is written as Tablename: T03_PTS900; Fields: {beginTime: string, duration: string}, where "_" may be replaced with another connecting character. The network data corresponding to value_1 and value_4 is input into this table. Because duration-value_n is analyzed in addition to beginTime-value_n, the table contains both the network data corresponding to beginTime and the network data corresponding to duration (e.g., table T03_PTS900 contains the network data corresponding to beginTime 2019-06-18T03:00:00+08:00 and to duration PTS900).
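The repeated splitting described above can be sketched as a small recursive function: at each level it picks the remaining field with the fewest categories, splits the samples by that field's value, and joins the labels with "_" to form the table name. The functions label and build_tables are hypothetical, identical values are grouped in place of the unreproduced similarity formula, and the rule that turns a beginTime value into "T03" is an assumption inferred from the example.

```python
# The six samples of Table 2, one dictionary per sample.
samples = [
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS3600"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS1200"},
    {"beginTime": "2019-06-18T03:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS900"},
    {"beginTime": "2019-06-18T04:00:00+08:00", "duration": "PTS3600"},
]


def label(key, value):
    """Assumed naming rule: 'T' plus the hour for beginTime, the raw value otherwise."""
    return value[10:13] if key == "beginTime" else value


def build_tables(rows, keys, prefix="", used=()):
    """Recursively split numbered rows and return {table name: description}."""
    tables = {}
    if not keys:
        return tables
    # Pick the remaining field with the fewest distinct values in this group.
    key = min(keys, key=lambda k: len({row[k] for _, row in rows}))
    remaining = [k for k in keys if k != key]
    fields = list(used) + [key]
    groups = {}
    for number, row in rows:
        groups.setdefault(row[key], []).append((number, row))
    for value, members in groups.items():
        name = f"{prefix}_{label(key, value)}" if prefix else label(key, value)
        tables[name] = {"fields": fields, "samples": [n for n, _ in members]}
        tables.update(build_tables(members, remaining, name, fields))
    return tables


tables = build_tables(list(enumerate(samples, start=1)), ["beginTime", "duration"])
for name, description in tables.items():
    print(name, description)
```

With the Table 2 data, the sketch produces a T03 table holding samples 1, 2, and 4 and a T03_PTS900 table holding samples 1 and 4, matching the groups in the example.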
In addition, after a row of data with the least category is extracted from the network data sample and used as the feature sample, and the step 101-.
First, a network data sample of a user terminal is acquired, and the column of data with the fewest categories is extracted from the network data sample as a feature sample, so that the network data is more standardized when it is subsequently stored in the database. Then, similarity calculation is performed on the feature sample to generate a similarity matrix of the feature sample, and the data in the feature sample is classified according to the similarity matrix to generate new data groups, which improves the automation of network data classification. Finally, the new data groups are input into the corresponding positions of the database table according to the preset sequence of the hierarchical clustering tree. This addresses the prior-art problem that constructing the database table structure consumes a large amount of labor and time, and reduces the operation and maintenance cost of network data warehousing.
The embodiment of the present invention may perform the division of the functional modules on the data warehousing device according to the method embodiment, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
In the case of dividing each functional module according to each function, fig. 2 shows a schematic diagram of a possible structure of the data warehousing device 20 related to the above embodiment, where the data warehousing device 20 includes:
an obtaining unit 201, configured to obtain a network data sample of a user terminal.
The processing unit 202 is configured to extract the column of data with the fewest categories from the network data sample acquired by the obtaining unit 201 as a feature sample.
The processing unit 202 is further configured to perform similarity calculation on the feature sample to generate a similarity matrix of the feature sample, and to classify the data in the feature sample according to the similarity matrix to generate new data groups.
The processing unit 202 is further configured to input the new data groups into the corresponding positions of the database table according to the preset sequence of the hierarchical clustering tree.
In an exemplary scheme, the processing unit 202 is specifically configured to perform similarity calculation on the feature samples according to the following formula to generate a similarity matrix of the feature samples:
D(x,x)=D(y,y)=0;
where x and y represent different data in the feature sample, respectively.
In an exemplary scheme, the obtaining unit 201 is specifically configured to collect network data of the user terminal, and parse the network data into a list format of field key key_m and feature value value_mn to generate the network data sample, where m represents a column and n represents a row.
In an exemplary scheme, the processing unit 202 is further configured to clean the network data by means of a machine learning algorithm, so as to remove repeated and invalid feature values from the network data.
Since the data storage device in the embodiment of the present invention may be applied to implement the method embodiment, the technical effect obtained by the data storage device may also refer to the method embodiment, and the embodiment of the present invention is not described herein again.
In the case of an integrated unit, fig. 3 shows a possible structural schematic diagram of the data warehousing device 20 involved in the above embodiment. The data warehousing device 20 includes: a processing module 301, a communication module 302 and a storage module 303. The processing module 301 is used for controlling and managing the actions of the data warehousing device 20, for example, the processing module 301 is used for supporting the data warehousing device 20 to execute the processes 102 and 103 in fig. 1. The communication module 302 is used to support communication between the data warehousing device 20 and other entities. The storage module 303 is used for storing program codes and data of the data warehousing device 20.
The processing module 301 may be a processor or a controller, such as a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like. The communication module 302 may be a transceiver, a transceiving circuit or a communication interface, etc. The storage module 303 may be a memory.
When the processing module 301 is a processor as shown in fig. 4, the communication module 302 is a transceiver as shown in fig. 4, and the storage module 303 is a memory as shown in fig. 4, the data warehousing device 20 according to the embodiment of the present application may be the following data warehousing device 20.
Referring to fig. 4, the data warehousing device 20 includes: a processor 401, a transceiver 402, a memory 403, and a bus 404.
The processor 401, the transceiver 402 and the memory 403 are connected to each other through a bus 404; the bus 404 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The processor 401 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the present invention.
The memory 403 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 403 is used for storing the application program code for executing the scheme of the application, and execution is controlled by the processor 401. The transceiver 402 is configured to receive content input by an external device, and the processor 401 is configured to execute the application program code stored in the memory 403, so as to implement the data warehousing method in the embodiment of the present application.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The embodiment of the present invention further provides a computer program product that can be loaded directly into memory and contains software code; after being loaded and executed by a computer, the computer program product can implement the data warehousing method described above.
The above describes only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method for data warehousing, comprising:
acquiring a network data sample of a user terminal, and extracting, from the network data sample, the column of data with the fewest categories as a feature sample;
performing similarity calculation on the feature samples to generate a similarity matrix of the feature samples, and classifying data in the feature samples according to the similarity matrix to generate a new data group;
and inputting the new data group to a corresponding position of a database table according to a preset sequence of the hierarchical clustering tree.
2. The data warehousing method according to claim 1, wherein performing similarity calculation on the feature samples to generate a similarity matrix of the feature samples specifically includes:
performing similarity calculation on the feature samples according to the following formula to generate a similarity matrix of the feature samples:
Figure FDA0002221770760000011
D(x,x)=D(y,y)=0;
wherein x and y represent different data in the feature sample, respectively.
3. The method according to claim 1, wherein acquiring the network data sample of the user terminal specifically includes:
collecting network data of the user terminal, and parsing the network data into a list format according to the field key_m and the feature value value_mn to generate the network data sample; where m represents a column and n represents a row.
4. The method according to claim 3, further comprising, after collecting the network data of the user terminal:
cleaning repeated and invalid feature values in the network data by using a machine learning algorithm.
5. A data warehousing apparatus, comprising:
an acquisition unit, configured to acquire a network data sample of a user terminal;
a processing unit, configured to extract, from the network data sample acquired by the acquisition unit, the column of data with the fewest categories as a feature sample;
the processing unit is further configured to perform similarity calculation on the feature samples to generate a similarity matrix of the feature samples, and classify data in the feature samples according to the similarity matrix to generate a new data group;
and the processing unit is also used for inputting the new data group to the corresponding position of the database table according to the preset sequence of the hierarchical clustering tree.
6. The data warehousing device of claim 5, wherein:
the processing unit is specifically configured to perform similarity calculation on the feature samples according to the following formula to generate a similarity matrix of the feature samples:
Figure FDA0002221770760000021
D(x,x)=D(y,y)=0;
wherein x and y represent different data in the feature sample, respectively.
7. The data warehousing device of claim 5, wherein:
the acquisition unit is specifically configured to collect network data of the user terminal, and parse the network data into a list format according to the field key_m and the feature value value_mn to generate the network data sample; where m represents a column and n represents a row.
8. The data warehousing device of claim 5, wherein:
the processing unit is further configured to clean repeated and invalid feature values in the network data by using a machine learning algorithm.
9. A data-warehousing device characterized in that the data-warehousing device comprises a processor and a memory, the memory is coupled to the processor and used for storing necessary program instructions and data of the data-warehousing device, and the processor is used for executing the program instructions stored in the memory to make the data-warehousing device execute the data-warehousing method as claimed in any one of claims 1 to 4.
10. A computer storage medium having computer program code stored therein, which when run on a data-warehousing apparatus causes the data-warehousing apparatus to perform the data-warehousing method of any one of claims 1-4.
11. A computer program product having computer software instructions stored thereon for causing a data-warehousing apparatus to perform the data-warehousing method of any one of claims 1-4 when the computer software instructions are run on the data-warehousing apparatus.
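
The flow of claims 1 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. The similarity formula of claim 2 is not reproduced in this text, so a normalized edit distance is swapped in as an assumed stand-in that merely satisfies D(x,x)=0. The field names (imei, network, app), the threshold of 0.25, the use of SQLite, and the choice to store rows group by group are likewise hypothetical. Grouping is done by linking values whose distance is at or below the threshold, which yields the same groups as cutting a single-linkage hierarchical clustering tree at that height.

```python
import sqlite3

def parse_records(records, keys):
    """List format: column m (field key_m) holds value_mn for every row n."""
    return {k: [r[k] for r in records] for k in keys}

def fewest_category_column(columns):
    """Return the key of the column with the fewest distinct values."""
    return min(columns, key=lambda k: len(set(columns[k])))

def edit_distance(x, y):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def distance(x, y):
    """ASSUMED stand-in for the claimed formula; guarantees D(x, x) = 0."""
    return edit_distance(x, y) / max(len(x), len(y), 1)

def distance_matrix(values):
    return [[distance(a, b) for b in values] for a in values]

def group_labels(matrix, threshold):
    """Union-find grouping; equals a single-linkage tree cut at `threshold`."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] <= threshold:
                parent[find(j)] = find(i)
    ids = {}  # number the groups in order of first appearance
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

def store(records, keys, labels, conn):
    """Write rows group by group (assumed reading of the 'preset sequence')."""
    conn.execute(f"CREATE TABLE samples (grp INTEGER, {', '.join(keys)})")
    marks = ", ".join("?" * (len(keys) + 1))
    for i in sorted(range(len(records)), key=lambda i: (labels[i], i)):
        conn.execute(f"INSERT INTO samples VALUES ({marks})",
                     [labels[i]] + [records[i][k] for k in keys])

# Hypothetical network data of user terminals.
KEYS = ["imei", "network", "app"]
RECORDS = [
    {"imei": "A1", "network": "4G", "app": "video"},
    {"imei": "A2", "network": "5G", "app": "chat"},
    {"imei": "A3", "network": "4G", "app": "mail"},
    {"imei": "A4", "network": "5G", "app": "video"},
]

columns = parse_records(RECORDS, KEYS)
feature_key = fewest_category_column(columns)
matrix = distance_matrix(columns[feature_key])
labels = group_labels(matrix, threshold=0.25)
conn = sqlite3.connect(":memory:")
store(RECORDS, KEYS, labels, conn)
stored = conn.execute("SELECT imei, grp FROM samples ORDER BY rowid").fetchall()
print(feature_key, labels, stored)
```

In this sketch the feature column is the one with two distinct values, and rows sharing a feature value end up adjacent in the table. A different distance function can be substituted in `distance` without touching the rest of the flow, provided it still returns 0 for identical values.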
CN201910936699.5A 2019-09-29 2019-09-29 Data warehousing method and device Active CN110750685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910936699.5A CN110750685B (en) 2019-09-29 2019-09-29 Data warehousing method and device

Publications (2)

Publication Number Publication Date
CN110750685A true CN110750685A (en) 2020-02-04
CN110750685B CN110750685B (en) 2022-08-26

Family

ID=69277500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910936699.5A Active CN110750685B (en) 2019-09-29 2019-09-29 Data warehousing method and device

Country Status (1)

Country Link
CN (1) CN110750685B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699796A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Data cleaning method based on data warehouse
CN106776951A (en) * 2016-12-02 2017-05-31 航天星图科技(北京)有限公司 One kind cleaning contrast storage method
CN107168995A (en) * 2017-03-29 2017-09-15 联想(北京)有限公司 A kind of data processing method and server
US20170308557A1 (en) * 2016-04-21 2017-10-26 LeanTaas Method and system for cleansing and de-duplicating data
CN107301210A (en) * 2017-06-06 2017-10-27 福建中经汇通有限责任公司 A kind of data processing method
CN110069495A (en) * 2019-03-13 2019-07-30 中科恒运股份有限公司 Date storage method, device and terminal device

Also Published As

Publication number Publication date
CN110750685B (en) 2022-08-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant